* [RFC PATCH 00/20] Initial Xe driver submission
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Hello,

This is a submission for Xe, a new driver for Intel GPUs that supports both
integrated and discrete platforms starting with Tiger Lake (the first platform
with Intel Xe Architecture). The intention of this new driver is to have a
fresh base to work from that is unencumbered by older platforms, whilst also
taking the opportunity to rearchitect our driver to increase sharing across
the drm subsystem, both leveraging and contributing more towards shared
components like TTM and drm/scheduler. The memory model is based on VM bind,
which is similar to the i915 implementation. Likewise, the execbuf
implementation for Xe is very similar to execbuf3 in the i915 [1].
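
As a rough illustration of the flow under the VM bind model (the call names
below are placeholders for this sketch only, not the actual uAPI):

	/* Hypothetical userspace flow, for illustration only: */
	vm = gpu_vm_create(fd);               /* create a GPU virtual address space */
	bo = gpu_bo_create(fd, size);         /* allocate backing memory            */
	gpu_vm_bind(fd, vm, bo, gpu_va);      /* userspace explicitly binds the VA  */
	gpu_exec(fd, queue, batch_gpu_va);    /* submit against bound VAs; no       */
	                                      /* relocations or execbuf pinning     */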

The code is at a stage where it is already functional and has experimental
support for multiple platforms starting from Tiger Lake, with initial support
implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
as in NEO (for OpenCL and Level Zero). A Mesa MR has been posted [2] and the
NEO implementation will be released publicly early next year. We also have a
suite of IGTs for Xe that will appear on the IGT list shortly.

It has been built from the get-go with the assumption of supporting multiple
architectures; tests are currently running on both x86 and ARM hosts. We
intend to continue working on it and improving it upstream as part of the
kernel community.

The new Xe driver leverages a lot from i915 and work on i915 continues as we
ready Xe for production throughout 2023.

As for display, the intent is to share the display code with the i915 driver
so that there is maximum reuse. Currently this is done by compiling the
display code twice, but alternatives are under consideration and we want to
have more discussion on what the best final solution will look like over the
next few months. Right now, work is ongoing to refactor the display codebase
to remove as many unnecessary dependencies on i915-specific data structures
as possible.

We currently have two submission backends, execlists and GuC. The execlist
backend is meant mostly for testing and is not fully functional, while the
GuC backend is fully functional. As with GuC submission in i915, the GuC
firmware is required by Xe and should be placed in /lib/firmware/xe.

The GuC firmware can be found at the location below:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915

The easiest way to set up the firmware is:
cp -r /lib/firmware/i915 /lib/firmware/xe

The code has been organized such that all patches that touch areas outside
of drm/xe come first for review, with the actual new driver in a separate
commit. The code outside of drm/xe is included in this RFC, while drm/xe is
not due to the size of that commit. The drm/xe code is available in the
public repo listed below.

Xe driver commit:
https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7

Xe kernel repo:
https://cgit.freedesktop.org/drm/drm-xe/

There's a lot of work still to happen on Xe but we're very excited about it and
wanted to share it early and welcome feedback and discussion.

Cheers,
Matthew Brost

[1] https://patchwork.freedesktop.org/series/105879/
[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20418

Maarten Lankhorst (12):
  drm/amd: Convert amdgpu to use suballocation helper.
  drm/radeon: Use the drm suballocation manager implementation.
  drm/i915: Remove gem and overlay frontbuffer tracking
  drm/i915/display: Neuter frontbuffer tracking harder
  drm/i915/display: Add more macros to remove all direct calls to uncore
  drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
  drm/i915: Rename find_section to find_bdb_section
  drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
  drm/i915/display: Fix a use-after-free when intel_edp_init_connector
    fails
  drm/i915/display: Remaining changes to make xe compile
  sound/hda: Allow XE as i915 replacement for sound
  mei/hdcp: Also enable for XE

Matthew Brost (5):
  drm/sched: Convert drm scheduler to use a work queue rather than
    kthread
  drm/sched: Add generic scheduler message interface
  drm/sched: Start run wq before TDR in drm_sched_start
  drm/sched: Submit job before starting TDR
  drm/sched: Add helper to set TDR timeout

Thomas Hellström (3):
  drm/suballoc: Introduce a generic suballocation manager
  drm: Add a gpu page-table walker helper
  drm/ttm: Don't print error message if eviction was interrupted

 drivers/gpu/drm/Kconfig                       |   5 +
 drivers/gpu/drm/Makefile                      |   4 +
 drivers/gpu/drm/amd/amdgpu/Kconfig            |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  26 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  12 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  23 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c        | 320 +-----------------
 drivers/gpu/drm/drm_pt_walk.c                 | 159 +++++++++
 drivers/gpu/drm/drm_suballoc.c                | 301 ++++++++++++++++
 drivers/gpu/drm/i915/Makefile                 |   2 +-
 drivers/gpu/drm/i915/display/hsw_ips.c        |   7 +-
 drivers/gpu/drm/i915/display/i9xx_plane.c     |   1 +
 drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
 .../gpu/drm/i915/display/intel_atomic_plane.c |  25 +-
 .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
 drivers/gpu/drm/i915/display/intel_bios.c     |  71 ++--
 drivers/gpu/drm/i915/display/intel_bw.c       |  36 +-
 drivers/gpu/drm/i915/display/intel_cdclk.c    |  68 ++--
 drivers/gpu/drm/i915/display/intel_color.c    |   1 +
 drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
 drivers/gpu/drm/i915/display/intel_cursor.c   |  14 +-
 drivers/gpu/drm/i915/display/intel_de.h       |  38 +++
 drivers/gpu/drm/i915/display/intel_display.c  | 155 +++++++--
 drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
 .../gpu/drm/i915/display/intel_display_core.h |   5 +-
 .../drm/i915/display/intel_display_debugfs.c  |   8 +
 .../drm/i915/display/intel_display_power.c    |  40 ++-
 .../drm/i915/display/intel_display_power.h    |   6 +
 .../i915/display/intel_display_power_map.c    |   7 +
 .../i915/display/intel_display_power_well.c   |  24 +-
 .../drm/i915/display/intel_display_reg_defs.h |   4 +
 .../drm/i915/display/intel_display_trace.h    |   6 +
 .../drm/i915/display/intel_display_types.h    |  32 +-
 drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
 drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
 drivers/gpu/drm/i915/display/intel_dp_aux.c   |   6 +
 drivers/gpu/drm/i915/display/intel_dpio_phy.c |   9 +-
 drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 +
 drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
 drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
 drivers/gpu/drm/i915/display/intel_drrs.c     |   1 +
 drivers/gpu/drm/i915/display/intel_dsb.c      | 124 +++++--
 drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 +-
 drivers/gpu/drm/i915/display/intel_fb.c       | 108 ++++--
 drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
 drivers/gpu/drm/i915/display/intel_fbc.c      |  49 ++-
 drivers/gpu/drm/i915/display/intel_fbdev.c    | 108 +++++-
 .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 +-----
 .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +---
 drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
 drivers/gpu/drm/i915/display/intel_hdcp.c     |   9 +-
 drivers/gpu/drm/i915/display/intel_hdmi.c     |   1 -
 .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
 .../drm/i915/display/intel_modeset_setup.c    |  11 +-
 drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
 drivers/gpu/drm/i915/display/intel_overlay.c  |  14 -
 .../gpu/drm/i915/display/intel_pch_display.h  |  16 +
 .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
 drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
 .../drm/i915/display/intel_plane_initial.c    |   3 +-
 drivers/gpu/drm/i915/display/intel_psr.c      |   1 +
 drivers/gpu/drm/i915/display/intel_sprite.c   |  21 ++
 drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
 drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
 drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
 .../drm/i915/display/skl_universal_plane.c    |  52 ++-
 drivers/gpu/drm/i915/display/skl_watermark.c  |  25 +-
 drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   4 -
 drivers/gpu/drm/i915/gem/i915_gem_domain.c    |   7 -
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 -
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |  25 --
 drivers/gpu/drm/i915/gem/i915_gem_object.h    |  22 --
 drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   4 -
 drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
 drivers/gpu/drm/i915/i915_driver.c            |   1 +
 drivers/gpu/drm/i915/i915_gem.c               |   8 -
 drivers/gpu/drm/i915/i915_gem_gtt.c           |   1 -
 drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
 drivers/gpu/drm/i915/i915_vma.c               |  12 -
 drivers/gpu/drm/radeon/radeon.h               |  55 +--
 drivers/gpu/drm/radeon/radeon_ib.c            |  12 +-
 drivers/gpu/drm/radeon/radeon_object.h        |  25 +-
 drivers/gpu/drm/radeon/radeon_sa.c            | 314 ++---------------
 drivers/gpu/drm/radeon/radeon_semaphore.c     |   6 +-
 drivers/gpu/drm/scheduler/sched_main.c        | 182 +++++++---
 drivers/gpu/drm/ttm/ttm_bo.c                  |   3 +-
 drivers/misc/mei/hdcp/Kconfig                 |   2 +-
 drivers/misc/mei/hdcp/mei_hdcp.c              |   3 +-
 include/drm/drm_pt_walk.h                     | 161 +++++++++
 include/drm/drm_suballoc.h                    | 112 ++++++
 include/drm/gpu_scheduler.h                   |  41 ++-
 sound/hda/hdac_i915.c                         |  17 +-
 sound/pci/hda/hda_intel.c                     |  56 +--
 sound/soc/intel/avs/core.c                    |  13 +-
 sound/soc/sof/intel/hda.c                     |   7 +-
 98 files changed, 2076 insertions(+), 1325 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_pt_walk.c
 create mode 100644 drivers/gpu/drm/drm_suballoc.c
 create mode 100644 include/drm/drm_pt_walk.h
 create mode 100644 include/drm/drm_suballoc.h

-- 
2.37.3


* [Intel-gfx] [RFC PATCH 01/20] drm/suballoc: Introduce a generic suballocation manager
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Initially we tried to leverage the amdgpu suballocation manager. It turns
out, however, that it tries extremely hard not to enable signalling on the
fences that hold the memory up for freeing, which makes it hard to understand
and to fix potential issues with it.

So in a simplification effort, introduce a drm suballocation manager as a
wrapper around an existing range allocator (drm_mm) that avoids using queues
for freeing. This avoids throttling on free, which is undesirable since the
throttling typically needs to be done uninterruptibly.

This variant is probably more cpu-hungry but can be improved at the cost
of additional complexity. Ideas for that are documented in the
drm_suballoc.c file.
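
For reference, a minimal usage sketch of the helper as introduced by this
patch (sizes are arbitrary example values; "fence" is assumed to be the
driver's dma_fence tracking GPU use of the range):

	struct drm_suballoc_manager mgr;
	struct drm_suballoc *sa;

	/* Manage a 64 KiB range, handing out 256-byte aligned chunks. */
	drm_suballoc_manager_init(&mgr, SZ_64K, 256);

	/* May sleep (interruptibly here) until space is available. */
	sa = drm_suballoc_new(&mgr, 1024, GFP_KERNEL, true);
	if (!IS_ERR(sa)) {
		u64 offset = drm_suballoc_soffset(sa); /* start within the range */

		/* ... GPU uses [offset, offset + drm_suballoc_size(sa)) ... */

		/* The range becomes reusable once fence signals. */
		drm_suballoc_free(sa, fence);
	}

	/* All fences must have signaled before teardown. */
	drm_suballoc_manager_fini(&mgr);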

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/Kconfig        |   4 +
 drivers/gpu/drm/Makefile       |   3 +
 drivers/gpu/drm/drm_suballoc.c | 301 +++++++++++++++++++++++++++++++++
 include/drm/drm_suballoc.h     | 112 ++++++++++++
 4 files changed, 420 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_suballoc.c
 create mode 100644 include/drm/drm_suballoc.h

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 663ea8f9966d..ad231a68c2a5 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -233,6 +233,10 @@ config DRM_GEM_SHMEM_HELPER
 	help
 	  Choose this if you need the GEM shmem helper functions
 
+config DRM_SUBALLOC_HELPER
+	tristate
+	depends on DRM
+
 config DRM_SCHED
 	tristate
 	depends on DRM
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 496fa5a6147a..23ad760884b2 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
 drm_shmem_helper-y := drm_gem_shmem_helper.o
 obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
 
+drm_suballoc_helper-y := drm_suballoc.o
+obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
+
 drm_vram_helper-y := drm_gem_vram_helper.o
 obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
 
diff --git a/drivers/gpu/drm/drm_suballoc.c b/drivers/gpu/drm/drm_suballoc.c
new file mode 100644
index 000000000000..6e0292dea548
--- /dev/null
+++ b/drivers/gpu/drm/drm_suballoc.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include <drm/drm_suballoc.h>
+
+/**
+ * DOC: Overview
+ * This suballocator is intended to be a wrapper around a range allocator
+ * that is also aware of deferred range freeing with fences. Currently
+ * we hard-code drm_mm as the range allocator.
+ * The approach, while rather simple, suffers from a number of performance
+ * issues that can all be fixed, if needed, at the tradeoff of more and/or
+ * more complex code:
+ *
+ * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either code a
+ * much simpler range allocator, or let the caller decide by providing
+ * ops that wrap any range allocator. Also could avoid waking up unless
+ * there is a reasonable chance of enough space in the range manager.
+ *
+ * 2) We unnecessarily install the fence callbacks too early, forcing
+ * enable_signaling() too early causing extra driver effort. This is likely
+ * not an issue if used with the drm_scheduler since it calls
+ * enable_signaling() early anyway.
+ *
+ * 3) Long processing in irq (disabled) context. We've mostly worked around
+ * that already by using the idle_list. If that workaround is deemed too
+ * complex for little gain, we can remove it and use spin_lock_irq()
+ * throughout the manager. If we want to shorten processing in irq context
+ * even further, we can skip the spin_trylock in __drm_suballoc_free() and
+ * avoid freeing allocations from irq context altogether. However drm_mm
+ * should be quite fast at freeing ranges.
+ *
+ * 4) Shrinker that starts processing the list items in 2) and 3) to play
+ * better with the system.
+ */
+
+static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager);
+
+/**
+ * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
+ * @sa_manager: pointer to the sa_manager
+ * @size: number of bytes we want to suballocate
+ * @align: alignment for each suballocated chunk
+ *
+ * Prepares the suballocation manager for suballocations.
+ */
+void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
+			       u64 size, u64 align)
+{
+	spin_lock_init(&sa_manager->lock);
+	spin_lock_init(&sa_manager->idle_list_lock);
+	mutex_init(&sa_manager->alloc_mutex);
+	drm_mm_init(&sa_manager->mm, 0, size);
+	init_waitqueue_head(&sa_manager->wq);
+	sa_manager->range_size = size;
+	sa_manager->alignment = align;
+	INIT_LIST_HEAD(&sa_manager->idle_list);
+}
+EXPORT_SYMBOL(drm_suballoc_manager_init);
+
+/**
+ * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
+ * @sa_manager: pointer to the sa_manager
+ *
+ * Cleans up the suballocation manager after use. All fences added
+ * with drm_suballoc_free() must be signaled, or we cannot clean up
+ * the entire manager.
+ */
+void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager)
+{
+	drm_suballoc_process_idle(sa_manager);
+	drm_mm_takedown(&sa_manager->mm);
+	mutex_destroy(&sa_manager->alloc_mutex);
+}
+EXPORT_SYMBOL(drm_suballoc_manager_fini);
+
+static void __drm_suballoc_free(struct drm_suballoc *sa)
+{
+	struct drm_suballoc_manager *sa_manager = sa->manager;
+	struct dma_fence *fence;
+
+	/*
+	 * In order to avoid protecting the potentially lengthy drm_mm manager
+	 * *allocation* processing with an irq-disabling lock,
+	 * defer touching the drm_mm for freeing until we're in task context,
+	 * with no irqs disabled, or happen to succeed in taking the manager
+	 * lock.
+	 */
+	if (!in_task() || irqs_disabled()) {
+		unsigned long irqflags;
+
+		if (spin_trylock(&sa_manager->lock))
+			goto locked;
+
+		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
+		list_add_tail(&sa->idle_link, &sa_manager->idle_list);
+		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
+		wake_up(&sa_manager->wq);
+		return;
+	}
+
+	spin_lock(&sa_manager->lock);
+locked:
+	drm_mm_remove_node(&sa->node);
+
+	fence = sa->fence;
+	sa->fence = NULL;
+	spin_unlock(&sa_manager->lock);
+	/* Maybe only wake if first mm hole is sufficiently large? */
+	wake_up(&sa_manager->wq);
+	dma_fence_put(fence);
+	kfree(sa);
+}
+
+/* Free all deferred idle allocations */
+static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager)
+{
+	/*
+	 * prepare_to_wait() / wake_up() semantics ensure that any list
+	 * addition that was done before wake_up() is visible when
+	 * this code is called from the wait loop.
+	 */
+	if (!list_empty_careful(&sa_manager->idle_list)) {
+		struct drm_suballoc *sa, *next;
+		unsigned long irqflags;
+		LIST_HEAD(list);
+
+		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
+		list_splice_init(&sa_manager->idle_list, &list);
+		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
+
+		list_for_each_entry_safe(sa, next, &list, idle_link)
+			__drm_suballoc_free(sa);
+	}
+}
+
+static void
+drm_suballoc_fence_signaled(struct dma_fence *fence, struct dma_fence_cb *cb)
+{
+	struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
+
+	__drm_suballoc_free(sa);
+}
+
+static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
+{
+	struct drm_suballoc_manager *sa_manager = sa->manager;
+	int err;
+
+	drm_suballoc_process_idle(sa_manager);
+	spin_lock(&sa_manager->lock);
+	err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
+					 sa_manager->alignment, 0,
+					 DRM_MM_INSERT_EVICT);
+	spin_unlock(&sa_manager->lock);
+	return err;
+}
+
+/**
+ * drm_suballoc_new() - Make a suballocation.
+ * @sa_manager: pointer to the sa_manager
+ * @size: number of bytes we want to suballocate.
+ * @gfp: Allocation context.
+ * @intr: Whether to sleep interruptibly if sleeping.
+ *
+ * Try to make a suballocation of size @size, which will be rounded
+ * up to the alignment specified in drm_suballoc_manager_init().
+ *
+ * Return: a new suballocation, or an ERR_PTR on error.
+ */
+struct drm_suballoc*
+drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
+		 gfp_t gfp, bool intr)
+{
+	struct drm_suballoc *sa;
+	DEFINE_WAIT(wait);
+	int err = 0;
+
+	if (size > sa_manager->range_size)
+		return ERR_PTR(-ENOSPC);
+
+	sa = kzalloc(sizeof(*sa), gfp);
+	if (!sa)
+		return ERR_PTR(-ENOMEM);
+
+	/* Avoid starvation using the alloc_mutex */
+	if (intr)
+		err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
+	else
+		mutex_lock(&sa_manager->alloc_mutex);
+	if (err) {
+		kfree(sa);
+		return ERR_PTR(err);
+	}
+
+	sa->manager = sa_manager;
+	err = drm_suballoc_tryalloc(sa, size);
+	if (err != -ENOSPC)
+		goto out;
+
+	for (;;) {
+		prepare_to_wait(&sa_manager->wq, &wait,
+				intr ? TASK_INTERRUPTIBLE :
+				TASK_UNINTERRUPTIBLE);
+
+		err = drm_suballoc_tryalloc(sa, size);
+		if (err != -ENOSPC)
+			break;
+
+		if (intr && signal_pending(current)) {
+			err = -ERESTARTSYS;
+			break;
+		}
+
+		io_schedule();
+	}
+	finish_wait(&sa_manager->wq, &wait);
+
+out:
+	mutex_unlock(&sa_manager->alloc_mutex);
+	if (!sa->node.size) {
+		kfree(sa);
+		WARN_ON(!err);
+		sa = ERR_PTR(err);
+	}
+
+	return sa;
+}
+EXPORT_SYMBOL(drm_suballoc_new);
+
+/**
+ * drm_suballoc_free() - Free a suballocation
+ * @sa: pointer to the suballocation
+ * @fence: fence that signals when the suballocation is idle, or NULL
+ * if the suballocation is idle already
+ *
+ * Free the suballocation. The suballocation can be re-used after @fence
+ * signals.
+ */
+void
+drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
+{
+	if (!sa)
+		return;
+
+	if (!fence || dma_fence_is_signaled(fence)) {
+		__drm_suballoc_free(sa);
+		return;
+	}
+
+	sa->fence = dma_fence_get(fence);
+	if (dma_fence_add_callback(fence, &sa->cb, drm_suballoc_fence_signaled))
+		__drm_suballoc_free(sa);
+}
+EXPORT_SYMBOL(drm_suballoc_free);
+
+#ifdef CONFIG_DEBUG_FS
+
+/**
+ * drm_suballoc_dump_debug_info() - Dump the suballocator state
+ * @sa_manager: The suballoc manager.
+ * @p: Pointer to a drm printer for output.
+ * @suballoc_base: Constant to add to the suballocated offsets on printout.
+ *
+ * This function dumps the suballocator state. Note that the caller has
+ * to explicitly order frees and calls to this function in order for the
+ * freed node to show up as protected by a fence.
+ */
+void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
+				  struct drm_printer *p, u64 suballoc_base)
+{
+	const struct drm_mm_node *entry;
+
+	spin_lock(&sa_manager->lock);
+	drm_mm_for_each_node(entry, &sa_manager->mm) {
+		struct drm_suballoc *sa =
+			container_of(entry, typeof(*sa), node);
+
+		drm_printf(p, " ");
+		drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
+			   (unsigned long long)suballoc_base + entry->start,
+			   (unsigned long long)suballoc_base + entry->start +
+			   entry->size, (unsigned long long)entry->size);
+
+		if (sa->fence)
+			drm_printf(p, " protected by 0x%016llx on context %llu",
+				   (unsigned long long)sa->fence->seqno,
+				   (unsigned long long)sa->fence->context);
+
+		drm_printf(p, "\n");
+	}
+	spin_unlock(&sa_manager->lock);
+}
+EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
+#endif
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_DESCRIPTION("Simple range suballocator helper");
+MODULE_LICENSE("GPL and additional rights");
diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
new file mode 100644
index 000000000000..910952b3383b
--- /dev/null
+++ b/include/drm/drm_suballoc.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+#ifndef _DRM_SUBALLOC_H_
+#define _DRM_SUBALLOC_H_
+
+#include <drm/drm_mm.h>
+
+#include <linux/dma-fence.h>
+#include <linux/types.h>
+
+/**
+ * struct drm_suballoc_manager - Wrapper for fenced range allocations
+ * @mm: The range manager. Protected by @lock.
+ * @range_size: The total size of the range.
+ * @alignment: Range alignment.
+ * @wq: Wait queue for sleeping allocations on contention.
+ * @idle_list: List of idle but not yet freed allocations. Protected by
+ * @idle_list_lock.
+ *
+ */
+struct drm_suballoc_manager {
+	/** @lock: Manager lock. Protects @mm. */
+	spinlock_t lock;
+	/**
+	 * @idle_list_lock: Lock to protect the idle_list.
+	 * Disable irqs when locking.
+	 */
+	spinlock_t idle_list_lock;
+	/** @alloc_mutex: Mutex to protect against starvation. */
+	struct mutex alloc_mutex;
+	struct drm_mm mm;
+	u64 range_size;
+	u64 alignment;
+	wait_queue_head_t wq;
+	struct list_head idle_list;
+};
+
+/**
+ * struct drm_suballoc: Suballocated range.
+ * @node: The drm_mm representation of the range.
+ * @fence: dma-fence indicating whether allocation is active or idle.
+ * Assigned when freeing the allocation, so it needs no protection.
+ * @cb: dma-fence callback structure. Used for callbacks when the fence signals.
+ * @manager: The struct drm_suballoc_manager the range belongs to. Immutable.
+ * @idle_link: Link for the manager idle_list. Protected by the
+ * drm_suballoc_manager::idle_list_lock.
+ */
+struct drm_suballoc {
+	struct drm_mm_node node;
+	struct dma_fence *fence;
+	struct dma_fence_cb cb;
+	struct drm_suballoc_manager *manager;
+	struct list_head idle_link;
+};
+
+void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
+			       u64 size, u64 align);
+
+void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager);
+
+struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager *sa_manager,
+				      u64 size, gfp_t gfp, bool intr);
+
+void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence);
+
+/**
+ * drm_suballoc_soffset - Range start.
+ * @sa: The struct drm_suballoc.
+ *
+ * Return: The start of the allocated range.
+ */
+static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
+{
+	return sa->node.start;
+}
+
+/**
+ * drm_suballoc_eoffset - Range end.
+ * @sa: The struct drm_suballoc.
+ *
+ * Return: The end of the allocated range + 1.
+ */
+static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
+{
+	return sa->node.start + sa->node.size;
+}
+
+/**
+ * drm_suballoc_size - Range size.
+ * @sa: The struct drm_suballoc.
+ *
+ * Return: The size of the allocated range.
+ */
+static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
+{
+	return sa->node.size;
+}
+
+#ifdef CONFIG_DEBUG_FS
+void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
+				  struct drm_printer *p, u64 suballoc_base);
+#else
+static inline void
+drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
+			     struct drm_printer *p, u64 suballoc_base)
+{ }
+
+#endif
+
+#endif /* _DRM_SUBALLOC_H_ */
-- 
2.37.3


* [Intel-gfx] [RFC PATCH 02/20] drm/amd: Convert amdgpu to use suballocation helper.
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Now that we have a generic suballocation helper, use it in amdgpu.
The debug output is slightly different and suballocation may be
slightly more cpu-hungry.
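
The conversion pattern, in short (matching the diff below): embed the
generic manager in the driver-side struct and translate back with
container_of() where driver-specific data is needed:

	struct amdgpu_sa_manager {
		struct drm_suballoc_manager base;  /* generic suballocator state */
		struct amdgpu_bo *bo;              /* backing buffer object      */
		uint64_t gpu_addr;
		void *cpu_ptr;
	};

	/* GPU address of a chunk = BO base address + suballocated offset. */
	static inline uint64_t amdgpu_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
	{
		return to_amdgpu_sa_manager(sa_bo->manager)->gpu_addr +
		       drm_suballoc_soffset(sa_bo);
	}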

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/Kconfig                    |   1 +
 drivers/gpu/drm/amd/amdgpu/Kconfig         |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  26 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  23 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c     | 320 ++-------------------
 7 files changed, 43 insertions(+), 336 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index ad231a68c2a5..de45c3c059f0 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -78,6 +78,7 @@ config DRM_KUNIT_TEST
 	select DRM_DISPLAY_HELPER
 	select DRM_LIB_RANDOM
 	select DRM_KMS_HELPER
+	select DRM_SUBALLOC_HELPER
 	select DRM_BUDDY
 	select DRM_EXPORT_FOR_TESTS if m
 	select DRM_KUNIT_TEST_HELPERS
diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
index 5fcd510f1abb..eef179b81d0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/Kconfig
+++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
@@ -16,6 +16,7 @@ config DRM_AMDGPU
 	select BACKLIGHT_CLASS_DEVICE
 	select INTERVAL_TREE
 	select DRM_BUDDY
+	select DRM_SUBALLOC_HELPER
 	# amdgpu depends on ACPI_VIDEO when ACPI is enabled, for select to work
 	# ACPI_VIDEO's dependencies must also be selected.
 	select INPUT if ACPI
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 2644cd991210..009903e21d83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -422,29 +422,11 @@ struct amdgpu_clock {
  * alignment).
  */
 
-#define AMDGPU_SA_NUM_FENCE_LISTS	32
-
 struct amdgpu_sa_manager {
-	wait_queue_head_t	wq;
-	struct amdgpu_bo	*bo;
-	struct list_head	*hole;
-	struct list_head	flist[AMDGPU_SA_NUM_FENCE_LISTS];
-	struct list_head	olist;
-	unsigned		size;
-	uint64_t		gpu_addr;
-	void			*cpu_ptr;
-	uint32_t		domain;
-	uint32_t		align;
-};
-
-/* sub-allocation buffer */
-struct amdgpu_sa_bo {
-	struct list_head		olist;
-	struct list_head		flist;
-	struct amdgpu_sa_manager	*manager;
-	unsigned			soffset;
-	unsigned			eoffset;
-	struct dma_fence	        *fence;
+	struct drm_suballoc_manager	base;
+	struct amdgpu_bo		*bo;
+	uint64_t			gpu_addr;
+	void				*cpu_ptr;
 };
 
 int amdgpu_fence_slab_init(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index bcccc348dbe2..5621b63c7f42 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -69,7 +69,7 @@ int amdgpu_ib_get(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 
 	if (size) {
 		r = amdgpu_sa_bo_new(&adev->ib_pools[pool_type],
-				      &ib->sa_bo, size, 256);
+				      &ib->sa_bo, size);
 		if (r) {
 			dev_err(adev->dev, "failed to get a new IB (%d)\n", r);
 			return r;
@@ -309,8 +309,7 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
 
 	for (i = 0; i < AMDGPU_IB_POOL_MAX; i++) {
 		r = amdgpu_sa_bo_manager_init(adev, &adev->ib_pools[i],
-					      AMDGPU_IB_POOL_SIZE,
-					      AMDGPU_GPU_PAGE_SIZE,
+					      AMDGPU_IB_POOL_SIZE, 256,
 					      AMDGPU_GEM_DOMAIN_GTT);
 		if (r)
 			goto error;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 93207badf83f..568baf15d5b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -336,15 +336,22 @@ uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
 /*
  * sub allocation
  */
+static inline struct amdgpu_sa_manager *
+to_amdgpu_sa_manager(struct drm_suballoc_manager *manager)
+{
+	return container_of(manager, struct amdgpu_sa_manager, base);
+}
 
-static inline uint64_t amdgpu_sa_bo_gpu_addr(struct amdgpu_sa_bo *sa_bo)
+static inline uint64_t amdgpu_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->gpu_addr + sa_bo->soffset;
+	return to_amdgpu_sa_manager(sa_bo->manager)->gpu_addr +
+		drm_suballoc_soffset(sa_bo);
 }
 
-static inline void * amdgpu_sa_bo_cpu_addr(struct amdgpu_sa_bo *sa_bo)
+static inline void * amdgpu_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
+	return to_amdgpu_sa_manager(sa_bo->manager)->cpu_ptr +
+		drm_suballoc_soffset(sa_bo);
 }
 
 int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
@@ -355,11 +362,11 @@ void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
 int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
 				      struct amdgpu_sa_manager *sa_manager);
 int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
-		     struct amdgpu_sa_bo **sa_bo,
-		     unsigned size, unsigned align);
+		     struct drm_suballoc **sa_bo,
+		     unsigned size);
 void amdgpu_sa_bo_free(struct amdgpu_device *adev,
-			      struct amdgpu_sa_bo **sa_bo,
-			      struct dma_fence *fence);
+		       struct drm_suballoc **sa_bo,
+		       struct dma_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 void amdgpu_sa_bo_dump_debug_info(struct amdgpu_sa_manager *sa_manager,
 					 struct seq_file *m);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index f752c7ae7f60..6dd2a3b7e434 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -27,6 +27,7 @@
 #include <drm/amdgpu_drm.h>
 #include <drm/gpu_scheduler.h>
 #include <drm/drm_print.h>
+#include <drm/drm_suballoc.h>
 
 struct amdgpu_device;
 struct amdgpu_ring;
@@ -92,7 +93,7 @@ enum amdgpu_ib_pool_type {
 };
 
 struct amdgpu_ib {
-	struct amdgpu_sa_bo		*sa_bo;
+	struct drm_suballoc		*sa_bo;
 	uint32_t			length_dw;
 	uint64_t			gpu_addr;
 	uint32_t			*ptr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
index 524d10b21041..e7b3539e0294 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
@@ -44,327 +44,61 @@
 
 #include "amdgpu.h"
 
-static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo);
-static void amdgpu_sa_bo_try_free(struct amdgpu_sa_manager *sa_manager);
-
 int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
 			      struct amdgpu_sa_manager *sa_manager,
-			      unsigned size, u32 align, u32 domain)
+			      unsigned size, u32 suballoc_align, u32 domain)
 {
-	int i, r;
-
-	init_waitqueue_head(&sa_manager->wq);
-	sa_manager->bo = NULL;
-	sa_manager->size = size;
-	sa_manager->domain = domain;
-	sa_manager->align = align;
-	sa_manager->hole = &sa_manager->olist;
-	INIT_LIST_HEAD(&sa_manager->olist);
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-		INIT_LIST_HEAD(&sa_manager->flist[i]);
+	int r;
 
-	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
+	r = amdgpu_bo_create_kernel(adev, size, AMDGPU_GPU_PAGE_SIZE, domain, &sa_manager->bo,
 				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
 	if (r) {
 		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
-	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
+	memset(sa_manager->cpu_ptr, 0, size);
+	drm_suballoc_manager_init(&sa_manager->base, size, suballoc_align);
 	return r;
 }
 
 void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
 			       struct amdgpu_sa_manager *sa_manager)
 {
-	struct amdgpu_sa_bo *sa_bo, *tmp;
-
 	if (sa_manager->bo == NULL) {
 		dev_err(adev->dev, "no bo for sa manager\n");
 		return;
 	}
 
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		amdgpu_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		amdgpu_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 
 	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
-	sa_manager->size = 0;
 }
 
-static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)
-{
-	struct amdgpu_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	dma_fence_put(sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void amdgpu_sa_bo_try_free(struct amdgpu_sa_manager *sa_manager)
+int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned size)
 {
-	struct amdgpu_sa_bo *sa_bo, *tmp;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size, GFP_KERNEL, true);
 
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
 
-	sa_bo = list_entry(sa_manager->hole->next, struct amdgpu_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL ||
-		    !dma_fence_is_signaled(sa_bo->fence)) {
-			return;
-		}
-		amdgpu_sa_bo_remove_locked(sa_bo);
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned amdgpu_sa_bo_hole_soffset(struct amdgpu_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct amdgpu_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned amdgpu_sa_bo_hole_eoffset(struct amdgpu_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct amdgpu_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool amdgpu_sa_bo_try_alloc(struct amdgpu_sa_manager *sa_manager,
-				   struct amdgpu_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	eoffset = amdgpu_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * amdgpu_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool amdgpu_sa_event(struct amdgpu_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-		if (!list_empty(&sa_manager->flist[i]))
-			return true;
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	eoffset = amdgpu_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool amdgpu_sa_bo_next_hole(struct amdgpu_sa_manager *sa_manager,
-				   struct dma_fence **fences,
-				   unsigned *tries)
-{
-	struct amdgpu_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i) {
-		struct amdgpu_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i]))
-			continue;
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct amdgpu_sa_bo, flist);
-
-		if (!dma_fence_is_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		uint32_t idx = best_bo->fence->context;
-
-		idx %= AMDGPU_SA_NUM_FENCE_LISTS;
-		++tries[idx];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		amdgpu_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
-		     struct amdgpu_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct dma_fence *fences[AMDGPU_SA_NUM_FENCE_LISTS];
-	unsigned tries[AMDGPU_SA_NUM_FENCE_LISTS];
-	unsigned count;
-	int i, r;
-	signed long t;
-
-	if (WARN_ON_ONCE(align > sa_manager->align))
-		return -EINVAL;
-
-	if (WARN_ON_ONCE(size > sa_manager->size))
-		return -EINVAL;
-
-	*sa_bo = kmalloc(sizeof(struct amdgpu_sa_bo), GFP_KERNEL);
-	if (!(*sa_bo))
-		return -ENOMEM;
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-			tries[i] = 0;
-
-		do {
-			amdgpu_sa_bo_try_free(sa_manager);
-
-			if (amdgpu_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (amdgpu_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0, count = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-			if (fences[i])
-				fences[count++] = dma_fence_get(fences[i]);
-
-		if (count) {
-			spin_unlock(&sa_manager->wq.lock);
-			t = dma_fence_wait_any_timeout(fences, count, false,
-						       MAX_SCHEDULE_TIMEOUT,
-						       NULL);
-			for (i = 0; i < count; ++i)
-				dma_fence_put(fences[i]);
-
-			r = (t > 0) ? 0 : t;
-			spin_lock(&sa_manager->wq.lock);
-		} else {
-			/* if we have nothing to wait for block */
-			r = wait_event_interruptible_locked(
-				sa_manager->wq,
-				amdgpu_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct amdgpu_sa_bo **sa_bo,
+void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct drm_suballoc **sa_bo,
 		       struct dma_fence *fence)
 {
-	struct amdgpu_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !dma_fence_is_signaled(fence)) {
-		uint32_t idx;
-
-		(*sa_bo)->fence = dma_fence_get(fence);
-		idx = fence->context % AMDGPU_SA_NUM_FENCE_LISTS;
-		list_add_tail(&(*sa_bo)->flist, &sa_manager->flist[idx]);
-	} else {
-		amdgpu_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_free(*sa_bo, fence);
 	*sa_bo = NULL;
 }
 
@@ -373,26 +107,8 @@ void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct amdgpu_sa_bo **sa_bo,
 void amdgpu_sa_bo_dump_debug_info(struct amdgpu_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct amdgpu_sa_bo *i;
-
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
+	struct drm_printer p = drm_seq_file_printer(m);
 
-		if (i->fence)
-			seq_printf(m, " protected by 0x%016llx on context %llu",
-				   i->fence->seqno, i->fence->context);
-
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
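
A minimal sketch of the converted flow may help when comparing against
the removed allocator above; it uses only calls visible in this patch,
with adev, job_fence and size as placeholders and error handling
trimmed (illustrative, not a drop-in snippet):

	struct amdgpu_sa_manager mgr;
	struct drm_suballoc *sa;
	int r;

	/* Backing BO is page aligned; suballocations use a 256 byte alignment. */
	r = amdgpu_sa_bo_manager_init(adev, &mgr, AMDGPU_IB_POOL_SIZE, 256,
				      AMDGPU_GEM_DOMAIN_GTT);

	r = amdgpu_sa_bo_new(&mgr, &sa, size);
	if (!r) {
		uint64_t gpu_addr = amdgpu_sa_bo_gpu_addr(sa); /* base + soffset */
		void *cpu_addr = amdgpu_sa_bo_cpu_addr(sa);

		/* ... build and submit commands at cpu_addr / gpu_addr ... */

		/* The range is only recycled once job_fence signals. */
		amdgpu_sa_bo_free(adev, &sa, job_fence);
	}

	amdgpu_sa_bo_manager_fini(adev, &mgr);
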
-- 
2.37.3


* [Intel-gfx] [RFC PATCH 03/20] drm/radeon: Use the drm suballocation manager implementation.
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Use the generic suballocation helper.
Note that the generic suballocator only allows a single alignment per
manager, so we may waste a few more bytes for radeon_semaphore. That
shouldn't be a big deal, and per-allocation alignment could be re-added
if needed. Also, as with amdgpu, the debug output changes slightly and
suballocator CPU usage may be slightly higher.
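
To put a rough number on that: a semaphore is an 8 byte suballocation,
while the shared ring_tmp_bo manager now uses a single 256 byte
suballocation alignment (see the radeon_ib.c hunk below), so each
semaphore can leave up to 248 bytes of padding before the next
allocation. A hypothetical helper for the per-allocation worst case,
assuming each suballocation start is rounded up to the manager-wide
alignment:

	/* Illustrative only; ALIGN() is the kernel's round-up macro. */
	static inline unsigned int sa_worst_case_padding(unsigned int size,
							 unsigned int align)
	{
		return ALIGN(size, align) - size; /* 8 vs 256 -> 248 bytes */
	}
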

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/radeon/radeon.h           |  55 +---
 drivers/gpu/drm/radeon/radeon_ib.c        |  12 +-
 drivers/gpu/drm/radeon/radeon_object.h    |  25 +-
 drivers/gpu/drm/radeon/radeon_sa.c        | 314 ++--------------------
 drivers/gpu/drm/radeon/radeon_semaphore.c |   6 +-
 5 files changed, 55 insertions(+), 357 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 57e20780a458..d19a4b1c1a8f 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -79,6 +79,7 @@
 
 #include <drm/drm_gem.h>
 #include <drm/drm_audio_component.h>
+#include <drm/drm_suballoc.h>
 
 #include "radeon_family.h"
 #include "radeon_mode.h"
@@ -511,52 +512,12 @@ struct radeon_bo {
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
-/* sub-allocation manager, it has to be protected by another lock.
- * By conception this is an helper for other part of the driver
- * like the indirect buffer or semaphore, which both have their
- * locking.
- *
- * Principe is simple, we keep a list of sub allocation in offset
- * order (first entry has offset == 0, last entry has the highest
- * offset).
- *
- * When allocating new object we first check if there is room at
- * the end total_size - (last_object_offset + last_object_size) >=
- * alloc_size. If so we allocate new object there.
- *
- * When there is not enough room at the end, we start waiting for
- * each sub object until we reach object_offset+object_size >=
- * alloc_size, this object then become the sub object we return.
- *
- * Alignment can't be bigger than page size.
- *
- * Hole are not considered for allocation to keep things simple.
- * Assumption is that there won't be hole (all object on same
- * alignment).
- */
 struct radeon_sa_manager {
-	wait_queue_head_t	wq;
-	struct radeon_bo	*bo;
-	struct list_head	*hole;
-	struct list_head	flist[RADEON_NUM_RINGS];
-	struct list_head	olist;
-	unsigned		size;
-	uint64_t		gpu_addr;
-	void			*cpu_ptr;
-	uint32_t		domain;
-	uint32_t		align;
-};
-
-struct radeon_sa_bo;
-
-/* sub-allocation buffer */
-struct radeon_sa_bo {
-	struct list_head		olist;
-	struct list_head		flist;
-	struct radeon_sa_manager	*manager;
-	unsigned			soffset;
-	unsigned			eoffset;
-	struct radeon_fence		*fence;
+	struct drm_suballoc_manager	base;
+	struct radeon_bo		*bo;
+	uint64_t			gpu_addr;
+	void				*cpu_ptr;
+	u32 domain;
 };
 
 /*
@@ -587,7 +548,7 @@ int radeon_mode_dumb_mmap(struct drm_file *filp,
  * Semaphores.
  */
 struct radeon_semaphore {
-	struct radeon_sa_bo	*sa_bo;
+	struct drm_suballoc	*sa_bo;
 	signed			waiters;
 	uint64_t		gpu_addr;
 };
@@ -816,7 +777,7 @@ void radeon_irq_kms_disable_hpd(struct radeon_device *rdev, unsigned hpd_mask);
  */
 
 struct radeon_ib {
-	struct radeon_sa_bo		*sa_bo;
+	struct drm_suballoc		*sa_bo;
 	uint32_t			length_dw;
 	uint64_t			gpu_addr;
 	uint32_t			*ptr;
diff --git a/drivers/gpu/drm/radeon/radeon_ib.c b/drivers/gpu/drm/radeon/radeon_ib.c
index 62b116727b4f..63fcfe65d814 100644
--- a/drivers/gpu/drm/radeon/radeon_ib.c
+++ b/drivers/gpu/drm/radeon/radeon_ib.c
@@ -61,7 +61,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 {
 	int r;
 
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, &ib->sa_bo, size);
 	if (r) {
 		dev_err(rdev->dev, "failed to get a new IB (%d)\n", r);
 		return r;
@@ -77,7 +77,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 		/* ib pool is bound at RADEON_VA_IB_OFFSET in virtual address
 		 * space and soffset is the offset inside the pool bo
 		 */
-		ib->gpu_addr = ib->sa_bo->soffset + RADEON_VA_IB_OFFSET;
+		ib->gpu_addr = drm_suballoc_soffset(ib->sa_bo) + RADEON_VA_IB_OFFSET;
 	} else {
 		ib->gpu_addr = radeon_sa_bo_gpu_addr(ib->sa_bo);
 	}
@@ -97,7 +97,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 void radeon_ib_free(struct radeon_device *rdev, struct radeon_ib *ib)
 {
 	radeon_sync_free(rdev, &ib->sync, ib->fence);
-	radeon_sa_bo_free(rdev, &ib->sa_bo, ib->fence);
+	radeon_sa_bo_free(&ib->sa_bo, ib->fence);
 	radeon_fence_unref(&ib->fence);
 }
 
@@ -201,8 +201,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 
 	if (rdev->family >= CHIP_BONAIRE) {
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT,
 					      RADEON_GEM_GTT_WC);
 	} else {
@@ -210,8 +209,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 		 * to the command stream checking
 		 */
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT, 0);
 	}
 	if (r) {
diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
index 0a6ef49e990a..b7c5087a7dbc 100644
--- a/drivers/gpu/drm/radeon/radeon_object.h
+++ b/drivers/gpu/drm/radeon/radeon_object.h
@@ -169,15 +169,22 @@ extern void radeon_bo_fence(struct radeon_bo *bo, struct radeon_fence *fence,
 /*
  * sub allocation
  */
+static inline struct radeon_sa_manager *
+to_radeon_sa_manager(struct drm_suballoc_manager *manager)
+{
+	return container_of(manager, struct radeon_sa_manager, base);
+}
 
-static inline uint64_t radeon_sa_bo_gpu_addr(struct radeon_sa_bo *sa_bo)
+static inline uint64_t radeon_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->gpu_addr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->gpu_addr +
+		drm_suballoc_soffset(sa_bo);
 }
 
-static inline void * radeon_sa_bo_cpu_addr(struct radeon_sa_bo *sa_bo)
+static inline void * radeon_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->cpu_ptr +
+		drm_suballoc_soffset(sa_bo);
 }
 
 extern int radeon_sa_bo_manager_init(struct radeon_device *rdev,
@@ -190,12 +197,10 @@ extern int radeon_sa_bo_manager_start(struct radeon_device *rdev,
 				      struct radeon_sa_manager *sa_manager);
 extern int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 					struct radeon_sa_manager *sa_manager);
-extern int radeon_sa_bo_new(struct radeon_device *rdev,
-			    struct radeon_sa_manager *sa_manager,
-			    struct radeon_sa_bo **sa_bo,
-			    unsigned size, unsigned align);
-extern void radeon_sa_bo_free(struct radeon_device *rdev,
-			      struct radeon_sa_bo **sa_bo,
+extern int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+			    struct drm_suballoc **sa_bo,
+			    unsigned size);
+extern void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 			      struct radeon_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 extern void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
diff --git a/drivers/gpu/drm/radeon/radeon_sa.c b/drivers/gpu/drm/radeon/radeon_sa.c
index 0981948bd9ed..b5555750aa0d 100644
--- a/drivers/gpu/drm/radeon/radeon_sa.c
+++ b/drivers/gpu/drm/radeon/radeon_sa.c
@@ -44,53 +44,31 @@
 
 #include "radeon.h"
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo);
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager);
-
 int radeon_sa_bo_manager_init(struct radeon_device *rdev,
 			      struct radeon_sa_manager *sa_manager,
-			      unsigned size, u32 align, u32 domain, u32 flags)
+			      unsigned size, u32 sa_align, u32 domain, u32 flags)
 {
-	int i, r;
-
-	init_waitqueue_head(&sa_manager->wq);
-	sa_manager->bo = NULL;
-	sa_manager->size = size;
-	sa_manager->domain = domain;
-	sa_manager->align = align;
-	sa_manager->hole = &sa_manager->olist;
-	INIT_LIST_HEAD(&sa_manager->olist);
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		INIT_LIST_HEAD(&sa_manager->flist[i]);
-	}
+	int r;
 
-	r = radeon_bo_create(rdev, size, align, true,
+	r = radeon_bo_create(rdev, size, RADEON_GPU_PAGE_SIZE, true,
 			     domain, flags, NULL, NULL, &sa_manager->bo);
 	if (r) {
 		dev_err(rdev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
+	sa_manager->domain = domain;
+
+	drm_suballoc_manager_init(&sa_manager->base, size, sa_align);
+
 	return r;
 }
 
 void radeon_sa_bo_manager_fini(struct radeon_device *rdev,
 			       struct radeon_sa_manager *sa_manager)
 {
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		radeon_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(rdev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		radeon_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 	radeon_bo_unref(&sa_manager->bo);
-	sa_manager->size = 0;
 }
 
 int radeon_sa_bo_manager_start(struct radeon_device *rdev,
@@ -139,260 +117,33 @@ int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 	return r;
 }
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo)
+int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned size)
 {
-	struct radeon_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	radeon_fence_unref(&sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager)
-{
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size, GFP_KERNEL, true);
 
-	sa_bo = list_entry(sa_manager->hole->next, struct radeon_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL || !radeon_fence_signaled(sa_bo->fence)) {
-			return;
-		}
-		radeon_sa_bo_remove_locked(sa_bo);
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned radeon_sa_bo_hole_soffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct radeon_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned radeon_sa_bo_hole_eoffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct radeon_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool radeon_sa_bo_try_alloc(struct radeon_sa_manager *sa_manager,
-				   struct radeon_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * radeon_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool radeon_sa_event(struct radeon_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		if (!list_empty(&sa_manager->flist[i])) {
-			return true;
-		}
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool radeon_sa_bo_next_hole(struct radeon_sa_manager *sa_manager,
-				   struct radeon_fence **fences,
-				   unsigned *tries)
-{
-	struct radeon_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		struct radeon_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i])) {
-			continue;
-		}
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct radeon_sa_bo, flist);
-
-		if (!radeon_fence_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		++tries[best_bo->fence->ring];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		radeon_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int radeon_sa_bo_new(struct radeon_device *rdev,
-		     struct radeon_sa_manager *sa_manager,
-		     struct radeon_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct radeon_fence *fences[RADEON_NUM_RINGS];
-	unsigned tries[RADEON_NUM_RINGS];
-	int i, r;
-
-	BUG_ON(align > sa_manager->align);
-	BUG_ON(size > sa_manager->size);
-
-	*sa_bo = kmalloc(sizeof(struct radeon_sa_bo), GFP_KERNEL);
-	if ((*sa_bo) == NULL) {
-		return -ENOMEM;
-	}
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			tries[i] = 0;
-
-		do {
-			radeon_sa_bo_try_free(sa_manager);
-
-			if (radeon_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (radeon_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_ref(fences[i]);
-
-		spin_unlock(&sa_manager->wq.lock);
-		r = radeon_fence_wait_any(rdev, fences, false);
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_unref(&fences[i]);
-		spin_lock(&sa_manager->wq.lock);
-		/* if we have nothing to wait for block */
-		if (r == -ENOENT) {
-			r = wait_event_interruptible_locked(
-				sa_manager->wq, 
-				radeon_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
+void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 		       struct radeon_fence *fence)
 {
-	struct radeon_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !radeon_fence_signaled(fence)) {
-		(*sa_bo)->fence = radeon_fence_ref(fence);
-		list_add_tail(&(*sa_bo)->flist,
-			      &sa_manager->flist[fence->ring]);
-	} else {
-		radeon_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	if (fence)
+		drm_suballoc_free(*sa_bo, &fence->base);
+	else
+		drm_suballoc_free(*sa_bo, NULL);
+
 	*sa_bo = NULL;
 }
 
@@ -400,25 +151,8 @@ void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
 void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct radeon_sa_bo *i;
+	struct drm_printer p = drm_seq_file_printer(m);
 
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
-		if (i->fence) {
-			seq_printf(m, " protected by 0x%016llx on ring %d",
-				   i->fence->seq, i->fence->ring);
-		}
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
index 221e59476f64..3e2b0bf0d55d 100644
--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
+++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
@@ -40,8 +40,8 @@ int radeon_semaphore_create(struct radeon_device *rdev,
 	if (*semaphore == NULL) {
 		return -ENOMEM;
 	}
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo,
-			     &(*semaphore)->sa_bo, 8, 8);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo,
+			     &(*semaphore)->sa_bo, 8);
 	if (r) {
 		kfree(*semaphore);
 		*semaphore = NULL;
@@ -100,7 +100,7 @@ void radeon_semaphore_free(struct radeon_device *rdev,
 		dev_err(rdev->dev, "semaphore %p has more waiters than signalers,"
 			" hardware lockup imminent!\n", *semaphore);
 	}
-	radeon_sa_bo_free(rdev, &(*semaphore)->sa_bo, fence);
+	radeon_sa_bo_free(&(*semaphore)->sa_bo, fence);
 	kfree(*semaphore);
 	*semaphore = NULL;
 }
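
One detail worth noting in the radeon_sa.c hunk above: the generic
helper tracks struct dma_fence, while radeon uses its own fence type,
so the free path hands in &fence->base (or NULL, which presumably
releases the range immediately, matching the old unsignaled-fence
path). The converted call sites then look roughly like this sketch,
with rdev and fence as placeholders:

	struct drm_suballoc *sa_bo;
	int r;

	/* Per-call alignment is gone; the manager-wide 256 bytes applies. */
	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, &sa_bo, 8);
	if (!r) {
		uint64_t gpu_addr = radeon_sa_bo_gpu_addr(sa_bo);

		/* ... emit the semaphore at gpu_addr ... */

		radeon_sa_bo_free(&sa_bo, fence);
	}
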
-- 
2.37.3


-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * radeon_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool radeon_sa_event(struct radeon_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		if (!list_empty(&sa_manager->flist[i])) {
-			return true;
-		}
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool radeon_sa_bo_next_hole(struct radeon_sa_manager *sa_manager,
-				   struct radeon_fence **fences,
-				   unsigned *tries)
-{
-	struct radeon_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		struct radeon_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i])) {
-			continue;
-		}
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct radeon_sa_bo, flist);
-
-		if (!radeon_fence_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		++tries[best_bo->fence->ring];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		radeon_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int radeon_sa_bo_new(struct radeon_device *rdev,
-		     struct radeon_sa_manager *sa_manager,
-		     struct radeon_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct radeon_fence *fences[RADEON_NUM_RINGS];
-	unsigned tries[RADEON_NUM_RINGS];
-	int i, r;
-
-	BUG_ON(align > sa_manager->align);
-	BUG_ON(size > sa_manager->size);
-
-	*sa_bo = kmalloc(sizeof(struct radeon_sa_bo), GFP_KERNEL);
-	if ((*sa_bo) == NULL) {
-		return -ENOMEM;
-	}
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			tries[i] = 0;
-
-		do {
-			radeon_sa_bo_try_free(sa_manager);
-
-			if (radeon_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (radeon_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_ref(fences[i]);
-
-		spin_unlock(&sa_manager->wq.lock);
-		r = radeon_fence_wait_any(rdev, fences, false);
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_unref(&fences[i]);
-		spin_lock(&sa_manager->wq.lock);
-		/* if we have nothing to wait for block */
-		if (r == -ENOENT) {
-			r = wait_event_interruptible_locked(
-				sa_manager->wq, 
-				radeon_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
+void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 		       struct radeon_fence *fence)
 {
-	struct radeon_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !radeon_fence_signaled(fence)) {
-		(*sa_bo)->fence = radeon_fence_ref(fence);
-		list_add_tail(&(*sa_bo)->flist,
-			      &sa_manager->flist[fence->ring]);
-	} else {
-		radeon_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	if (fence)
+		drm_suballoc_free(*sa_bo, &fence->base);
+	else
+		drm_suballoc_free(*sa_bo, NULL);
+
 	*sa_bo = NULL;
 }
 
@@ -400,25 +151,8 @@ void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
 void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct radeon_sa_bo *i;
+	struct drm_printer p = drm_seq_file_printer(m);
 
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
-		if (i->fence) {
-			seq_printf(m, " protected by 0x%016llx on ring %d",
-				   i->fence->seq, i->fence->ring);
-		}
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
index 221e59476f64..3e2b0bf0d55d 100644
--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
+++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
@@ -40,8 +40,8 @@ int radeon_semaphore_create(struct radeon_device *rdev,
 	if (*semaphore == NULL) {
 		return -ENOMEM;
 	}
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo,
-			     &(*semaphore)->sa_bo, 8, 8);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo,
+			     &(*semaphore)->sa_bo, 8);
 	if (r) {
 		kfree(*semaphore);
 		*semaphore = NULL;
@@ -100,7 +100,7 @@ void radeon_semaphore_free(struct radeon_device *rdev,
 		dev_err(rdev->dev, "semaphore %p has more waiters than signalers,"
 			" hardware lockup imminent!\n", *semaphore);
 	}
-	radeon_sa_bo_free(rdev, &(*semaphore)->sa_bo, fence);
+	radeon_sa_bo_free(&(*semaphore)->sa_bo, fence);
 	kfree(*semaphore);
 	*semaphore = NULL;
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd, but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order, even when targeting the same
hardware engine. This is because in XE we have a firmware scheduler, the
GuC, which is allowed to reorder, timeslice, and preempt submissions. If
a drm_gpu_scheduler is shared across multiple drm_sched_entity, the TDR
falls apart because it expects submission order == completion order.
Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
problem.

2. In XE, submissions are done by programming a ring buffer (circular
buffer). A drm_gpu_scheduler provides a limit on the number of in-flight
jobs; if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
control on the ring for free (see the sketch below).
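
As a rough illustration of point 2 (a sketch only, not code from this
series: RING_SIZE, MAX_SIZE_PER_JOB, and the example_* names are made-up
placeholders, and the argument list follows drm_sched_init()'s current
prototype), a driver gets this flow control simply by sizing
hw_submission:

#include <linux/jiffies.h>
#include <linux/sizes.h>
#include <drm/gpu_scheduler.h>

#define RING_SIZE		SZ_16K	/* ring buffer bytes per engine */
#define MAX_SIZE_PER_JOB	SZ_1K	/* worst-case ring bytes per job */

static int example_sched_init(struct drm_gpu_scheduler *sched,
			      const struct drm_sched_backend_ops *ops,
			      struct device *dev)
{
	/*
	 * Capping hw_submission at RING_SIZE / MAX_SIZE_PER_JOB means the
	 * scheduler never hands the backend more jobs than the ring can
	 * hold, so no separate ring-space accounting is needed.
	 */
	return drm_sched_init(sched, ops,
			      RING_SIZE / MAX_SIZE_PER_JOB, /* hw_submission */
			      0,                            /* hang_limit */
			      msecs_to_jiffies(5000),       /* timeout */
			      NULL,                         /* timeout_wq */
			      NULL,                         /* score */
			      "example", dev);
}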

A problem with this design is that a drm_gpu_scheduler currently uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_schedulers are used. To work around the scaling issue,
use a worker rather than a kthread for submission / job cleanup.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  12 +-
 drivers/gpu/drm/scheduler/sched_main.c      | 124 ++++++++++++--------
 include/drm/gpu_scheduler.h                 |  13 +-
 4 files changed, 93 insertions(+), 70 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index f60753f97ac5..9c2a10aeb0b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
-		kthread_park(ring->sched.thread);
+		drm_sched_run_wq_stop(&ring->sched);
 	}
 
 	seq_printf(m, "run ib test:\n");
@@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
-		kthread_unpark(ring->sched.thread);
+		drm_sched_run_wq_start(&ring->sched);
 	}
 
 	up_write(&adev->reset_domain->sem);
@@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 
 	ring = adev->rings[val];
 
-	if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
+	if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
 		return -EINVAL;
 
 	/* the last preemption failed */
@@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 		goto pro_end;
 
 	/* stop the scheduler */
-	kthread_park(ring->sched.thread);
+	drm_sched_run_wq_stop(&ring->sched);
 
 	/* preempt the IB */
 	r = amdgpu_ring_preempt_ib(ring);
@@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 
 failure:
 	/* restart the scheduler */
-	kthread_unpark(ring->sched.thread);
+	drm_sched_run_wq_start(&ring->sched);
 
 	up_read(&adev->reset_domain->sem);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 076ae400d099..9552929ccf87 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4577,7 +4577,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
 
 		spin_lock(&ring->sched.job_list_lock);
@@ -4708,7 +4708,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
 
 		/*clear job fence from fence drv to avoid force_completion
@@ -5247,7 +5247,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
+			if (!ring || !ring->sched.ready)
 				continue;
 
 			drm_sched_stop(&ring->sched, job ? &job->base : NULL);
@@ -5321,7 +5321,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
+			if (!ring || !ring->sched.ready)
 				continue;
 
 			drm_sched_start(&ring->sched, true);
@@ -5648,7 +5648,7 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
+			if (!ring || !ring->sched.ready)
 				continue;
 
 			drm_sched_stop(&ring->sched, NULL);
@@ -5776,7 +5776,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->sched.ready)
 			continue;
 
 		drm_sched_start(&ring->sched, true);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 27d52ffbb808..8c64045d0692 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -44,7 +44,6 @@
  * The jobs in a entity are always scheduled in the order that they were pushed.
  */
 
-#include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
@@ -251,6 +250,53 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
 }
 
+/**
+ * drm_sched_run_wq_stop - stop scheduler run worker
+ *
+ * @sched: scheduler instance to stop run worker
+ */
+void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched)
+{
+	sched->pause_run_wq = true;
+	smp_wmb();
+
+	cancel_work_sync(&sched->work_run);
+}
+EXPORT_SYMBOL(drm_sched_run_wq_stop);
+
+/**
+ * drm_sched_run_wq_start - start scheduler run worker
+ *
+ * @sched: scheduler instance to start run worker
+ */
+void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched)
+{
+	sched->pause_run_wq = false;
+	smp_wmb();
+
+	queue_work(sched->run_wq, &sched->work_run);
+}
+EXPORT_SYMBOL(drm_sched_run_wq_start);
+
+/**
+ * drm_sched_run_wq_queue - queue scheduler run worker
+ *
+ * @sched: scheduler instance to queue run worker
+ */
+static void drm_sched_run_wq_queue(struct drm_gpu_scheduler *sched)
+{
+	smp_rmb();
+
+	/*
+	 * Try not to schedule work if pause_run_wq is set, but it is not the
+	 * end of the world if we do: the work will either be cancelled by
+	 * cancel_work_sync() in drm_sched_run_wq_stop(), or drm_sched_main()
+	 * turns into a NOP while pause_run_wq is set.
+	 */
+	if (!sched->pause_run_wq)
+		queue_work(sched->run_wq, &sched->work_run);
+}
+
 /**
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
@@ -270,7 +316,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
 	dma_fence_get(&s_fence->finished);
 	drm_sched_fence_finished(s_fence);
 	dma_fence_put(&s_fence->finished);
-	wake_up_interruptible(&sched->wake_up_worker);
+	drm_sched_run_wq_queue(sched);
 }
 
 /**
@@ -433,7 +479,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 {
 	struct drm_sched_job *s_job, *tmp;
 
-	kthread_park(sched->thread);
+	drm_sched_run_wq_stop(sched);
 
 	/*
 	 * Reinsert back the bad job here - now it's safe as
@@ -546,7 +592,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 		spin_unlock(&sched->job_list_lock);
 	}
 
-	kthread_unpark(sched->thread);
+	drm_sched_run_wq_start(sched);
 }
 EXPORT_SYMBOL(drm_sched_start);
 
@@ -831,7 +877,7 @@ static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_ready(sched))
-		wake_up_interruptible(&sched->wake_up_worker);
+		drm_sched_run_wq_queue(sched);
 }
 
 /**
@@ -941,60 +987,42 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_blocked - check if the scheduler is blocked
- *
- * @sched: scheduler instance
- *
- * Returns true if blocked, otherwise false.
- */
-static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
-{
-	if (kthread_should_park()) {
-		kthread_parkme();
-		return true;
-	}
-
-	return false;
-}
-
 /**
  * drm_sched_main - main scheduler thread
  *
  * @param: scheduler instance
- *
- * Returns 0.
  */
-static int drm_sched_main(void *param)
+static void drm_sched_main(struct work_struct *w)
 {
-	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
+	struct drm_gpu_scheduler *sched =
+		container_of(w, struct drm_gpu_scheduler, work_run);
 	int r;
 
-	sched_set_fifo_low(current);
-
-	while (!kthread_should_stop()) {
-		struct drm_sched_entity *entity = NULL;
+	while (!READ_ONCE(sched->pause_run_wq)) {
+		struct drm_sched_entity *entity;
 		struct drm_sched_fence *s_fence;
 		struct drm_sched_job *sched_job;
 		struct dma_fence *fence;
-		struct drm_sched_job *cleanup_job = NULL;
+		struct drm_sched_job *cleanup_job;
 
-		wait_event_interruptible(sched->wake_up_worker,
-					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
-					 (!drm_sched_blocked(sched) &&
-					  (entity = drm_sched_select_entity(sched))) ||
-					 kthread_should_stop());
+		cleanup_job = drm_sched_get_cleanup_job(sched);
+		entity = drm_sched_select_entity(sched);
 
 		if (cleanup_job)
 			sched->ops->free_job(cleanup_job);
 
-		if (!entity)
+		if (!entity) {
+			if (!cleanup_job)
+				break;
 			continue;
+		}
 
 		sched_job = drm_sched_entity_pop_job(entity);
 
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
+			if (!cleanup_job)
+				break;
 			continue;
 		}
 
@@ -1022,14 +1050,14 @@ static int drm_sched_main(void *param)
 					  r);
 		} else {
 			if (IS_ERR(fence))
-				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
+				dma_fence_set_error(&s_fence->finished,
+						    PTR_ERR(fence));
 
 			drm_sched_job_done(sched_job);
 		}
 
 		wake_up(&sched->job_scheduled);
 	}
-	return 0;
 }
 
 /**
@@ -1054,35 +1082,28 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev)
 {
-	int i, ret;
+	int i;
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
 	sched->timeout = timeout;
 	sched->timeout_wq = timeout_wq ? : system_wq;
+	sched->run_wq = system_wq;	/* FIXME: Let user pass this in */
 	sched->hang_limit = hang_limit;
 	sched->score = score ? score : &sched->_score;
 	sched->dev = dev;
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
-	init_waitqueue_head(&sched->wake_up_worker);
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
+	INIT_WORK(&sched->work_run, drm_sched_main);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
-
-	/* Each scheduler will run on a seperate kernel thread */
-	sched->thread = kthread_run(drm_sched_main, sched, sched->name);
-	if (IS_ERR(sched->thread)) {
-		ret = PTR_ERR(sched->thread);
-		sched->thread = NULL;
-		DRM_DEV_ERROR(sched->dev, "Failed to create scheduler for %s.\n", name);
-		return ret;
-	}
+	sched->pause_run_wq = false;
 
 	sched->ready = true;
 	return 0;
@@ -1101,8 +1122,7 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	drm_sched_run_wq_stop(sched);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index ca857ec9e7eb..ff50f3c289cd 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -456,17 +456,16 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
- * @wake_up_worker: the wait queue on which the scheduler sleeps until a job
- *                  is ready to be scheduled.
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
+ * @run_wq: workqueue used to queue @work_run
  * @timeout_wq: workqueue used to queue @work_tdr
+ * @work_run: schedules jobs and cleans up entities
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
- * @thread: the kthread on which the scheduler which run.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
@@ -475,6 +474,7 @@ struct drm_sched_backend_ops {
  * @_score: score used when the driver doesn't provide one
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
+ * @pause_run_wq: pause queuing of @work_run on @run_wq
  * @dev: system &struct device
  *
  * One scheduler is implemented for each hardware ring.
@@ -485,13 +485,13 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
-	wait_queue_head_t		wake_up_worker;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
+	struct workqueue_struct		*run_wq;
 	struct workqueue_struct		*timeout_wq;
+	struct work_struct		work_run;
 	struct delayed_work		work_tdr;
-	struct task_struct		*thread;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
 	int				hang_limit;
@@ -499,6 +499,7 @@ struct drm_gpu_scheduler {
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
+	bool				pause_run_wq;
 	struct device			*dev;
 };
 
@@ -529,6 +530,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
+void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
+void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC PATCH 05/20] drm/sched: Add generic scheduler message interface
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add a generic scheduler message interface which sends messages to the
backend from the drm_gpu_scheduler main submission thread. The idea is
that some of these messages modify state in drm_sched_entity which is
also modified during submission. By processing these messages and
submissions in the same thread there is no race when changing state in
drm_sched_entity.

This interface will be used in XE, the new Intel GPU driver, to clean
up, suspend, resume, and change scheduling properties of a
drm_sched_entity.

The interface is designed to be generic and extensible, with only the
backend understanding the messages.
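
As a minimal, hypothetical sketch of the intended usage (the foo_* names
and opcodes below are invented for illustration; only struct
drm_sched_msg, drm_sched_add_msg(), and the process_msg hook come from
this patch):

enum foo_msg_opcode {
	FOO_MSG_SUSPEND,
	FOO_MSG_RESUME,
};

struct foo_engine {
	struct drm_gpu_scheduler sched;
	/* ... backend state ... */
};

/*
 * Runs in the scheduler's submission thread, so it cannot race with job
 * submission when touching drm_sched_entity state.
 */
static void foo_process_msg(struct drm_sched_msg *msg)
{
	struct foo_engine *engine = msg->private_data;

	switch (msg->opcode) {
	case FOO_MSG_SUSPEND:
		foo_engine_suspend(engine);	/* hypothetical helper */
		break;
	case FOO_MSG_RESUME:
		foo_engine_resume(engine);	/* hypothetical helper */
		break;
	}

	/* process_msg must free dynamically allocated messages */
	kfree(msg);
}

static int foo_queue_suspend(struct foo_engine *engine)
{
	struct drm_sched_msg *msg = kzalloc(sizeof(*msg), GFP_KERNEL);

	if (!msg)
		return -ENOMEM;

	msg->opcode = FOO_MSG_SUSPEND;
	msg->private_data = engine;
	drm_sched_add_msg(&engine->sched, msg);

	return 0;
}

The backend would then set .process_msg = foo_process_msg in its
drm_sched_backend_ops.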

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 58 +++++++++++++++++++++++++-
 include/drm/gpu_scheduler.h            | 29 ++++++++++++-
 2 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 8c64045d0692..8e688c2fc482 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -987,6 +987,54 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
+/**
+ * drm_sched_add_msg - add scheduler message
+ *
+ * @sched: scheduler instance
+ * @msg: message to be added
+ *
+ * Can and will pass jobs waiting on dependencies or in a runnable queue.
+ * Message processing will stop if the scheduler run wq is stopped and will
+ * resume when the run wq is started.
+ */
+void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_msg *msg)
+{
+	spin_lock(&sched->job_list_lock);
+	list_add_tail(&msg->link, &sched->msgs);
+	spin_unlock(&sched->job_list_lock);
+
+	/*
+	 * As in drm_sched_run_wq_queue above, avoid queueing work if paused;
+	 * harmless if this races.
+	 */
+	if (!sched->pause_run_wq)
+		queue_work(sched->run_wq, &sched->work_run);
+}
+EXPORT_SYMBOL(drm_sched_add_msg);
+
+/**
+ * drm_sched_get_msg - get scheduler message
+ *
+ * @sched: scheduler instance
+ *
+ * Returns NULL or message
+ */
+static struct drm_sched_msg *
+drm_sched_get_msg(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_msg *msg;
+
+	spin_lock(&sched->job_list_lock);
+	msg = list_first_entry_or_null(&sched->msgs,
+				       struct drm_sched_msg, link);
+	if (msg)
+		list_del(&msg->link);
+	spin_unlock(&sched->job_list_lock);
+
+	return msg;
+}
+
 /**
  * drm_sched_main - main scheduler thread
  *
@@ -1000,6 +1048,7 @@ static void drm_sched_main(struct work_struct *w)
 
 	while (!READ_ONCE(sched->pause_run_wq)) {
 		struct drm_sched_entity *entity;
+		struct drm_sched_msg *msg;
 		struct drm_sched_fence *s_fence;
 		struct drm_sched_job *sched_job;
 		struct dma_fence *fence;
@@ -1007,12 +1056,16 @@ static void drm_sched_main(struct work_struct *w)
 
 		cleanup_job = drm_sched_get_cleanup_job(sched);
 		entity = drm_sched_select_entity(sched);
+		msg = drm_sched_get_msg(sched);
 
 		if (cleanup_job)
 			sched->ops->free_job(cleanup_job);
 
+		if (msg)
+			sched->ops->process_msg(msg);
+
 		if (!entity) {
-			if (!cleanup_job)
+			if (!cleanup_job && !msg)
 				break;
 			continue;
 		}
@@ -1021,7 +1074,7 @@ static void drm_sched_main(struct work_struct *w)
 
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
-			if (!cleanup_job)
+			if (!cleanup_job && !msg)
 				break;
 			continue;
 		}
@@ -1097,6 +1150,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
+	INIT_LIST_HEAD(&sched->msgs);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index ff50f3c289cd..31448deb9412 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -369,6 +369,23 @@ enum drm_gpu_sched_stat {
 	DRM_GPU_SCHED_STAT_ENODEV,
 };
 
+/**
+ * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
+ * message
+ *
+ * Generic enough for backend defined messages, backend can expand if needed.
+ */
+struct drm_sched_msg {
+	/** @link: list link into the gpu scheduler list of messages */
+	struct list_head		link;
+	/**
+	 * @private_data: opaque pointer to message private data (backend defined)
+	 */
+	void				*private_data;
+	/** @opcode: opcode of message (backend defined) */
+	unsigned int			opcode;
+};
+
 /**
  * struct drm_sched_backend_ops - Define the backend operations
  *	called by the scheduler
@@ -446,6 +463,12 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @process_msg: Process a message. Allowed to block, it is this
+	 * @process_msg: Process a message. Allowed to block; it is this
+	 * function's responsibility to free the message if dynamically allocated.
+	void (*process_msg)(struct drm_sched_msg *msg);
 };
 
 /**
@@ -456,6 +479,7 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
+ * @msgs: list of messages to be processed in @work_run
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
@@ -463,7 +487,7 @@ struct drm_sched_backend_ops {
  * @job_id_count: used to assign unique id to the each job.
  * @run_wq: workqueue used to queue @work_run
  * @timeout_wq: workqueue used to queue @work_tdr
- * @work_run: schedules jobs and cleans up entities
+ * @work_run: schedules jobs, cleans up jobs, and processes messages
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
@@ -485,6 +509,7 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
+	struct list_head		msgs;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
@@ -530,6 +555,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
+void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_msg *msg);
 void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [Intel-gfx] [RFC PATCH 06/20] drm/sched: Start run wq before TDR in drm_sched_start
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

If the TDR is set to a very small value, it can fire before the run wq
is started in drm_sched_start. The run wq is expected to be running when
the TDR fires; fix the ordering so this expectation is always met.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 8e688c2fc482..f39fdc01c37b 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -586,13 +586,13 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			drm_sched_job_done(s_job);
 	}
 
+	drm_sched_run_wq_start(sched);
+
 	if (full_recovery) {
 		spin_lock(&sched->job_list_lock);
 		drm_sched_start_timeout(sched);
 		spin_unlock(&sched->job_list_lock);
 	}
-
-	drm_sched_run_wq_start(sched);
 }
 EXPORT_SYMBOL(drm_sched_start);
 
-- 
2.37.3


* [RFC PATCH 07/20] drm/sched: Submit job before starting TDR
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

If the TDR is set to a very small value, it can fire before a job is
submitted in drm_sched_main. The job should always be submitted before
the TDR fires; fix this ordering.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f39fdc01c37b..fa25541bb477 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1082,10 +1082,10 @@ static void drm_sched_main(struct work_struct *w)
 		s_fence = sched_job->s_fence;
 
 		atomic_inc(&sched->hw_rq_count);
-		drm_sched_job_begin(sched_job);
 
 		trace_drm_run_job(sched_job, entity);
 		fence = sched->ops->run_job(sched_job);
+		drm_sched_job_begin(sched_job);
 		complete_all(&entity->entity_idle);
 		drm_sched_fence_scheduled(s_fence);
 
-- 
2.37.3


* [Intel-gfx] [RFC PATCH 08/20] drm/sched: Add helper to set TDR timeout
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Add a helper to set the TDR timeout and restart the TDR with the new
timeout value. This will be used in Xe, a new Intel GPU driver, to
trigger the TDR to clean up drm_sched_entity objects that encounter
errors.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fa25541bb477..bdf0541ad818 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -345,6 +345,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
+/**
+ * drm_sched_set_timeout - set timeout for reset worker
+ *
+ * @sched: scheduler instance to set and (re)-start the worker for
+ * @timeout: timeout period
+ *
+ * Set and (re)-start the timeout for the given scheduler.
+ */
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
+{
+	spin_lock(&sched->job_list_lock);
+	sched->timeout = timeout;
+	cancel_delayed_work(&sched->work_tdr);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+EXPORT_SYMBOL(drm_sched_set_timeout);
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 31448deb9412..b9967af1788b 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -553,6 +553,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
                                    unsigned int num_sched_list);
 
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
 void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
-- 
2.37.3


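How Xe will call the helper is not shown in this patch; as a rough
sketch (the example_ wrappers are hypothetical), a driver could arm a
near-immediate TDR to reap an erring entity and restore the usual
timeout once recovery completes. Timeouts are in jiffies, as with
drm_sched_init().

/* Fire the TDR almost immediately so it can clean up a bad entity. */
static void example_expedite_tdr(struct drm_gpu_scheduler *sched)
{
	drm_sched_set_timeout(sched, 1);
}

/* Restore the normal per-job timeout afterwards. */
static void example_restore_tdr(struct drm_gpu_scheduler *sched,
				long default_timeout)
{
	drm_sched_set_timeout(sched, default_timeout);
}
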
* [Intel-gfx] [RFC PATCH 09/20] drm: Add a gpu page-table walker helper
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Add a gpu page table walker similar in functionality to the cpu page-table
walker in mm/pagewalk.c. This is made a drm helper in the hope that it
might prove useful to other drivers, but we could of course make it
single-driver only and rename the functions initially.

Also, if it remains a DRM helper, we should consider making it a helper
kernel module of its own.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/Makefile      |   1 +
 drivers/gpu/drm/drm_pt_walk.c | 159 +++++++++++++++++++++++++++++++++
 include/drm/drm_pt_walk.h     | 161 ++++++++++++++++++++++++++++++++++
 3 files changed, 321 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_pt_walk.c
 create mode 100644 include/drm/drm_pt_walk.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 23ad760884b2..d030c2885dd8 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -39,6 +39,7 @@ drm-y := \
 	drm_prime.o \
 	drm_print.o \
 	drm_property.o \
+	drm_pt_walk.o \
 	drm_syncobj.o \
 	drm_sysfs.o \
 	drm_trace_points.o \
diff --git a/drivers/gpu/drm/drm_pt_walk.c b/drivers/gpu/drm/drm_pt_walk.c
new file mode 100644
index 000000000000..1a0b147a3acc
--- /dev/null
+++ b/drivers/gpu/drm/drm_pt_walk.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+#include <drm/drm_pt_walk.h>
+
+/**
+ * DOC: GPU page-table tree walking.
+ * The utilities in this file are similar to the CPU page-table walk
+ * utilities in mm/pagewalk.c. The main difference is that we distinguish
+ * the various levels of a page-table tree with an unsigned integer rather
+ * than by name. 0 is the lowest level, and page-tables with level 0
+ * cannot be directories pointing to lower levels, whereas all other
+ * levels can. The user of the utilities determines the highest level.
+ *
+ * Nomenclature:
+ * Each struct drm_pt, regardless of level is referred to as a page table, and
+ * multiple page tables typically form a page table tree with page tables at
+ * intermediate levels being page directories pointing at page tables at lower
+ * levels. A shared page table for a given address range is a page-table which
+ * is neither fully within nor fully outside the address range and that can
+ * thus be shared by two or more address ranges.
+ */
+static u64 drm_pt_addr_end(u64 addr, u64 end, unsigned int level,
+			   const struct drm_pt_walk *walk)
+{
+	u64 size = 1ull << walk->shifts[level];
+	u64 tmp = round_up(addr + 1, size);
+
+	return min_t(u64, tmp, end);
+}
+
+static bool drm_pt_next(pgoff_t *offset, u64 *addr, u64 next, u64 end,
+			unsigned int level, const struct drm_pt_walk *walk)
+{
+	pgoff_t step = 1;
+
+	/* Shared pt walk skips to the last pagetable */
+	if (unlikely(walk->shared_pt_mode)) {
+		unsigned int shift = walk->shifts[level];
+		u64 skip_to = round_down(end, 1ull << shift);
+
+		if (skip_to > next) {
+			step += (skip_to - next) >> shift;
+			next = skip_to;
+		}
+	}
+
+	*addr = next;
+	*offset += step;
+
+	return next != end;
+}
+
+/**
+ * drm_pt_walk_range() - Walk a range of a gpu page table tree with callbacks
+ * for each page-table entry in all levels.
+ * @parent: The root page table for walk start.
+ * @level: The root page table level.
+ * @addr: Virtual address start.
+ * @end: Virtual address end + 1.
+ * @walk: Walk info.
+ *
+ * Similar to the CPU page-table walker, this is a helper to walk
+ * a gpu page table and call a provided callback function for each entry.
+ *
+ * Return: 0 on success, negative error code on error. The error is
+ * propagated from the callback and on error the walk is terminated.
+ */
+int drm_pt_walk_range(struct drm_pt *parent, unsigned int level,
+		      u64 addr, u64 end, struct drm_pt_walk *walk)
+{
+	pgoff_t offset = drm_pt_offset(addr, level, walk);
+	struct drm_pt **entries = parent->dir ? parent->dir->entries : NULL;
+	const struct drm_pt_walk_ops *ops = walk->ops;
+	enum page_walk_action action;
+	struct drm_pt *child;
+	int err = 0;
+	u64 next;
+
+	do {
+		next = drm_pt_addr_end(addr, end, level, walk);
+		if (walk->shared_pt_mode && drm_pt_covers(addr, next, level,
+							  walk))
+			continue;
+again:
+		action = ACTION_SUBTREE;
+		child = entries ? entries[offset] : NULL;
+		err = ops->pt_entry(parent, offset, level, addr, next,
+				    &child, &action, walk);
+		if (err)
+			break;
+
+		/* Probably not needed yet for gpu pagetable walk. */
+		if (unlikely(action == ACTION_AGAIN))
+			goto again;
+
+		if (likely(!level || !child || action == ACTION_CONTINUE))
+			continue;
+
+		err = drm_pt_walk_range(child, level - 1, addr, next, walk);
+
+		if (!err && ops->pt_post_descend)
+			err = ops->pt_post_descend(parent, offset, level, addr,
+						   next, &child, &action, walk);
+		if (err)
+			break;
+
+	} while (drm_pt_next(&offset, &addr, next, end, level, walk));
+
+	return err;
+}
+EXPORT_SYMBOL(drm_pt_walk_range);
+
+/**
+ * drm_pt_walk_shared() - Walk shared page tables of a page-table tree.
+ * @parent: Root page table directory.
+ * @level: Level of the root.
+ * @addr: Start address.
+ * @end: Last address + 1.
+ * @walk: Walk info.
+ *
+ * This function is similar to drm_pt_walk_range() but it skips page tables
+ * that are private to the range. Since the root (or @parent) page table is
+ * typically also a shared page table this function is different in that it
+ * calls the pt_entry callback and the post_descend callback also for the
+ * root. The root can be detected in the callbacks by checking whether
+ * parent == *child.
+ * Walking only the shared page tables is common for unbind-type operations
+ * where the page-table entries for an address range are cleared or detached
+ * from the main page-table tree.
+ *
+ * Return: 0 on success, negative error code on error: If a callback
+ * returns an error, the walk will be terminated and the error returned by
+ * this function.
+ */
+int drm_pt_walk_shared(struct drm_pt *parent, unsigned int level,
+		       u64 addr, u64 end, struct drm_pt_walk *walk)
+{
+	const struct drm_pt_walk_ops *ops = walk->ops;
+	enum page_walk_action action = ACTION_SUBTREE;
+	struct drm_pt *child = parent;
+	int err;
+
+	walk->shared_pt_mode = true;
+	err = ops->pt_entry(parent, 0, level + 1, addr, end,
+			    &child, &action, walk);
+
+	if (err || action != ACTION_SUBTREE)
+		return err;
+
+	err = drm_pt_walk_range(parent, level, addr, end, walk);
+	if (!err && ops->pt_post_descend) {
+		err = ops->pt_post_descend(parent, 0, level + 1, addr, end,
+					   &child, &action, walk);
+	}
+	return err;
+}
+EXPORT_SYMBOL(drm_pt_walk_shared);
diff --git a/include/drm/drm_pt_walk.h b/include/drm/drm_pt_walk.h
new file mode 100644
index 000000000000..64e7a418217c
--- /dev/null
+++ b/include/drm/drm_pt_walk.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+#ifndef __DRM_PT_WALK__
+#define __DRM_PT_WALK__
+
+#include <linux/pagewalk.h>
+#include <linux/types.h>
+
+struct drm_pt_dir;
+
+/**
+ * struct drm_pt - base class for driver pagetable subclassing.
+ * @dir: Pointer to an array of children if any.
+ *
+ * Drivers could subclass this, and if it's a page-directory, typically
+ * embed the drm_pt_dir::entries array in the same allocation.
+ */
+struct drm_pt {
+	struct drm_pt_dir *dir;
+};
+
+/**
+ * struct drm_pt_dir - page directory structure
+ * @entries: Array holding page directory children.
+ *
+ * It is the responsibility of the user to ensure @entries is
+ * correctly sized.
+ */
+struct drm_pt_dir {
+	struct drm_pt *entries[0];
+};
+
+/**
+ * struct drm_pt_walk - Embeddable struct for walk parameters
+ */
+struct drm_pt_walk {
+	/** @ops: The walk ops used for the pagewalk */
+	const struct drm_pt_walk_ops *ops;
+	/**
+	 * @shifts: Array of page-table entry shifts used for the
+	 * different levels, starting out with the leaf level 0
+	 * page-shift as the first entry. It's legal for this pointer to be
+	 * changed during the walk.
+	 */
+	const u64 *shifts;
+	/** @max_level: Highest populated level in @shifts */
+	unsigned int max_level;
+	/**
+	 * @shared_pt_mode: Whether to skip all entries that are private
+	 * to the address range, so that callbacks are made only for entries
+	 * that are shared with other address ranges. Such entries are
+	 * referred to as shared pagetables.
+	 */
+	bool shared_pt_mode;
+};
+
+/**
+ * typedef drm_pt_entry_fn - gpu page-table-walk callback-function
+ * @parent: The parent page table.
+ * @offset: The offset (number of entries) into the page table.
+ * @level: The level of @parent.
+ * @addr: The virtual address.
+ * @next: The virtual address for the next call, or end address.
+ * @child: Pointer to pointer to child page-table at this @offset. The
+ * function may modify the value pointed to if, for example, allocating a
+ * child page table.
+ * @action: The walk action to take upon return. See <linux/pagewalk.h>.
+ * @walk: The walk parameters.
+ */
+typedef int (*drm_pt_entry_fn)(struct drm_pt *parent, pgoff_t offset,
+			       unsigned int level, u64 addr, u64 next,
+			       struct drm_pt **child,
+			       enum page_walk_action *action,
+			       struct drm_pt_walk *walk);
+
+/**
+ * struct drm_pt_walk_ops - Walk callbacks.
+ */
+struct drm_pt_walk_ops {
+	/**
+	 * @pt_entry: Callback to be called for each page table entry prior
+	 * to descending to the next level. The returned value of the action
+	 * function parameter is honored.
+	 */
+	drm_pt_entry_fn pt_entry;
+	/**
+	 * @pt_post_descend: Callback to be called for each page table entry
+	 * after return from descending to the next level. The returned value
+	 * of the action function parameter is ignored.
+	 */
+	drm_pt_entry_fn pt_post_descend;
+};
+
+int drm_pt_walk_range(struct drm_pt *parent, unsigned int level,
+		      u64 addr, u64 end, struct drm_pt_walk *walk);
+
+int drm_pt_walk_shared(struct drm_pt *parent, unsigned int level,
+		       u64 addr, u64 end, struct drm_pt_walk *walk);
+
+/**
+ * drm_pt_covers - Whether the address range covers an entire entry in @level
+ * @addr: Start of the range.
+ * @end: End of range + 1.
+ * @level: Page table level.
+ * @walk: Page table walk info.
+ *
+ * This function is a helper to aid in determining whether a leaf page table
+ * entry can be inserted at this @level.
+ *
+ * Return: Whether the range provided covers exactly an entry at this level.
+ */
+static inline bool drm_pt_covers(u64 addr, u64 end, unsigned int level,
+				 const struct drm_pt_walk *walk)
+{
+	u64 pt_size = 1ull << walk->shifts[level];
+
+	return end - addr == pt_size && IS_ALIGNED(addr, pt_size);
+}
+
+/**
+ * drm_pt_num_entries() - Number of page-table entries of a given range at this
+ * level
+ * @addr: Start address.
+ * @end: End address.
+ * @level: Page table level.
+ * @walk: Walk info.
+ *
+ * Return: The number of page table entries at this level between @addr
+ * and @end.
+ */
+static inline pgoff_t
+drm_pt_num_entries(u64 addr, u64 end, unsigned int level,
+		   const struct drm_pt_walk *walk)
+{
+	u64 pt_size = 1ull << walk->shifts[level];
+
+	return (round_up(end, pt_size) - round_down(addr, pt_size)) >>
+		walk->shifts[level];
+}
+
+/**
+ * drm_pt_offset() - Offset of the page-table entry for a given address.
+ * @addr: The address.
+ * @level: Page table level.
+ * @walk: Walk info.
+ *
+ * Return: The page table entry offset for the given address in a
+ * page table with size indicated by @level.
+ */
+static inline pgoff_t
+drm_pt_offset(u64 addr, unsigned int level, const struct drm_pt_walk *walk)
+{
+	if (level < walk->max_level)
+		addr &= ((1ull << walk->shifts[level + 1]) - 1);
+
+	return addr >> walk->shifts[level];
+}
+
+#endif
-- 
2.37.3


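To make the callback flow concrete, a driver-side sketch follows. The
shifts table models a 4-level, 4 KiB-leaf layout (x86-64-like: a level-0
entry maps 1 << 12 bytes, a level-1 entry 1 << 21, and so on);
everything prefixed example_ is hypothetical and not part of this patch.

static const u64 example_shifts[] = { 12, 21, 30, 39 };

struct example_walk {
	struct drm_pt_walk base;
	u64 leaves;
};

static int example_pt_entry(struct drm_pt *parent, pgoff_t offset,
			    unsigned int level, u64 addr, u64 next,
			    struct drm_pt **child,
			    enum page_walk_action *action,
			    struct drm_pt_walk *walk)
{
	struct example_walk *ewalk =
		container_of(walk, struct example_walk, base);

	if (!level) {
		ewalk->leaves++;
	} else if (drm_pt_covers(addr, next, level, walk)) {
		/* [addr, next) exactly covers this entry, so a huge
		 * leaf could be inserted here; count it and skip the
		 * subtree.
		 */
		ewalk->leaves++;
		*action = ACTION_CONTINUE;
	}

	return 0;
}

static const struct drm_pt_walk_ops example_ops = {
	.pt_entry = example_pt_entry,
};

/* Count potential (huge-)leaf entries for [addr, end) under a level-3 root. */
static u64 example_count_leaves(struct drm_pt *root, u64 addr, u64 end)
{
	struct example_walk ewalk = {
		.base = {
			.ops = &example_ops,
			.shifts = example_shifts,
			.max_level = 3,
		},
	};

	drm_pt_walk_range(root, 3, addr, end, &ewalk.base);
	return ewalk.leaves;
}
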
* [RFC PATCH 10/20] drm/ttm: Don't print error message if eviction was interrupted
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Avoid printing an error message if eviction was interrupted by,
for example, the user pressing CTRL-C. That may happen if eviction
is waiting for something, such as a free batch-buffer.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index cd266a067773..e60aaa3299e7 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -465,7 +465,8 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo,
 	if (ret == -EMULTIHOP) {
 		ret = ttm_bo_bounce_temp_buffer(bo, &evict_mem, ctx, &hop);
 		if (ret) {
-			pr_err("Buffer eviction failed\n");
+			if (ret != -ERESTARTSYS && ret != -EINTR)
+				pr_err("Buffer eviction failed\n");
 			ttm_resource_free(bo, &evict_mem);
 			goto out;
 		}
-- 
2.37.3


* [RFC PATCH 11/20] drm/i915: Remove gem and overlay frontbuffer tracking
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Frontbuffer update handling should be done explicitly by using dirtyfb
calls only.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/display/i9xx_plane.c     |  1 +
 drivers/gpu/drm/i915/display/intel_drrs.c     |  1 +
 drivers/gpu/drm/i915/display/intel_fb.c       |  1 +
 drivers/gpu/drm/i915/display/intel_overlay.c  | 14 -----------
 .../drm/i915/display/intel_plane_initial.c    |  1 +
 drivers/gpu/drm/i915/display/intel_psr.c      |  1 +
 .../drm/i915/display/skl_universal_plane.c    |  1 +
 drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |  4 ---
 drivers/gpu/drm/i915/gem/i915_gem_domain.c    |  7 ------
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  2 --
 drivers/gpu/drm/i915/gem/i915_gem_object.c    | 25 -------------------
 drivers/gpu/drm/i915/gem/i915_gem_object.h    | 22 ----------------
 drivers/gpu/drm/i915/gem/i915_gem_phys.c      |  4 ---
 drivers/gpu/drm/i915/i915_driver.c            |  1 +
 drivers/gpu/drm/i915/i915_gem.c               |  8 ------
 drivers/gpu/drm/i915/i915_gem_gtt.c           |  1 -
 drivers/gpu/drm/i915/i915_vma.c               | 12 ---------
 17 files changed, 7 insertions(+), 99 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/i9xx_plane.c b/drivers/gpu/drm/i915/display/i9xx_plane.c
index ecaeb7dc196b..633e462d96a0 100644
--- a/drivers/gpu/drm/i915/display/i9xx_plane.c
+++ b/drivers/gpu/drm/i915/display/i9xx_plane.c
@@ -17,6 +17,7 @@
 #include "intel_display_types.h"
 #include "intel_fb.h"
 #include "intel_fbc.h"
+#include "intel_frontbuffer.h"
 #include "intel_sprite.h"
 
 /* Primary plane formats for gen <= 3 */
diff --git a/drivers/gpu/drm/i915/display/intel_drrs.c b/drivers/gpu/drm/i915/display/intel_drrs.c
index 5b9e44443814..3503d112387d 100644
--- a/drivers/gpu/drm/i915/display/intel_drrs.c
+++ b/drivers/gpu/drm/i915/display/intel_drrs.c
@@ -9,6 +9,7 @@
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_drrs.h"
+#include "intel_frontbuffer.h"
 #include "intel_panel.h"
 
 /**
diff --git a/drivers/gpu/drm/i915/display/intel_fb.c b/drivers/gpu/drm/i915/display/intel_fb.c
index 63137ae5ab21..7cf31c87884c 100644
--- a/drivers/gpu/drm/i915/display/intel_fb.c
+++ b/drivers/gpu/drm/i915/display/intel_fb.c
@@ -12,6 +12,7 @@
 #include "intel_display_types.h"
 #include "intel_dpt.h"
 #include "intel_fb.h"
+#include "intel_frontbuffer.h"
 
 #define check_array_bounds(i915, a, i) drm_WARN_ON(&(i915)->drm, (i) >= ARRAY_SIZE(a))
 
diff --git a/drivers/gpu/drm/i915/display/intel_overlay.c b/drivers/gpu/drm/i915/display/intel_overlay.c
index c12bdca8da9b..5b86563ce577 100644
--- a/drivers/gpu/drm/i915/display/intel_overlay.c
+++ b/drivers/gpu/drm/i915/display/intel_overlay.c
@@ -186,7 +186,6 @@ struct intel_overlay {
 	struct intel_crtc *crtc;
 	struct i915_vma *vma;
 	struct i915_vma *old_vma;
-	struct intel_frontbuffer *frontbuffer;
 	bool active;
 	bool pfit_active;
 	u32 pfit_vscale_ratio; /* shifted-point number, (1<<12) == 1.0 */
@@ -287,20 +286,9 @@ static void intel_overlay_flip_prepare(struct intel_overlay *overlay,
 				       struct i915_vma *vma)
 {
 	enum pipe pipe = overlay->crtc->pipe;
-	struct intel_frontbuffer *frontbuffer = NULL;
 
 	drm_WARN_ON(&overlay->i915->drm, overlay->old_vma);
 
-	if (vma)
-		frontbuffer = intel_frontbuffer_get(vma->obj);
-
-	intel_frontbuffer_track(overlay->frontbuffer, frontbuffer,
-				INTEL_FRONTBUFFER_OVERLAY(pipe));
-
-	if (overlay->frontbuffer)
-		intel_frontbuffer_put(overlay->frontbuffer);
-	overlay->frontbuffer = frontbuffer;
-
 	intel_frontbuffer_flip_prepare(overlay->i915,
 				       INTEL_FRONTBUFFER_OVERLAY(pipe));
 
@@ -810,8 +798,6 @@ static int intel_overlay_do_put_image(struct intel_overlay *overlay,
 		goto out_pin_section;
 	}
 
-	i915_gem_object_flush_frontbuffer(new_bo, ORIGIN_DIRTYFB);
-
 	if (!overlay->active) {
 		const struct intel_crtc_state *crtc_state =
 			overlay->crtc->config;
diff --git a/drivers/gpu/drm/i915/display/intel_plane_initial.c b/drivers/gpu/drm/i915/display/intel_plane_initial.c
index 76be796df255..cad9c8884af3 100644
--- a/drivers/gpu/drm/i915/display/intel_plane_initial.c
+++ b/drivers/gpu/drm/i915/display/intel_plane_initial.c
@@ -9,6 +9,7 @@
 #include "intel_display.h"
 #include "intel_display_types.h"
 #include "intel_fb.h"
+#include "intel_frontbuffer.h"
 #include "intel_plane_initial.h"
 
 static bool
diff --git a/drivers/gpu/drm/i915/display/intel_psr.c b/drivers/gpu/drm/i915/display/intel_psr.c
index 9820e5fdd087..bc998b526d88 100644
--- a/drivers/gpu/drm/i915/display/intel_psr.c
+++ b/drivers/gpu/drm/i915/display/intel_psr.c
@@ -33,6 +33,7 @@
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_dp_aux.h"
+#include "intel_frontbuffer.h"
 #include "intel_hdmi.h"
 #include "intel_psr.h"
 #include "intel_snps_phy.h"
diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c b/drivers/gpu/drm/i915/display/skl_universal_plane.c
index 4b79c2d2d617..2f5524f380b0 100644
--- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
+++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
@@ -16,6 +16,7 @@
 #include "intel_display_types.h"
 #include "intel_fb.h"
 #include "intel_fbc.h"
+#include "intel_frontbuffer.h"
 #include "intel_psr.h"
 #include "intel_sprite.h"
 #include "skl_scaler.h"
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
index b3b398fe689c..df2db78b10ca 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
@@ -6,8 +6,6 @@
 
 #include <drm/drm_cache.h>
 
-#include "display/intel_frontbuffer.h"
-
 #include "i915_drv.h"
 #include "i915_gem_clflush.h"
 #include "i915_sw_fence_work.h"
@@ -22,8 +20,6 @@ static void __do_clflush(struct drm_i915_gem_object *obj)
 {
 	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
 	drm_clflush_sg(obj->mm.pages);
-
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
 }
 
 static void clflush_work(struct dma_fence_work *base)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_domain.c b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
index 9969e687ad85..cd5505da4884 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_domain.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
@@ -4,7 +4,6 @@
  * Copyright © 2014-2016 Intel Corporation
  */
 
-#include "display/intel_frontbuffer.h"
 #include "gt/intel_gt.h"
 
 #include "i915_drv.h"
@@ -65,8 +64,6 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
 				intel_gt_flush_ggtt_writes(vma->vm->gt);
 		}
 		spin_unlock(&obj->vma.lock);
-
-		i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
 		break;
 
 	case I915_GEM_DOMAIN_WC:
@@ -629,9 +626,6 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 out_unlock:
 	i915_gem_object_unlock(obj);
 
-	if (!err && write_domain)
-		i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
-
 out:
 	i915_gem_object_put(obj);
 	return err;
@@ -742,7 +736,6 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
 	}
 
 out:
-	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
 	obj->mm.dirty = true;
 	/* return with the pages pinned */
 	return 0;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index f98600ca7557..08f84d4f4f92 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -11,8 +11,6 @@
 
 #include <drm/drm_syncobj.h>
 
-#include "display/intel_frontbuffer.h"
-
 #include "gem/i915_gem_ioctls.h"
 #include "gt/intel_context.h"
 #include "gt/intel_gpu_commands.h"
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index 1a0886b8aaa1..d2fef38cd12e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -27,7 +27,6 @@
 
 #include <drm/drm_cache.h>
 
-#include "display/intel_frontbuffer.h"
 #include "pxp/intel_pxp.h"
 
 #include "i915_drv.h"
@@ -400,30 +399,6 @@ static void i915_gem_free_object(struct drm_gem_object *gem_obj)
 		queue_work(i915->wq, &i915->mm.free_work);
 }
 
-void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
-					 enum fb_op_origin origin)
-{
-	struct intel_frontbuffer *front;
-
-	front = __intel_frontbuffer_get(obj);
-	if (front) {
-		intel_frontbuffer_flush(front, origin);
-		intel_frontbuffer_put(front);
-	}
-}
-
-void __i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
-					      enum fb_op_origin origin)
-{
-	struct intel_frontbuffer *front;
-
-	front = __intel_frontbuffer_get(obj);
-	if (front) {
-		intel_frontbuffer_invalidate(front, origin);
-		intel_frontbuffer_put(front);
-	}
-}
-
 static void
 i915_gem_object_read_from_page_kmap(struct drm_i915_gem_object *obj, u64 offset, void *dst, int size)
 {
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 3db53769864c..90dba761889c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -11,7 +11,6 @@
 #include <drm/drm_file.h>
 #include <drm/drm_device.h>
 
-#include "display/intel_frontbuffer.h"
 #include "intel_memory_region.h"
 #include "i915_gem_object_types.h"
 #include "i915_gem_gtt.h"
@@ -573,27 +572,6 @@ int i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
 				  unsigned int flags,
 				  const struct i915_sched_attr *attr);
 
-void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
-					 enum fb_op_origin origin);
-void __i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
-					      enum fb_op_origin origin);
-
-static inline void
-i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
-				  enum fb_op_origin origin)
-{
-	if (unlikely(rcu_access_pointer(obj->frontbuffer)))
-		__i915_gem_object_flush_frontbuffer(obj, origin);
-}
-
-static inline void
-i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
-				       enum fb_op_origin origin)
-{
-	if (unlikely(rcu_access_pointer(obj->frontbuffer)))
-		__i915_gem_object_invalidate_frontbuffer(obj, origin);
-}
-
 int i915_gem_object_read_from_page(struct drm_i915_gem_object *obj, u64 offset, void *dst, int size);
 
 bool i915_gem_object_is_shmem(const struct drm_i915_gem_object *obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_phys.c b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
index 68453572275b..4cf57676e180 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_phys.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
@@ -156,15 +156,11 @@ int i915_gem_object_pwrite_phys(struct drm_i915_gem_object *obj,
 	 * We manually control the domain here and pretend that it
 	 * remains coherent i.e. in the GTT domain, like shmem_pwrite.
 	 */
-	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
-
 	if (copy_from_user(vaddr, user_data, args->size))
 		return -EFAULT;
 
 	drm_clflush_virt_range(vaddr, args->size);
 	intel_gt_chipset_flush(to_gt(i915));
-
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index c1e427ba57ae..f4201f9c5f84 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -346,6 +346,7 @@ static int i915_driver_early_probe(struct drm_i915_private *dev_priv)
 
 	spin_lock_init(&dev_priv->irq_lock);
 	spin_lock_init(&dev_priv->gpu_error.lock);
+	spin_lock_init(&dev_priv->display.fb_tracking.lock);
 	mutex_init(&dev_priv->display.backlight.lock);
 
 	mutex_init(&dev_priv->sb_lock);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 969581e7106f..594891291735 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -40,7 +40,6 @@
 #include <drm/drm_vma_manager.h>
 
 #include "display/intel_display.h"
-#include "display/intel_frontbuffer.h"
 
 #include "gem/i915_gem_clflush.h"
 #include "gem/i915_gem_context.h"
@@ -569,8 +568,6 @@ i915_gem_gtt_pwrite_fast(struct drm_i915_gem_object *obj,
 		goto out_rpm;
 	}
 
-	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
-
 	user_data = u64_to_user_ptr(args->data_ptr);
 	offset = args->offset;
 	remain = args->size;
@@ -613,7 +610,6 @@ i915_gem_gtt_pwrite_fast(struct drm_i915_gem_object *obj,
 	}
 
 	intel_gt_flush_ggtt_writes(ggtt->vm.gt);
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
 
 	i915_gem_gtt_cleanup(obj, &node, vma);
 out_rpm:
@@ -700,8 +696,6 @@ i915_gem_shmem_pwrite(struct drm_i915_gem_object *obj,
 		offset = 0;
 	}
 
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
-
 	i915_gem_object_unpin_pages(obj);
 	return ret;
 
@@ -1272,8 +1266,6 @@ void i915_gem_init_early(struct drm_i915_private *dev_priv)
 {
 	i915_gem_init__mm(dev_priv);
 	i915_gem_init__contexts(dev_priv);
-
-	spin_lock_init(&dev_priv->display.fb_tracking.lock);
 }
 
 void i915_gem_cleanup_early(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 7bd1861ddbdf..a9662cc6ed1e 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -15,7 +15,6 @@
 #include <asm/set_memory.h>
 #include <asm/smp.h>
 
-#include "display/intel_frontbuffer.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_requests.h"
 
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 7d044888ac33..e3b73175805b 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -26,7 +26,6 @@
 #include <linux/dma-fence-array.h>
 #include <drm/drm_gem.h>
 
-#include "display/intel_frontbuffer.h"
 #include "gem/i915_gem_lmem.h"
 #include "gem/i915_gem_tiling.h"
 #include "gt/intel_engine.h"
@@ -1901,17 +1900,6 @@ int _i915_vma_move_to_active(struct i915_vma *vma,
 			return err;
 	}
 
-	if (flags & EXEC_OBJECT_WRITE) {
-		struct intel_frontbuffer *front;
-
-		front = __intel_frontbuffer_get(obj);
-		if (unlikely(front)) {
-			if (intel_frontbuffer_invalidate(front, ORIGIN_CS))
-				i915_active_add_request(&front->write, rq);
-			intel_frontbuffer_put(front);
-		}
-	}
-
 	if (fence) {
 		struct dma_fence *curr;
 		enum dma_resv_usage usage;
-- 
2.37.3


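With the implicit tracking gone from GEM, frontbuffer flushing is driven
from userspace through the DIRTYFB ioctl. A minimal userspace sketch
using libdrm follows; drmModeDirtyFB() is the stock libdrm wrapper,
while the fd/fb_id plumbing around it is assumed.

#include <xf86drmMode.h>

/* Mark the whole framebuffer dirty after CPU rendering; pass an array
 * of clip rectangles instead of NULL to flush only part of it.
 */
static int example_flush_frontbuffer(int drm_fd, uint32_t fb_id)
{
	return drmModeDirtyFB(drm_fd, fb_id, NULL, 0);
}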
 				  unsigned int flags,
 				  const struct i915_sched_attr *attr);
 
-void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
-					 enum fb_op_origin origin);
-void __i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
-					      enum fb_op_origin origin);
-
-static inline void
-i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
-				  enum fb_op_origin origin)
-{
-	if (unlikely(rcu_access_pointer(obj->frontbuffer)))
-		__i915_gem_object_flush_frontbuffer(obj, origin);
-}
-
-static inline void
-i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
-				       enum fb_op_origin origin)
-{
-	if (unlikely(rcu_access_pointer(obj->frontbuffer)))
-		__i915_gem_object_invalidate_frontbuffer(obj, origin);
-}
-
 int i915_gem_object_read_from_page(struct drm_i915_gem_object *obj, u64 offset, void *dst, int size);
 
 bool i915_gem_object_is_shmem(const struct drm_i915_gem_object *obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_phys.c b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
index 68453572275b..4cf57676e180 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_phys.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
@@ -156,15 +156,11 @@ int i915_gem_object_pwrite_phys(struct drm_i915_gem_object *obj,
 	 * We manually control the domain here and pretend that it
 	 * remains coherent i.e. in the GTT domain, like shmem_pwrite.
 	 */
-	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
-
 	if (copy_from_user(vaddr, user_data, args->size))
 		return -EFAULT;
 
 	drm_clflush_virt_range(vaddr, args->size);
 	intel_gt_chipset_flush(to_gt(i915));
-
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index c1e427ba57ae..f4201f9c5f84 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -346,6 +346,7 @@ static int i915_driver_early_probe(struct drm_i915_private *dev_priv)
 
 	spin_lock_init(&dev_priv->irq_lock);
 	spin_lock_init(&dev_priv->gpu_error.lock);
+	spin_lock_init(&dev_priv->display.fb_tracking.lock);
 	mutex_init(&dev_priv->display.backlight.lock);
 
 	mutex_init(&dev_priv->sb_lock);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 969581e7106f..594891291735 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -40,7 +40,6 @@
 #include <drm/drm_vma_manager.h>
 
 #include "display/intel_display.h"
-#include "display/intel_frontbuffer.h"
 
 #include "gem/i915_gem_clflush.h"
 #include "gem/i915_gem_context.h"
@@ -569,8 +568,6 @@ i915_gem_gtt_pwrite_fast(struct drm_i915_gem_object *obj,
 		goto out_rpm;
 	}
 
-	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
-
 	user_data = u64_to_user_ptr(args->data_ptr);
 	offset = args->offset;
 	remain = args->size;
@@ -613,7 +610,6 @@ i915_gem_gtt_pwrite_fast(struct drm_i915_gem_object *obj,
 	}
 
 	intel_gt_flush_ggtt_writes(ggtt->vm.gt);
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
 
 	i915_gem_gtt_cleanup(obj, &node, vma);
 out_rpm:
@@ -700,8 +696,6 @@ i915_gem_shmem_pwrite(struct drm_i915_gem_object *obj,
 		offset = 0;
 	}
 
-	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
-
 	i915_gem_object_unpin_pages(obj);
 	return ret;
 
@@ -1272,8 +1266,6 @@ void i915_gem_init_early(struct drm_i915_private *dev_priv)
 {
 	i915_gem_init__mm(dev_priv);
 	i915_gem_init__contexts(dev_priv);
-
-	spin_lock_init(&dev_priv->display.fb_tracking.lock);
 }
 
 void i915_gem_cleanup_early(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 7bd1861ddbdf..a9662cc6ed1e 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -15,7 +15,6 @@
 #include <asm/set_memory.h>
 #include <asm/smp.h>
 
-#include "display/intel_frontbuffer.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_requests.h"
 
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 7d044888ac33..e3b73175805b 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -26,7 +26,6 @@
 #include <linux/dma-fence-array.h>
 #include <drm/drm_gem.h>
 
-#include "display/intel_frontbuffer.h"
 #include "gem/i915_gem_lmem.h"
 #include "gem/i915_gem_tiling.h"
 #include "gt/intel_engine.h"
@@ -1901,17 +1900,6 @@ int _i915_vma_move_to_active(struct i915_vma *vma,
 			return err;
 	}
 
-	if (flags & EXEC_OBJECT_WRITE) {
-		struct intel_frontbuffer *front;
-
-		front = __intel_frontbuffer_get(obj);
-		if (unlikely(front)) {
-			if (intel_frontbuffer_invalidate(front, ORIGIN_CS))
-				i915_active_add_request(&front->write, rq);
-			intel_frontbuffer_put(front);
-		}
-	}
-
 	if (fence) {
 		struct dma_fence *curr;
 		enum dma_resv_usage usage;
-- 
2.37.3



* [Intel-gfx] [RFC PATCH 12/20] drm/i915/display: Neuter frontbuffer tracking harder
From: Matthew Brost @ 2022-12-22 22:21 UTC
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Remove the intel_frontbuffer type and store the frontbuffer tracking
bits directly in struct intel_framebuffer (fb->bits).
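
The per-object intel_frontbuffer carried a kref, an RCU head and an
i915_active write tracker; with the bits stored as an atomic_t in
struct intel_framebuffer itself, none of that lifetime machinery is
needed and intel_frontbuffer_get()/intel_frontbuffer_put() go away.
A minimal sketch of the resulting call-site change (old_fb/new_fb are
hypothetical stand-ins for plane_state->hw.fb, as in the hunks below):

	/* before: resolve the separately refcounted tracking object */
	intel_frontbuffer_track(to_intel_frontbuffer(old_fb),
				to_intel_frontbuffer(new_fb),
				plane->frontbuffer_bit);

	/* after: the atomic bits field lives in the framebuffer itself */
	intel_frontbuffer_track(to_intel_framebuffer(old_fb),
				to_intel_framebuffer(new_fb),
				plane->frontbuffer_bit);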

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/display/intel_cursor.c   |   6 +-
 drivers/gpu/drm/i915/display/intel_display.c  |   4 +-
 .../drm/i915/display/intel_display_types.h    |   8 +-
 drivers/gpu/drm/i915/display/intel_fb.c       |  11 +-
 drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
 drivers/gpu/drm/i915/display/intel_fbdev.c    |   7 +-
 .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 ++----------------
 .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +++---------
 .../drm/i915/display/intel_plane_initial.c    |   2 +-
 9 files changed, 32 insertions(+), 182 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_cursor.c b/drivers/gpu/drm/i915/display/intel_cursor.c
index d190fa0d393b..371009f8e194 100644
--- a/drivers/gpu/drm/i915/display/intel_cursor.c
+++ b/drivers/gpu/drm/i915/display/intel_cursor.c
@@ -692,10 +692,10 @@ intel_legacy_cursor_update(struct drm_plane *_plane,
 	if (ret)
 		goto out_free;
 
-	intel_frontbuffer_flush(to_intel_frontbuffer(new_plane_state->hw.fb),
+	intel_frontbuffer_flush(to_intel_framebuffer(new_plane_state->hw.fb),
 				ORIGIN_CURSOR_UPDATE);
-	intel_frontbuffer_track(to_intel_frontbuffer(old_plane_state->hw.fb),
-				to_intel_frontbuffer(new_plane_state->hw.fb),
+	intel_frontbuffer_track(to_intel_framebuffer(old_plane_state->hw.fb),
+				to_intel_framebuffer(new_plane_state->hw.fb),
 				plane->frontbuffer_bit);
 
 	/* Swap plane state */
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index e75b9b2a0e01..7a6191cad52a 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -7663,8 +7663,8 @@ static void intel_atomic_track_fbs(struct intel_atomic_state *state)
 
 	for_each_oldnew_intel_plane_in_state(state, plane, old_plane_state,
 					     new_plane_state, i)
-		intel_frontbuffer_track(to_intel_frontbuffer(old_plane_state->hw.fb),
-					to_intel_frontbuffer(new_plane_state->hw.fb),
+		intel_frontbuffer_track(to_intel_framebuffer(old_plane_state->hw.fb),
+					to_intel_framebuffer(new_plane_state->hw.fb),
 					plane->frontbuffer_bit);
 }
 
diff --git a/drivers/gpu/drm/i915/display/intel_display_types.h b/drivers/gpu/drm/i915/display/intel_display_types.h
index 32e8b2fc3cc6..34250a9cf3e1 100644
--- a/drivers/gpu/drm/i915/display/intel_display_types.h
+++ b/drivers/gpu/drm/i915/display/intel_display_types.h
@@ -132,7 +132,7 @@ struct intel_fb_view {
 
 struct intel_framebuffer {
 	struct drm_framebuffer base;
-	struct intel_frontbuffer *frontbuffer;
+	atomic_t bits;
 
 	/* Params to remap the FB pages and program the plane registers in each view. */
 	struct intel_fb_view normal_view;
@@ -2056,10 +2056,4 @@ static inline u32 intel_plane_ggtt_offset(const struct intel_plane_state *plane_
 	return i915_ggtt_offset(plane_state->ggtt_vma);
 }
 
-static inline struct intel_frontbuffer *
-to_intel_frontbuffer(struct drm_framebuffer *fb)
-{
-	return fb ? to_intel_framebuffer(fb)->frontbuffer : NULL;
-}
-
 #endif /*  __INTEL_DISPLAY_TYPES_H__ */
diff --git a/drivers/gpu/drm/i915/display/intel_fb.c b/drivers/gpu/drm/i915/display/intel_fb.c
index 7cf31c87884c..56cdacf33db2 100644
--- a/drivers/gpu/drm/i915/display/intel_fb.c
+++ b/drivers/gpu/drm/i915/display/intel_fb.c
@@ -1833,8 +1833,7 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
 	if (intel_fb_uses_dpt(fb))
 		intel_dpt_destroy(intel_fb->dpt_vm);
 
-	intel_frontbuffer_put(intel_fb->frontbuffer);
-
+	drm_gem_object_put(fb->obj[0]);
 	kfree(intel_fb);
 }
 
@@ -1863,7 +1862,7 @@ static int intel_user_framebuffer_dirty(struct drm_framebuffer *fb,
 	struct drm_i915_gem_object *obj = intel_fb_obj(fb);
 
 	i915_gem_object_flush_if_display(obj);
-	intel_frontbuffer_flush(to_intel_frontbuffer(fb), ORIGIN_DIRTYFB);
+	intel_frontbuffer_flush(to_intel_framebuffer(fb), ORIGIN_DIRTYFB);
 
 	return 0;
 }
@@ -1885,10 +1884,6 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 	int ret = -EINVAL;
 	int i;
 
-	intel_fb->frontbuffer = intel_frontbuffer_get(obj);
-	if (!intel_fb->frontbuffer)
-		return -ENOMEM;
-
 	i915_gem_object_lock(obj, NULL);
 	tiling = i915_gem_object_get_tiling(obj);
 	stride = i915_gem_object_get_stride(obj);
@@ -2021,10 +2016,10 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 		goto err;
 	}
 
+	drm_gem_object_get(fb->obj[0]);
 	return 0;
 
 err:
-	intel_frontbuffer_put(intel_fb->frontbuffer);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/i915/display/intel_fb_pin.c b/drivers/gpu/drm/i915/display/intel_fb_pin.c
index 1aca7552a85d..70bce1a99a53 100644
--- a/drivers/gpu/drm/i915/display/intel_fb_pin.c
+++ b/drivers/gpu/drm/i915/display/intel_fb_pin.c
@@ -37,9 +37,6 @@ intel_pin_fb_obj_dpt(struct drm_framebuffer *fb,
 	 */
 	GEM_WARN_ON(vm->bind_async_flags);
 
-	if (WARN_ON(!i915_gem_object_is_framebuffer(obj)))
-		return ERR_PTR(-EINVAL);
-
 	alignment = 4096 * 512;
 
 	atomic_inc(&dev_priv->gpu_error.pending_fb_pin);
@@ -119,9 +116,6 @@ intel_pin_and_fence_fb_obj(struct drm_framebuffer *fb,
 	u32 alignment;
 	int ret;
 
-	if (drm_WARN_ON(dev, !i915_gem_object_is_framebuffer(obj)))
-		return ERR_PTR(-EINVAL);
-
 	if (phys_cursor)
 		alignment = intel_cursor_alignment(dev_priv);
 	else
diff --git a/drivers/gpu/drm/i915/display/intel_fbdev.c b/drivers/gpu/drm/i915/display/intel_fbdev.c
index 03ed4607a46d..8ccdf1a964ff 100644
--- a/drivers/gpu/drm/i915/display/intel_fbdev.c
+++ b/drivers/gpu/drm/i915/display/intel_fbdev.c
@@ -67,14 +67,9 @@ struct intel_fbdev {
 	struct mutex hpd_lock;
 };
 
-static struct intel_frontbuffer *to_frontbuffer(struct intel_fbdev *ifbdev)
-{
-	return ifbdev->fb->frontbuffer;
-}
-
 static void intel_fbdev_invalidate(struct intel_fbdev *ifbdev)
 {
-	intel_frontbuffer_invalidate(to_frontbuffer(ifbdev), ORIGIN_CPU);
+	intel_frontbuffer_invalidate(ifbdev->fb, ORIGIN_CPU);
 }
 
 static int intel_fbdev_set_par(struct fb_info *info)
diff --git a/drivers/gpu/drm/i915/display/intel_frontbuffer.c b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
index 17a7aa8b28c2..99d194803520 100644
--- a/drivers/gpu/drm/i915/display/intel_frontbuffer.c
+++ b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
@@ -163,11 +163,11 @@ void intel_frontbuffer_flip(struct drm_i915_private *i915,
 	frontbuffer_flush(i915, frontbuffer_bits, ORIGIN_FLIP);
 }
 
-void __intel_fb_invalidate(struct intel_frontbuffer *front,
+void __intel_fb_invalidate(struct intel_framebuffer *fb,
 			   enum fb_op_origin origin,
 			   unsigned int frontbuffer_bits)
 {
-	struct drm_i915_private *i915 = to_i915(front->obj->base.dev);
+	struct drm_i915_private *i915 = to_i915(fb->base.dev);
 
 	if (origin == ORIGIN_CS) {
 		spin_lock(&i915->display.fb_tracking.lock);
@@ -184,11 +184,11 @@ void __intel_fb_invalidate(struct intel_frontbuffer *front,
 	intel_fbc_invalidate(i915, frontbuffer_bits, origin);
 }
 
-void __intel_fb_flush(struct intel_frontbuffer *front,
+void __intel_fb_flush(struct intel_framebuffer *fb,
 		      enum fb_op_origin origin,
 		      unsigned int frontbuffer_bits)
 {
-	struct drm_i915_private *i915 = to_i915(front->obj->base.dev);
+	struct drm_i915_private *i915 = to_i915(fb->base.dev);
 
 	if (origin == ORIGIN_CS) {
 		spin_lock(&i915->display.fb_tracking.lock);
@@ -202,93 +202,6 @@ void __intel_fb_flush(struct intel_frontbuffer *front,
 		frontbuffer_flush(i915, frontbuffer_bits, origin);
 }
 
-static int frontbuffer_active(struct i915_active *ref)
-{
-	struct intel_frontbuffer *front =
-		container_of(ref, typeof(*front), write);
-
-	kref_get(&front->ref);
-	return 0;
-}
-
-static void frontbuffer_retire(struct i915_active *ref)
-{
-	struct intel_frontbuffer *front =
-		container_of(ref, typeof(*front), write);
-
-	intel_frontbuffer_flush(front, ORIGIN_CS);
-	intel_frontbuffer_put(front);
-}
-
-static void frontbuffer_release(struct kref *ref)
-	__releases(&to_i915(front->obj->base.dev)->display.fb_tracking.lock)
-{
-	struct intel_frontbuffer *front =
-		container_of(ref, typeof(*front), ref);
-	struct drm_i915_gem_object *obj = front->obj;
-	struct i915_vma *vma;
-
-	drm_WARN_ON(obj->base.dev, atomic_read(&front->bits));
-
-	spin_lock(&obj->vma.lock);
-	for_each_ggtt_vma(vma, obj) {
-		i915_vma_clear_scanout(vma);
-		vma->display_alignment = I915_GTT_MIN_ALIGNMENT;
-	}
-	spin_unlock(&obj->vma.lock);
-
-	RCU_INIT_POINTER(obj->frontbuffer, NULL);
-	spin_unlock(&to_i915(obj->base.dev)->display.fb_tracking.lock);
-
-	i915_active_fini(&front->write);
-
-	i915_gem_object_put(obj);
-	kfree_rcu(front, rcu);
-}
-
-struct intel_frontbuffer *
-intel_frontbuffer_get(struct drm_i915_gem_object *obj)
-{
-	struct drm_i915_private *i915 = to_i915(obj->base.dev);
-	struct intel_frontbuffer *front;
-
-	front = __intel_frontbuffer_get(obj);
-	if (front)
-		return front;
-
-	front = kmalloc(sizeof(*front), GFP_KERNEL);
-	if (!front)
-		return NULL;
-
-	front->obj = obj;
-	kref_init(&front->ref);
-	atomic_set(&front->bits, 0);
-	i915_active_init(&front->write,
-			 frontbuffer_active,
-			 frontbuffer_retire,
-			 I915_ACTIVE_RETIRE_SLEEPS);
-
-	spin_lock(&i915->display.fb_tracking.lock);
-	if (rcu_access_pointer(obj->frontbuffer)) {
-		kfree(front);
-		front = rcu_dereference_protected(obj->frontbuffer, true);
-		kref_get(&front->ref);
-	} else {
-		i915_gem_object_get(obj);
-		rcu_assign_pointer(obj->frontbuffer, front);
-	}
-	spin_unlock(&i915->display.fb_tracking.lock);
-
-	return front;
-}
-
-void intel_frontbuffer_put(struct intel_frontbuffer *front)
-{
-	kref_put_lock(&front->ref,
-		      frontbuffer_release,
-		      &to_i915(front->obj->base.dev)->display.fb_tracking.lock);
-}
-
 /**
  * intel_frontbuffer_track - update frontbuffer tracking
  * @old: current buffer for the frontbuffer slots
@@ -298,8 +211,8 @@ void intel_frontbuffer_put(struct intel_frontbuffer *front)
  * This updates the frontbuffer tracking bits @frontbuffer_bits by clearing them
  * from @old and setting them in @new. Both @old and @new can be NULL.
  */
-void intel_frontbuffer_track(struct intel_frontbuffer *old,
-			     struct intel_frontbuffer *new,
+void intel_frontbuffer_track(struct intel_framebuffer *old,
+			     struct intel_framebuffer *new,
 			     unsigned int frontbuffer_bits)
 {
 	/*
@@ -315,13 +228,13 @@ void intel_frontbuffer_track(struct intel_frontbuffer *old,
 	BUILD_BUG_ON(I915_MAX_PLANES > INTEL_FRONTBUFFER_BITS_PER_PIPE);
 
 	if (old) {
-		drm_WARN_ON(old->obj->base.dev,
+		drm_WARN_ON(old->base.dev,
 			    !(atomic_read(&old->bits) & frontbuffer_bits));
 		atomic_andnot(frontbuffer_bits, &old->bits);
 	}
 
 	if (new) {
-		drm_WARN_ON(new->obj->base.dev,
+		drm_WARN_ON(new->base.dev,
 			    atomic_read(&new->bits) & frontbuffer_bits);
 		atomic_or(frontbuffer_bits, &new->bits);
 	}
diff --git a/drivers/gpu/drm/i915/display/intel_frontbuffer.h b/drivers/gpu/drm/i915/display/intel_frontbuffer.h
index 3c474ed937fb..b91338651139 100644
--- a/drivers/gpu/drm/i915/display/intel_frontbuffer.h
+++ b/drivers/gpu/drm/i915/display/intel_frontbuffer.h
@@ -28,8 +28,7 @@
 #include <linux/bits.h>
 #include <linux/kref.h>
 
-#include "gem/i915_gem_object_types.h"
-#include "i915_active_types.h"
+#include "intel_display_types.h"
 
 struct drm_i915_private;
 
@@ -41,14 +40,6 @@ enum fb_op_origin {
 	ORIGIN_CURSOR_UPDATE,
 };
 
-struct intel_frontbuffer {
-	struct kref ref;
-	atomic_t bits;
-	struct i915_active write;
-	struct drm_i915_gem_object *obj;
-	struct rcu_head rcu;
-};
-
 /*
  * Frontbuffer tracking bits. Set in obj->frontbuffer_bits while a gem bo is
  * considered to be the frontbuffer for the given plane interface-wise. This
@@ -73,39 +64,7 @@ void intel_frontbuffer_flip_complete(struct drm_i915_private *i915,
 void intel_frontbuffer_flip(struct drm_i915_private *i915,
 			    unsigned frontbuffer_bits);
 
-void intel_frontbuffer_put(struct intel_frontbuffer *front);
-
-static inline struct intel_frontbuffer *
-__intel_frontbuffer_get(const struct drm_i915_gem_object *obj)
-{
-	struct intel_frontbuffer *front;
-
-	if (likely(!rcu_access_pointer(obj->frontbuffer)))
-		return NULL;
-
-	rcu_read_lock();
-	do {
-		front = rcu_dereference(obj->frontbuffer);
-		if (!front)
-			break;
-
-		if (unlikely(!kref_get_unless_zero(&front->ref)))
-			continue;
-
-		if (likely(front == rcu_access_pointer(obj->frontbuffer)))
-			break;
-
-		intel_frontbuffer_put(front);
-	} while (1);
-	rcu_read_unlock();
-
-	return front;
-}
-
-struct intel_frontbuffer *
-intel_frontbuffer_get(struct drm_i915_gem_object *obj);
-
-void __intel_fb_invalidate(struct intel_frontbuffer *front,
+void __intel_fb_invalidate(struct intel_framebuffer *front,
 			   enum fb_op_origin origin,
 			   unsigned int frontbuffer_bits);
 
@@ -120,23 +79,23 @@ void __intel_fb_invalidate(struct intel_frontbuffer *front,
  * until the rendering completes or a flip on this frontbuffer plane is
  * scheduled.
  */
-static inline bool intel_frontbuffer_invalidate(struct intel_frontbuffer *front,
+static inline bool intel_frontbuffer_invalidate(struct intel_framebuffer *fb,
 						enum fb_op_origin origin)
 {
 	unsigned int frontbuffer_bits;
 
-	if (!front)
+	if (!fb)
 		return false;
 
-	frontbuffer_bits = atomic_read(&front->bits);
+	frontbuffer_bits = atomic_read(&fb->bits);
 	if (!frontbuffer_bits)
 		return false;
 
-	__intel_fb_invalidate(front, origin, frontbuffer_bits);
+	__intel_fb_invalidate(fb, origin, frontbuffer_bits);
 	return true;
 }
 
-void __intel_fb_flush(struct intel_frontbuffer *front,
+void __intel_fb_flush(struct intel_framebuffer *fb,
 		      enum fb_op_origin origin,
 		      unsigned int frontbuffer_bits);
 
@@ -148,23 +107,23 @@ void __intel_fb_flush(struct intel_frontbuffer *front,
  * This function gets called every time rendering on the given object has
  * completed and frontbuffer caching can be started again.
  */
-static inline void intel_frontbuffer_flush(struct intel_frontbuffer *front,
+static inline void intel_frontbuffer_flush(struct intel_framebuffer *fb,
 					   enum fb_op_origin origin)
 {
 	unsigned int frontbuffer_bits;
 
-	if (!front)
+	if (!fb)
 		return;
 
-	frontbuffer_bits = atomic_read(&front->bits);
+	frontbuffer_bits = atomic_read(&fb->bits);
 	if (!frontbuffer_bits)
 		return;
 
-	__intel_fb_flush(front, origin, frontbuffer_bits);
+	__intel_fb_flush(fb, origin, frontbuffer_bits);
 }
 
-void intel_frontbuffer_track(struct intel_frontbuffer *old,
-			     struct intel_frontbuffer *new,
+void intel_frontbuffer_track(struct intel_framebuffer *old,
+			     struct intel_framebuffer *new,
 			     unsigned int frontbuffer_bits);
 
 #endif /* __INTEL_FRONTBUFFER_H__ */
diff --git a/drivers/gpu/drm/i915/display/intel_plane_initial.c b/drivers/gpu/drm/i915/display/intel_plane_initial.c
index cad9c8884af3..82a54152d731 100644
--- a/drivers/gpu/drm/i915/display/intel_plane_initial.c
+++ b/drivers/gpu/drm/i915/display/intel_plane_initial.c
@@ -281,7 +281,7 @@ intel_find_initial_plane_obj(struct intel_crtc *crtc,
 	plane_state->uapi.crtc = &crtc->base;
 	intel_plane_copy_uapi_to_hw_state(plane_state, plane_state, crtc);
 
-	atomic_or(plane->frontbuffer_bit, &to_intel_frontbuffer(fb)->bits);
+	atomic_or(plane->frontbuffer_bit, &to_intel_framebuffer(fb)->bits);
 }
 
 static void plane_config_fini(struct intel_initial_plane_config *plane_config)
-- 
2.37.3


* [RFC PATCH 13/20] drm/i915/display: Add more macros to remove all direct calls to uncore
From: Matthew Brost @ 2022-12-22 22:21 UTC
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
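
Add intel_de_pcode_{write,write_timeout,read,request}() wrappers around
the snb/skl pcode helpers, plus intel_de_write_samevalue() for
rewriting a register with the value just read back under the uncore
lock, so display code no longer needs to reach into i915->uncore
directly. Call sites are converted in the next patch; as a sketch of
the kind of conversion these wrappers enable (shape assumed from the
hsw_ips.c hunk there):

	-	snb_pcode_write(&i915->uncore, DISPLAY_IPS_CONTROL, 0);
	+	intel_de_pcode_write(i915, DISPLAY_IPS_CONTROL, 0);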

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/display/intel_de.h | 38 +++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/i915/display/intel_de.h b/drivers/gpu/drm/i915/display/intel_de.h
index 3dbd76fdabd6..3394044d281c 100644
--- a/drivers/gpu/drm/i915/display/intel_de.h
+++ b/drivers/gpu/drm/i915/display/intel_de.h
@@ -9,6 +9,7 @@
 #include "i915_drv.h"
 #include "i915_trace.h"
 #include "intel_uncore.h"
+#include "intel_pcode.h"
 
 static inline u32
 intel_de_read(struct drm_i915_private *i915, i915_reg_t reg)
@@ -116,4 +117,41 @@ intel_de_write_notrace(struct drm_i915_private *i915, i915_reg_t reg, u32 val)
 	intel_uncore_write_notrace(&i915->uncore, reg, val);
 }
 
+static inline void
+intel_de_write_samevalue(struct drm_i915_private *i915, i915_reg_t reg)
+{
+	spin_lock_irq(&i915->uncore.lock);
+	intel_de_write_fw(i915, reg, intel_de_read_fw(i915, reg));
+	spin_unlock_irq(&i915->uncore.lock);
+}
+
+static inline int
+intel_de_pcode_write_timeout(struct drm_i915_private *i915, u32 mbox, u32 val,
+			    int fast_timeout_us, int slow_timeout_ms)
+{
+	return snb_pcode_write_timeout(&i915->uncore, mbox, val,
+				       fast_timeout_us, slow_timeout_ms);
+}
+
+static inline int
+intel_de_pcode_write(struct drm_i915_private *i915, u32 mbox, u32 val)
+{
+
+	return snb_pcode_write(&i915->uncore, mbox, val);
+}
+
+static inline int
+intel_de_pcode_read(struct drm_i915_private *i915, u32 mbox, u32 *val, u32 *val1)
+{
+	return snb_pcode_read(&i915->uncore, mbox, val, val1);
+}
+
+static inline int intel_de_pcode_request(struct drm_i915_private *i915, u32 mbox,
+					 u32 request, u32 reply_mask, u32 reply,
+					 int timeout_base_ms)
+{
+	return skl_pcode_request(&i915->uncore, mbox, request, reply_mask, reply,
+				 timeout_base_ms);
+}
+
 #endif /* __INTEL_DE_H__ */
-- 
2.37.3


* [RFC PATCH 14/20] drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
From: Matthew Brost @ 2022-12-22 22:21 UTC
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
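
Mechanically convert the display code from intel_uncore_read()/write()
and the snb/skl pcode helpers to the intel_de accessors and the pcode
wrappers added in the previous patch, removing the remaining direct
uncore references (and the intel_pcode.h includes) from these files.
No functional change intended.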

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/display/hsw_ips.c        |  7 ++-
 drivers/gpu/drm/i915/display/intel_bios.c     | 25 ++++++-----
 drivers/gpu/drm/i915/display/intel_bw.c       | 34 +++++++-------
 drivers/gpu/drm/i915/display/intel_cdclk.c    | 45 +++++++++----------
 drivers/gpu/drm/i915/display/intel_display.c  |  1 -
 .../drm/i915/display/intel_display_power.c    |  3 +-
 .../i915/display/intel_display_power_well.c   |  7 ++-
 drivers/gpu/drm/i915/display/intel_dpio_phy.c |  9 ++--
 drivers/gpu/drm/i915/display/intel_hdcp.c     |  9 ++--
 drivers/gpu/drm/i915/display/intel_hdmi.c     |  1 -
 drivers/gpu/drm/i915/display/skl_watermark.c  | 23 +++++-----
 11 files changed, 78 insertions(+), 86 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/hsw_ips.c b/drivers/gpu/drm/i915/display/hsw_ips.c
index 83aa3800245f..b2253f7cfc00 100644
--- a/drivers/gpu/drm/i915/display/hsw_ips.c
+++ b/drivers/gpu/drm/i915/display/hsw_ips.c
@@ -8,7 +8,6 @@
 #include "i915_reg.h"
 #include "intel_de.h"
 #include "intel_display_types.h"
-#include "intel_pcode.h"
 
 static void hsw_ips_enable(const struct intel_crtc_state *crtc_state)
 {
@@ -28,8 +27,8 @@ static void hsw_ips_enable(const struct intel_crtc_state *crtc_state)
 
 	if (IS_BROADWELL(i915)) {
 		drm_WARN_ON(&i915->drm,
-			    snb_pcode_write(&i915->uncore, DISPLAY_IPS_CONTROL,
-					    IPS_ENABLE | IPS_PCODE_CONTROL));
+			    intel_de_pcode_write(i915, DISPLAY_IPS_CONTROL,
+						 IPS_ENABLE | IPS_PCODE_CONTROL));
 		/*
 		 * Quoting Art Runyan: "its not safe to expect any particular
 		 * value in IPS_CTL bit 31 after enabling IPS through the
@@ -62,7 +61,7 @@ bool hsw_ips_disable(const struct intel_crtc_state *crtc_state)
 
 	if (IS_BROADWELL(i915)) {
 		drm_WARN_ON(&i915->drm,
-			    snb_pcode_write(&i915->uncore, DISPLAY_IPS_CONTROL, 0));
+			    intel_de_pcode_write(i915, DISPLAY_IPS_CONTROL, 0));
 		/*
 		 * Wait for PCODE to finish disabling IPS. The BSpec specified
 		 * 42ms timeout value leads to occasional timeouts so use 100ms
diff --git a/drivers/gpu/drm/i915/display/intel_bios.c b/drivers/gpu/drm/i915/display/intel_bios.c
index 55544d484318..755e56f9db6c 100644
--- a/drivers/gpu/drm/i915/display/intel_bios.c
+++ b/drivers/gpu/drm/i915/display/intel_bios.c
@@ -29,9 +29,10 @@
 #include <drm/display/drm_dp_helper.h>
 #include <drm/display/drm_dsc_helper.h>
 
-#include "display/intel_display.h"
-#include "display/intel_display_types.h"
-#include "display/intel_gmbus.h"
+#include "intel_de.h"
+#include "intel_display.h"
+#include "intel_display_types.h"
+#include "intel_gmbus.h"
 
 #include "i915_drv.h"
 #include "i915_reg.h"
@@ -3001,16 +3002,16 @@ static struct vbt_header *spi_oprom_get_vbt(struct drm_i915_private *i915)
 	u16 vbt_size;
 	u32 *vbt;
 
-	static_region = intel_uncore_read(&i915->uncore, SPI_STATIC_REGIONS);
+	static_region = intel_de_read(i915, SPI_STATIC_REGIONS);
 	static_region &= OPTIONROM_SPI_REGIONID_MASK;
-	intel_uncore_write(&i915->uncore, PRIMARY_SPI_REGIONID, static_region);
+	intel_de_write(i915, PRIMARY_SPI_REGIONID, static_region);
 
-	oprom_offset = intel_uncore_read(&i915->uncore, OROM_OFFSET);
+	oprom_offset = intel_de_read(i915, OROM_OFFSET);
 	oprom_offset &= OROM_OFFSET_MASK;
 
 	for (count = 0; count < oprom_size; count += 4) {
-		intel_uncore_write(&i915->uncore, PRIMARY_SPI_ADDRESS, oprom_offset + count);
-		data = intel_uncore_read(&i915->uncore, PRIMARY_SPI_TRIGGER);
+		intel_de_write(i915, PRIMARY_SPI_ADDRESS, oprom_offset + count);
+		data = intel_de_read(i915, PRIMARY_SPI_TRIGGER);
 
 		if (data == *((const u32 *)"$VBT")) {
 			found = oprom_offset + count;
@@ -3022,9 +3023,9 @@ static struct vbt_header *spi_oprom_get_vbt(struct drm_i915_private *i915)
 		goto err_not_found;
 
 	/* Get VBT size and allocate space for the VBT */
-	intel_uncore_write(&i915->uncore, PRIMARY_SPI_ADDRESS, found +
+	intel_de_write(i915, PRIMARY_SPI_ADDRESS, found +
 		   offsetof(struct vbt_header, vbt_size));
-	vbt_size = intel_uncore_read(&i915->uncore, PRIMARY_SPI_TRIGGER);
+	vbt_size = intel_de_read(i915, PRIMARY_SPI_TRIGGER);
 	vbt_size &= 0xffff;
 
 	vbt = kzalloc(round_up(vbt_size, 4), GFP_KERNEL);
@@ -3032,8 +3033,8 @@ static struct vbt_header *spi_oprom_get_vbt(struct drm_i915_private *i915)
 		goto err_not_found;
 
 	for (count = 0; count < vbt_size; count += 4) {
-		intel_uncore_write(&i915->uncore, PRIMARY_SPI_ADDRESS, found + count);
-		data = intel_uncore_read(&i915->uncore, PRIMARY_SPI_TRIGGER);
+		intel_de_write(i915, PRIMARY_SPI_ADDRESS, found + count);
+		data = intel_de_read(i915, PRIMARY_SPI_TRIGGER);
 		*(vbt + store++) = data;
 	}
 
diff --git a/drivers/gpu/drm/i915/display/intel_bw.c b/drivers/gpu/drm/i915/display/intel_bw.c
index 1c236f02b380..54e03a3eaa0f 100644
--- a/drivers/gpu/drm/i915/display/intel_bw.c
+++ b/drivers/gpu/drm/i915/display/intel_bw.c
@@ -11,11 +11,11 @@
 #include "intel_atomic.h"
 #include "intel_bw.h"
 #include "intel_cdclk.h"
+#include "intel_de.h"
 #include "intel_display_core.h"
 #include "intel_display_types.h"
 #include "skl_watermark.h"
 #include "intel_mchbar_regs.h"
-#include "intel_pcode.h"
 
 /* Parameters for Qclk Geyserville (QGV) */
 struct intel_qgv_point {
@@ -44,7 +44,7 @@ static int dg1_mchbar_read_qgv_point_info(struct drm_i915_private *dev_priv,
 	u32 dclk_ratio, dclk_reference;
 	u32 val;
 
-	val = intel_uncore_read(&dev_priv->uncore, SA_PERF_STATUS_0_0_0_MCHBAR_PC);
+	val = intel_de_read(dev_priv, SA_PERF_STATUS_0_0_0_MCHBAR_PC);
 	dclk_ratio = REG_FIELD_GET(DG1_QCLK_RATIO_MASK, val);
 	if (val & DG1_QCLK_REFERENCE)
 		dclk_reference = 6; /* 6 * 16.666 MHz = 100 MHz */
@@ -52,18 +52,18 @@ static int dg1_mchbar_read_qgv_point_info(struct drm_i915_private *dev_priv,
 		dclk_reference = 8; /* 8 * 16.666 MHz = 133 MHz */
 	sp->dclk = DIV_ROUND_UP((16667 * dclk_ratio * dclk_reference) + 500, 1000);
 
-	val = intel_uncore_read(&dev_priv->uncore, SKL_MC_BIOS_DATA_0_0_0_MCHBAR_PCU);
+	val = intel_de_read(dev_priv, SKL_MC_BIOS_DATA_0_0_0_MCHBAR_PCU);
 	if (val & DG1_GEAR_TYPE)
 		sp->dclk *= 2;
 
 	if (sp->dclk == 0)
 		return -EINVAL;
 
-	val = intel_uncore_read(&dev_priv->uncore, MCHBAR_CH0_CR_TC_PRE_0_0_0_MCHBAR);
+	val = intel_de_read(dev_priv, MCHBAR_CH0_CR_TC_PRE_0_0_0_MCHBAR);
 	sp->t_rp = REG_FIELD_GET(DG1_DRAM_T_RP_MASK, val);
 	sp->t_rdpre = REG_FIELD_GET(DG1_DRAM_T_RDPRE_MASK, val);
 
-	val = intel_uncore_read(&dev_priv->uncore, MCHBAR_CH0_CR_TC_PRE_0_0_0_MCHBAR_HIGH);
+	val = intel_de_read(dev_priv, MCHBAR_CH0_CR_TC_PRE_0_0_0_MCHBAR_HIGH);
 	sp->t_rcd = REG_FIELD_GET(DG1_DRAM_T_RCD_MASK, val);
 	sp->t_ras = REG_FIELD_GET(DG1_DRAM_T_RAS_MASK, val);
 
@@ -80,9 +80,9 @@ static int icl_pcode_read_qgv_point_info(struct drm_i915_private *dev_priv,
 	u16 dclk;
 	int ret;
 
-	ret = snb_pcode_read(&dev_priv->uncore, ICL_PCODE_MEM_SUBSYSYSTEM_INFO |
-			     ICL_PCODE_MEM_SS_READ_QGV_POINT_INFO(point),
-			     &val, &val2);
+	ret = intel_de_pcode_read(dev_priv, ICL_PCODE_MEM_SUBSYSYSTEM_INFO |
+				  ICL_PCODE_MEM_SS_READ_QGV_POINT_INFO(point),
+				  &val, &val2);
 	if (ret)
 		return ret;
 
@@ -106,8 +106,8 @@ static int adls_pcode_read_psf_gv_point_info(struct drm_i915_private *dev_priv,
 	int ret;
 	int i;
 
-	ret = snb_pcode_read(&dev_priv->uncore, ICL_PCODE_MEM_SUBSYSYSTEM_INFO |
-			     ADL_PCODE_MEM_SS_READ_PSF_GV_INFO, &val, NULL);
+	ret = intel_de_pcode_read(dev_priv, ICL_PCODE_MEM_SUBSYSYSTEM_INFO |
+				  ADL_PCODE_MEM_SS_READ_PSF_GV_INFO, &val, NULL);
 	if (ret)
 		return ret;
 
@@ -125,11 +125,11 @@ int icl_pcode_restrict_qgv_points(struct drm_i915_private *dev_priv,
 	int ret;
 
 	/* bspec says to keep retrying for at least 1 ms */
-	ret = skl_pcode_request(&dev_priv->uncore, ICL_PCODE_SAGV_DE_MEM_SS_CONFIG,
-				points_mask,
-				ICL_PCODE_REP_QGV_MASK | ADLS_PCODE_REP_PSF_MASK,
-				ICL_PCODE_REP_QGV_SAFE | ADLS_PCODE_REP_PSF_SAFE,
-				1);
+	ret = intel_de_pcode_request(dev_priv, ICL_PCODE_SAGV_DE_MEM_SS_CONFIG,
+				     points_mask,
+				     ICL_PCODE_REP_QGV_MASK | ADLS_PCODE_REP_PSF_MASK,
+				     ICL_PCODE_REP_QGV_SAFE | ADLS_PCODE_REP_PSF_SAFE,
+				     1);
 
 	if (ret < 0) {
 		drm_err(&dev_priv->drm, "Failed to disable qgv points (%d) points: 0x%x\n", ret, points_mask);
@@ -145,9 +145,9 @@ static int mtl_read_qgv_point_info(struct drm_i915_private *dev_priv,
 	u32 val, val2;
 	u16 dclk;
 
-	val = intel_uncore_read(&dev_priv->uncore,
+	val = intel_de_read(dev_priv,
 				MTL_MEM_SS_INFO_QGV_POINT_LOW(point));
-	val2 = intel_uncore_read(&dev_priv->uncore,
+	val2 = intel_de_read(dev_priv,
 				 MTL_MEM_SS_INFO_QGV_POINT_HIGH(point));
 	dclk = REG_FIELD_GET(MTL_DCLK_MASK, val);
 	sp->dclk = DIV_ROUND_UP((16667 * dclk), 1000);
diff --git a/drivers/gpu/drm/i915/display/intel_cdclk.c b/drivers/gpu/drm/i915/display/intel_cdclk.c
index 0c107a38f9d0..80e2db6b5ea4 100644
--- a/drivers/gpu/drm/i915/display/intel_cdclk.c
+++ b/drivers/gpu/drm/i915/display/intel_cdclk.c
@@ -35,7 +35,6 @@
 #include "intel_display_types.h"
 #include "intel_mchbar_regs.h"
 #include "intel_pci_config.h"
-#include "intel_pcode.h"
 #include "intel_psr.h"
 #include "vlv_sideband.h"
 
@@ -801,7 +800,7 @@ static void bdw_set_cdclk(struct drm_i915_private *dev_priv,
 		     "trying to change cdclk frequency with cdclk not enabled\n"))
 		return;
 
-	ret = snb_pcode_write(&dev_priv->uncore, BDW_PCODE_DISPLAY_FREQ_CHANGE_REQ, 0x0);
+	ret = intel_de_pcode_write(dev_priv, BDW_PCODE_DISPLAY_FREQ_CHANGE_REQ, 0x0);
 	if (ret) {
 		drm_err(&dev_priv->drm,
 			"failed to inform pcode about cdclk change\n");
@@ -829,8 +828,8 @@ static void bdw_set_cdclk(struct drm_i915_private *dev_priv,
 			 LCPLL_CD_SOURCE_FCLK_DONE) == 0, 1))
 		drm_err(&dev_priv->drm, "Switching back to LCPLL failed\n");
 
-	snb_pcode_write(&dev_priv->uncore, HSW_PCODE_DE_WRITE_FREQ_REQ,
-			cdclk_config->voltage_level);
+	intel_de_pcode_write(dev_priv, HSW_PCODE_DE_WRITE_FREQ_REQ,
+			     cdclk_config->voltage_level);
 
 	intel_de_write(dev_priv, CDCLK_FREQ,
 		       DIV_ROUND_CLOSEST(cdclk, 1000) - 1);
@@ -1087,10 +1086,10 @@ static void skl_set_cdclk(struct drm_i915_private *dev_priv,
 	drm_WARN_ON_ONCE(&dev_priv->drm,
 			 IS_SKYLAKE(dev_priv) && vco == 8640000);
 
-	ret = skl_pcode_request(&dev_priv->uncore, SKL_PCODE_CDCLK_CONTROL,
-				SKL_CDCLK_PREPARE_FOR_CHANGE,
-				SKL_CDCLK_READY_FOR_CHANGE,
-				SKL_CDCLK_READY_FOR_CHANGE, 3);
+	ret = intel_de_pcode_request(dev_priv, SKL_PCODE_CDCLK_CONTROL,
+				     SKL_CDCLK_PREPARE_FOR_CHANGE,
+				     SKL_CDCLK_READY_FOR_CHANGE,
+				     SKL_CDCLK_READY_FOR_CHANGE, 3);
 	if (ret) {
 		drm_err(&dev_priv->drm,
 			"Failed to inform PCU about cdclk change (%d)\n", ret);
@@ -1133,8 +1132,8 @@ static void skl_set_cdclk(struct drm_i915_private *dev_priv,
 	intel_de_posting_read(dev_priv, CDCLK_CTL);
 
 	/* inform PCU of the change */
-	snb_pcode_write(&dev_priv->uncore, SKL_PCODE_CDCLK_CONTROL,
-			cdclk_config->voltage_level);
+	intel_de_pcode_write(dev_priv, SKL_PCODE_CDCLK_CONTROL,
+			     cdclk_config->voltage_level);
 
 	intel_update_cdclk(dev_priv);
 }
@@ -1864,18 +1863,18 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv,
 	if (DISPLAY_VER(dev_priv) >= 14)
 		/* NOOP */;
 	else if (DISPLAY_VER(dev_priv) >= 11)
-		ret = skl_pcode_request(&dev_priv->uncore, SKL_PCODE_CDCLK_CONTROL,
-					SKL_CDCLK_PREPARE_FOR_CHANGE,
-					SKL_CDCLK_READY_FOR_CHANGE,
-					SKL_CDCLK_READY_FOR_CHANGE, 3);
+		ret = intel_de_pcode_request(dev_priv, SKL_PCODE_CDCLK_CONTROL,
+					     SKL_CDCLK_PREPARE_FOR_CHANGE,
+					     SKL_CDCLK_READY_FOR_CHANGE,
+					     SKL_CDCLK_READY_FOR_CHANGE, 3);
 	else
 		/*
 		 * BSpec requires us to wait up to 150usec, but that leads to
 		 * timeouts; the 2ms used here is based on experiment.
 		 */
-		ret = snb_pcode_write_timeout(&dev_priv->uncore,
-					      HSW_PCODE_DE_WRITE_FREQ_REQ,
-					      0x80000000, 150, 2);
+		ret = intel_de_pcode_write_timeout(dev_priv,
+						   HSW_PCODE_DE_WRITE_FREQ_REQ,
+						   0x80000000, 150, 2);
 
 	if (ret) {
 		drm_err(&dev_priv->drm,
@@ -1898,8 +1897,8 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv,
 		 * Display versions 14 and beyond
 		 */;
 	else if (DISPLAY_VER(dev_priv) >= 11)
-		ret = snb_pcode_write(&dev_priv->uncore, SKL_PCODE_CDCLK_CONTROL,
-				      cdclk_config->voltage_level);
+		ret = intel_de_pcode_write(dev_priv, SKL_PCODE_CDCLK_CONTROL,
+					   cdclk_config->voltage_level);
 	else
 		/*
 		 * The timeout isn't specified, the 2ms used here is based on
@@ -1907,10 +1906,10 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv,
 		 * FIXME: Waiting for the request completion could be delayed
 		 * until the next PCODE request based on BSpec.
 		 */
-		ret = snb_pcode_write_timeout(&dev_priv->uncore,
-					      HSW_PCODE_DE_WRITE_FREQ_REQ,
-					      cdclk_config->voltage_level,
-					      150, 2);
+		ret = intel_de_pcode_write_timeout(dev_priv,
+						   HSW_PCODE_DE_WRITE_FREQ_REQ,
+						   cdclk_config->voltage_level,
+						   150, 2);
 
 	if (ret) {
 		drm_err(&dev_priv->drm,
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index 7a6191cad52a..ef9bab4043ee 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -105,7 +105,6 @@
 #include "intel_panel.h"
 #include "intel_pch_display.h"
 #include "intel_pch_refclk.h"
-#include "intel_pcode.h"
 #include "intel_pipe_crc.h"
 #include "intel_plane_initial.h"
 #include "intel_pm.h"
diff --git a/drivers/gpu/drm/i915/display/intel_display_power.c b/drivers/gpu/drm/i915/display/intel_display_power.c
index 1a23ecd4623a..fd0fedb65e42 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power.c
@@ -18,7 +18,6 @@
 #include "intel_dmc.h"
 #include "intel_mchbar_regs.h"
 #include "intel_pch_refclk.h"
-#include "intel_pcode.h"
 #include "intel_snps_phy.h"
 #include "skl_watermark.h"
 #include "vlv_sideband.h"
@@ -1206,7 +1205,7 @@ static u32 hsw_read_dcomp(struct drm_i915_private *dev_priv)
 static void hsw_write_dcomp(struct drm_i915_private *dev_priv, u32 val)
 {
 	if (IS_HASWELL(dev_priv)) {
-		if (snb_pcode_write(&dev_priv->uncore, GEN6_PCODE_WRITE_D_COMP, val))
+		if (intel_de_pcode_write(dev_priv, GEN6_PCODE_WRITE_D_COMP, val))
 			drm_dbg_kms(&dev_priv->drm,
 				    "Failed to write to D_COMP\n");
 	} else {
diff --git a/drivers/gpu/drm/i915/display/intel_display_power_well.c b/drivers/gpu/drm/i915/display/intel_display_power_well.c
index 8710dd41ffd4..a1d75956ae97 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power_well.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power_well.c
@@ -18,7 +18,6 @@
 #include "intel_dpio_phy.h"
 #include "intel_dpll.h"
 #include "intel_hotplug.h"
-#include "intel_pcode.h"
 #include "intel_pps.h"
 #include "intel_tc.h"
 #include "intel_vga.h"
@@ -477,8 +476,8 @@ static void icl_tc_cold_exit(struct drm_i915_private *i915)
 	int ret, tries = 0;
 
 	while (1) {
-		ret = snb_pcode_write_timeout(&i915->uncore, ICL_PCODE_EXIT_TCCOLD, 0,
-					      250, 1);
+		ret = intel_de_pcode_write_timeout(i915, ICL_PCODE_EXIT_TCCOLD, 0,
+						   250, 1);
 		if (ret != -EAGAIN || ++tries == 3)
 			break;
 		msleep(1);
@@ -1740,7 +1739,7 @@ tgl_tc_cold_request(struct drm_i915_private *i915, bool block)
 		 * Spec states that we should timeout the request after 200us
 		 * but the function below will timeout after 500us
 		 */
-		ret = snb_pcode_read(&i915->uncore, TGL_PCODE_TCCOLD, &low_val, &high_val);
+		ret = intel_de_pcode_read(i915, TGL_PCODE_TCCOLD, &low_val, &high_val);
 		if (ret == 0) {
 			if (block &&
 			    (low_val & TGL_PCODE_EXIT_TCCOLD_DATA_L_EXIT_FAILED))
diff --git a/drivers/gpu/drm/i915/display/intel_dpio_phy.c b/drivers/gpu/drm/i915/display/intel_dpio_phy.c
index 7eb7440b3180..25bea6d2da67 100644
--- a/drivers/gpu/drm/i915/display/intel_dpio_phy.c
+++ b/drivers/gpu/drm/i915/display/intel_dpio_phy.c
@@ -401,11 +401,10 @@ static void _bxt_ddi_phy_init(struct drm_i915_private *dev_priv,
 	 * The flag should get set in 100us according to the HW team, but
 	 * use 1ms due to occasional timeouts observed with that.
 	 */
-	if (intel_wait_for_register_fw(&dev_priv->uncore,
-				       BXT_PORT_CL1CM_DW0(phy),
-				       PHY_RESERVED | PHY_POWER_GOOD,
-				       PHY_POWER_GOOD,
-				       1))
+	if (intel_de_wait_for_register_fw(dev_priv,
+					  BXT_PORT_CL1CM_DW0(phy),
+					  PHY_RESERVED | PHY_POWER_GOOD,
+					  PHY_POWER_GOOD, 1))
 		drm_err(&dev_priv->drm, "timeout during PHY%d power on\n",
 			phy);
 
diff --git a/drivers/gpu/drm/i915/display/intel_hdcp.c b/drivers/gpu/drm/i915/display/intel_hdcp.c
index 6406fd487ee5..e62f64dd481f 100644
--- a/drivers/gpu/drm/i915/display/intel_hdcp.c
+++ b/drivers/gpu/drm/i915/display/intel_hdcp.c
@@ -24,7 +24,6 @@
 #include "intel_display_types.h"
 #include "intel_hdcp.h"
 #include "intel_hdcp_regs.h"
-#include "intel_pcode.h"
 
 #define KEY_LOAD_TRIES	5
 #define HDCP2_LC_RETRY_CNT			3
@@ -321,7 +320,7 @@ static int intel_hdcp_load_keys(struct drm_i915_private *dev_priv)
 	 * Mailbox interface.
 	 */
 	if (DISPLAY_VER(dev_priv) == 9 && !IS_BROXTON(dev_priv)) {
-		ret = snb_pcode_write(&dev_priv->uncore, SKL_PCODE_LOAD_HDCP_KEYS, 1);
+		ret = intel_de_pcode_write(dev_priv, SKL_PCODE_LOAD_HDCP_KEYS, 1);
 		if (ret) {
 			drm_err(&dev_priv->drm,
 				"Failed to initiate HDCP key load (%d)\n",
@@ -333,9 +332,9 @@ static int intel_hdcp_load_keys(struct drm_i915_private *dev_priv)
 	}
 
 	/* Wait for the keys to load (500us) */
-	ret = __intel_wait_for_register(&dev_priv->uncore, HDCP_KEY_STATUS,
-					HDCP_KEY_LOAD_DONE, HDCP_KEY_LOAD_DONE,
-					10, 1, &val);
+	ret = __intel_de_wait_for_register(dev_priv, HDCP_KEY_STATUS,
+					   HDCP_KEY_LOAD_DONE, HDCP_KEY_LOAD_DONE,
+					   10, 1, &val);
 	if (ret)
 		return ret;
 	else if (!(val & HDCP_KEY_LOAD_STATUS))
diff --git a/drivers/gpu/drm/i915/display/intel_hdmi.c b/drivers/gpu/drm/i915/display/intel_hdmi.c
index efa2da080f62..6d097a11f939 100644
--- a/drivers/gpu/drm/i915/display/intel_hdmi.c
+++ b/drivers/gpu/drm/i915/display/intel_hdmi.c
@@ -40,7 +40,6 @@
 #include <drm/drm_edid.h>
 #include <drm/intel_lpe_audio.h>
 
-#include "i915_debugfs.h"
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "intel_atomic.h"
diff --git a/drivers/gpu/drm/i915/display/skl_watermark.c b/drivers/gpu/drm/i915/display/skl_watermark.c
index ae4e9e680c2e..e254fb21b47f 100644
--- a/drivers/gpu/drm/i915/display/skl_watermark.c
+++ b/drivers/gpu/drm/i915/display/skl_watermark.c
@@ -18,7 +18,6 @@
 #include "i915_drv.h"
 #include "i915_fixed.h"
 #include "i915_reg.h"
-#include "intel_pcode.h"
 #include "intel_pm.h"
 
 static void skl_sagv_disable(struct drm_i915_private *i915);
@@ -81,9 +80,9 @@ intel_sagv_block_time(struct drm_i915_private *i915)
 		u32 val = 0;
 		int ret;
 
-		ret = snb_pcode_read(&i915->uncore,
-				     GEN12_PCODE_READ_SAGV_BLOCK_TIME_US,
-				     &val, NULL);
+		ret = intel_de_pcode_read(i915,
+					  GEN12_PCODE_READ_SAGV_BLOCK_TIME_US,
+					  &val, NULL);
 		if (ret) {
 			drm_dbg_kms(&i915->drm, "Couldn't read SAGV block time!\n");
 			return 0;
@@ -150,8 +149,8 @@ static void skl_sagv_enable(struct drm_i915_private *i915)
 		return;
 
 	drm_dbg_kms(&i915->drm, "Enabling SAGV\n");
-	ret = snb_pcode_write(&i915->uncore, GEN9_PCODE_SAGV_CONTROL,
-			      GEN9_SAGV_ENABLE);
+	ret = intel_de_pcode_write(i915, GEN9_PCODE_SAGV_CONTROL,
+				   GEN9_SAGV_ENABLE);
 
 	/* We don't need to wait for SAGV when enabling */
 
@@ -183,10 +182,10 @@ static void skl_sagv_disable(struct drm_i915_private *i915)
 
 	drm_dbg_kms(&i915->drm, "Disabling SAGV\n");
 	/* bspec says to keep retrying for at least 1 ms */
-	ret = skl_pcode_request(&i915->uncore, GEN9_PCODE_SAGV_CONTROL,
-				GEN9_SAGV_DISABLE,
-				GEN9_SAGV_IS_DISABLED, GEN9_SAGV_IS_DISABLED,
-				1);
+	ret = intel_de_pcode_request(i915, GEN9_PCODE_SAGV_CONTROL,
+				     GEN9_SAGV_DISABLE,
+				     GEN9_SAGV_IS_DISABLED, GEN9_SAGV_IS_DISABLED,
+				     1);
 	/*
 	 * Some skl systems, pre-release machines in particular,
 	 * don't actually have SAGV.
@@ -3224,7 +3223,7 @@ static void skl_read_wm_latency(struct drm_i915_private *i915, u16 wm[])
 
 	/* read the first set of memory latencies[0:3] */
 	val = 0; /* data0 to be programmed to 0 for first set */
-	ret = snb_pcode_read(&i915->uncore, GEN9_PCODE_READ_MEM_LATENCY, &val, NULL);
+	ret = intel_de_pcode_read(i915, GEN9_PCODE_READ_MEM_LATENCY, &val, NULL);
 	if (ret) {
 		drm_err(&i915->drm, "SKL Mailbox read error = %d\n", ret);
 		return;
@@ -3237,7 +3236,7 @@ static void skl_read_wm_latency(struct drm_i915_private *i915, u16 wm[])
 
 	/* read the second set of memory latencies[4:7] */
 	val = 1; /* data0 to be programmed to 1 for second set */
-	ret = snb_pcode_read(&i915->uncore, GEN9_PCODE_READ_MEM_LATENCY, &val, NULL);
+	ret = intel_de_pcode_read(i915, GEN9_PCODE_READ_MEM_LATENCY, &val, NULL);
 	if (ret) {
 		drm_err(&i915->drm, "SKL Mailbox read error = %d\n", ret);
 		return;
-- 
2.37.3
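
For readers jumping into the series here: the intel_de_pcode_*() and
intel_de_wait_for_register*() helpers this patch converts to are introduced
earlier in the series, not in this patch. Judging purely from the call sites
above, they are presumably thin wrappers that forward to the existing
uncore-based routines. A sketch under that assumption (not the actual helper
definitions):

	static inline int
	intel_de_pcode_write(struct drm_i915_private *i915, u32 mbox, u32 val)
	{
		return snb_pcode_write(&i915->uncore, mbox, val);
	}

	static inline int
	intel_de_pcode_read(struct drm_i915_private *i915, u32 mbox,
			    u32 *val, u32 *val1)
	{
		return snb_pcode_read(&i915->uncore, mbox, val, val1);
	}

Under that assumption every hunk in this patch is mechanical: the
&i915->uncore (or &dev_priv->uncore) argument moves into the wrapper and the
call sites pass the device pointer directly, removing the display code's
direct uncore accesses.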


* [Intel-gfx] [RFC PATCH 15/20] drm/i915: Rename find_section to find_bdb_section
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

This prevents a namespace collision on other archs.
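
A hypothetical illustration of the clash a generic static name can still
cause (the exact conflicting declaration lives in arch code and is not shown
here):

	/* hypothetical arch header pulled in on non-x86 builds */
	extern const void *find_section(const char *name);

	/* intel_bios.c before this patch */
	static const void *find_section(struct drm_i915_private *i915,
					enum bdb_block_id section_id);
	/* error: static declaration of 'find_section' follows
	 * non-static declaration */

Prefixing the name with bdb_ keeps it out of the way of any such
declaration.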

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/display/intel_bios.c | 46 +++++++++++------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_bios.c b/drivers/gpu/drm/i915/display/intel_bios.c
index 755e56f9db6c..7fd96b409d48 100644
--- a/drivers/gpu/drm/i915/display/intel_bios.c
+++ b/drivers/gpu/drm/i915/display/intel_bios.c
@@ -143,8 +143,8 @@ struct bdb_block_entry {
 };
 
 static const void *
-find_section(struct drm_i915_private *i915,
-	     enum bdb_block_id section_id)
+bdb_find_section(struct drm_i915_private *i915,
+		 enum bdb_block_id section_id)
 {
 	struct bdb_block_entry *entry;
 
@@ -203,7 +203,7 @@ static size_t lfp_data_min_size(struct drm_i915_private *i915)
 	const struct bdb_lvds_lfp_data_ptrs *ptrs;
 	size_t size;
 
-	ptrs = find_section(i915, BDB_LVDS_LFP_DATA_PTRS);
+	ptrs = bdb_find_section(i915, BDB_LVDS_LFP_DATA_PTRS);
 	if (!ptrs)
 		return 0;
 
@@ -632,7 +632,7 @@ static int vbt_get_panel_type(struct drm_i915_private *i915,
 {
 	const struct bdb_lvds_options *lvds_options;
 
-	lvds_options = find_section(i915, BDB_LVDS_OPTIONS);
+	lvds_options = bdb_find_section(i915, BDB_LVDS_OPTIONS);
 	if (!lvds_options)
 		return -1;
 
@@ -672,11 +672,11 @@ static int pnpid_get_panel_type(struct drm_i915_private *i915,
 
 	dump_pnp_id(i915, edid_id, "EDID");
 
-	ptrs = find_section(i915, BDB_LVDS_LFP_DATA_PTRS);
+	ptrs = bdb_find_section(i915, BDB_LVDS_LFP_DATA_PTRS);
 	if (!ptrs)
 		return -1;
 
-	data = find_section(i915, BDB_LVDS_LFP_DATA);
+	data = bdb_find_section(i915, BDB_LVDS_LFP_DATA);
 	if (!data)
 		return -1;
 
@@ -792,7 +792,7 @@ parse_panel_options(struct drm_i915_private *i915,
 	int panel_type = panel->vbt.panel_type;
 	int drrs_mode;
 
-	lvds_options = find_section(i915, BDB_LVDS_OPTIONS);
+	lvds_options = bdb_find_section(i915, BDB_LVDS_OPTIONS);
 	if (!lvds_options)
 		return;
 
@@ -882,11 +882,11 @@ parse_lfp_data(struct drm_i915_private *i915,
 	const struct lvds_pnp_id *pnp_id;
 	int panel_type = panel->vbt.panel_type;
 
-	ptrs = find_section(i915, BDB_LVDS_LFP_DATA_PTRS);
+	ptrs = bdb_find_section(i915, BDB_LVDS_LFP_DATA_PTRS);
 	if (!ptrs)
 		return;
 
-	data = find_section(i915, BDB_LVDS_LFP_DATA);
+	data = bdb_find_section(i915, BDB_LVDS_LFP_DATA);
 	if (!data)
 		return;
 
@@ -933,7 +933,7 @@ parse_generic_dtd(struct drm_i915_private *i915,
 	if (i915->display.vbt.version < 229)
 		return;
 
-	generic_dtd = find_section(i915, BDB_GENERIC_DTD);
+	generic_dtd = bdb_find_section(i915, BDB_GENERIC_DTD);
 	if (!generic_dtd)
 		return;
 
@@ -1012,7 +1012,7 @@ parse_lfp_backlight(struct drm_i915_private *i915,
 	int panel_type = panel->vbt.panel_type;
 	u16 level;
 
-	backlight_data = find_section(i915, BDB_LVDS_BACKLIGHT);
+	backlight_data = bdb_find_section(i915, BDB_LVDS_BACKLIGHT);
 	if (!backlight_data)
 		return;
 
@@ -1113,14 +1113,14 @@ parse_sdvo_panel_data(struct drm_i915_private *i915,
 	if (index == -1) {
 		const struct bdb_sdvo_lvds_options *sdvo_lvds_options;
 
-		sdvo_lvds_options = find_section(i915, BDB_SDVO_LVDS_OPTIONS);
+		sdvo_lvds_options = bdb_find_section(i915, BDB_SDVO_LVDS_OPTIONS);
 		if (!sdvo_lvds_options)
 			return;
 
 		index = sdvo_lvds_options->panel_type;
 	}
 
-	dtds = find_section(i915, BDB_SDVO_PANEL_DTDS);
+	dtds = bdb_find_section(i915, BDB_SDVO_PANEL_DTDS);
 	if (!dtds)
 		return;
 
@@ -1156,7 +1156,7 @@ parse_general_features(struct drm_i915_private *i915)
 {
 	const struct bdb_general_features *general;
 
-	general = find_section(i915, BDB_GENERAL_FEATURES);
+	general = bdb_find_section(i915, BDB_GENERAL_FEATURES);
 	if (!general)
 		return;
 
@@ -1280,7 +1280,7 @@ parse_driver_features(struct drm_i915_private *i915)
 {
 	const struct bdb_driver_features *driver;
 
-	driver = find_section(i915, BDB_DRIVER_FEATURES);
+	driver = bdb_find_section(i915, BDB_DRIVER_FEATURES);
 	if (!driver)
 		return;
 
@@ -1317,7 +1317,7 @@ parse_panel_driver_features(struct drm_i915_private *i915,
 {
 	const struct bdb_driver_features *driver;
 
-	driver = find_section(i915, BDB_DRIVER_FEATURES);
+	driver = bdb_find_section(i915, BDB_DRIVER_FEATURES);
 	if (!driver)
 		return;
 
@@ -1357,7 +1357,7 @@ parse_power_conservation_features(struct drm_i915_private *i915,
 	if (i915->display.vbt.version < 228)
 		return;
 
-	power = find_section(i915, BDB_LFP_POWER);
+	power = bdb_find_section(i915, BDB_LFP_POWER);
 	if (!power)
 		return;
 
@@ -1397,7 +1397,7 @@ parse_edp(struct drm_i915_private *i915,
 	const struct edp_fast_link_params *edp_link_params;
 	int panel_type = panel->vbt.panel_type;
 
-	edp = find_section(i915, BDB_EDP);
+	edp = bdb_find_section(i915, BDB_EDP);
 	if (!edp)
 		return;
 
@@ -1527,7 +1527,7 @@ parse_psr(struct drm_i915_private *i915,
 	const struct psr_table *psr_table;
 	int panel_type = panel->vbt.panel_type;
 
-	psr = find_section(i915, BDB_PSR);
+	psr = bdb_find_section(i915, BDB_PSR);
 	if (!psr) {
 		drm_dbg_kms(&i915->drm, "No PSR BDB found.\n");
 		return;
@@ -1688,7 +1688,7 @@ parse_mipi_config(struct drm_i915_private *i915,
 	/* Parse #52 for panel index used from panel_type already
 	 * parsed
 	 */
-	start = find_section(i915, BDB_MIPI_CONFIG);
+	start = bdb_find_section(i915, BDB_MIPI_CONFIG);
 	if (!start) {
 		drm_dbg_kms(&i915->drm, "No MIPI config BDB found");
 		return;
@@ -2000,7 +2000,7 @@ parse_mipi_sequence(struct drm_i915_private *i915,
 	if (panel->vbt.dsi.panel_id != MIPI_DSI_GENERIC_PANEL_ID)
 		return;
 
-	sequence = find_section(i915, BDB_MIPI_SEQUENCE);
+	sequence = bdb_find_section(i915, BDB_MIPI_SEQUENCE);
 	if (!sequence) {
 		drm_dbg_kms(&i915->drm,
 			    "No MIPI Sequence found, parsing complete\n");
@@ -2082,7 +2082,7 @@ parse_compression_parameters(struct drm_i915_private *i915)
 	if (i915->display.vbt.version < 198)
 		return;
 
-	params = find_section(i915, BDB_COMPRESSION_PARAMETERS);
+	params = bdb_find_section(i915, BDB_COMPRESSION_PARAMETERS);
 	if (params) {
 		/* Sanity checks */
 		if (params->entry_size != sizeof(params->data[0])) {
@@ -2756,7 +2756,7 @@ parse_general_definitions(struct drm_i915_private *i915)
 	u16 block_size;
 	int bus_pin;
 
-	defs = find_section(i915, BDB_GENERAL_DEFINITIONS);
+	defs = bdb_find_section(i915, BDB_GENERAL_DEFINITIONS);
 	if (!defs) {
 		drm_dbg_kms(&i915->drm,
 			    "No general definition block is found, no devices defined.\n");
-- 
2.37.3


* [RFC PATCH 16/20] drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Only required for some old pre-gen9 platforms, not for Xe.
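
The "((dev_priv) ? 0U : 0U)" form below looks odd at first glance.
Presumably it is written that way so the macro still references its
argument: register macros expand DISPLAY_MMIO_BASE() against a dev_priv in
the caller's scope, and a bare 0 would leave such variables unused under
-Wall -Wextra, while the ternary still constant-folds to 0U. A hypothetical
caller, purely for illustration:

	static u32 hypothetical_reg_offset(struct intel_crtc *crtc)
	{
		struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);

		/* dev_priv stays "used" even though the base is always 0 */
		return DISPLAY_MMIO_BASE(dev_priv) + 0x60000;
	}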

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/Makefile                         | 2 +-
 drivers/gpu/drm/i915/display/intel_display_reg_defs.h | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index f47f00b162a4..a6e7cd2185c2 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -12,7 +12,7 @@
 # Note the danger in using -Wall -Wextra is that when CI updates gcc we
 # will most likely get a sudden build breakage... Hopefully we will fix
 # new warnings before CI updates!
-subdir-ccflags-y := -Wall -Wextra
+subdir-ccflags-y := -Wall -Wextra -DI915
 subdir-ccflags-y += -Wno-format-security
 subdir-ccflags-y += -Wno-unused-parameter
 subdir-ccflags-y += -Wno-type-limits
diff --git a/drivers/gpu/drm/i915/display/intel_display_reg_defs.h b/drivers/gpu/drm/i915/display/intel_display_reg_defs.h
index 02605418ff08..e163eedd8ffd 100644
--- a/drivers/gpu/drm/i915/display/intel_display_reg_defs.h
+++ b/drivers/gpu/drm/i915/display/intel_display_reg_defs.h
@@ -8,7 +8,11 @@
 
 #include "i915_reg_defs.h"
 
+#ifdef I915
 #define DISPLAY_MMIO_BASE(dev_priv)	(INTEL_INFO(dev_priv)->display.mmio_offset)
+#else
+#define DISPLAY_MMIO_BASE(dev_priv)    ((dev_priv) ? 0U : 0U)
+#endif
 
 #define VLV_DISPLAY_BASE		0x180000
 
-- 
2.37.3


* [RFC PATCH 17/20] drm/i915/display: Fix a use-after-free when intel_edp_init_connector fails
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

We enable the DP aux channel during probe, but may free the connector
soon afterwards. Ensure the DP aux display power put is completed before
everything is freed, to prevent a use-after-free in icl_aux_pw_to_phy(),
called from icl_combo_phy_aux_power_well_disable().
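
Schematically, the race being closed (simplified; not the exact call chain):

	intel_dp_aux_xfer()
	  -> intel_display_power_put_async()    /* queues delayed work */
	intel_edp_init_connector() fails
	  -> connector/encoder state freed
	... async worker runs later ...
	  -> icl_combo_phy_aux_power_well_disable()
	     -> icl_aux_pw_to_phy()              /* use-after-free */

Flushing the pending power work synchronously in intel_dp_aux_fini(), before
the final kfree(), guarantees the worker has finished with the
soon-to-be-freed state.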

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/display/intel_display_power.c | 2 +-
 drivers/gpu/drm/i915/display/intel_display_power.h | 1 +
 drivers/gpu/drm/i915/display/intel_dp_aux.c        | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display_power.c b/drivers/gpu/drm/i915/display/intel_display_power.c
index fd0fedb65e42..cdb36e3f96cd 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power.c
@@ -786,7 +786,7 @@ void intel_display_power_flush_work(struct drm_i915_private *i915)
  * Like intel_display_power_flush_work(), but also ensure that the work
  * handler function is not running any more when this function returns.
  */
-static void
+void
 intel_display_power_flush_work_sync(struct drm_i915_private *i915)
 {
 	struct i915_power_domains *power_domains = &i915->display.power.domains;
diff --git a/drivers/gpu/drm/i915/display/intel_display_power.h b/drivers/gpu/drm/i915/display/intel_display_power.h
index 2154d900b1aa..d220f6b16e00 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.h
+++ b/drivers/gpu/drm/i915/display/intel_display_power.h
@@ -195,6 +195,7 @@ void __intel_display_power_put_async(struct drm_i915_private *i915,
 				     enum intel_display_power_domain domain,
 				     intel_wakeref_t wakeref);
 void intel_display_power_flush_work(struct drm_i915_private *i915);
+void intel_display_power_flush_work_sync(struct drm_i915_private *i915);
 #if IS_ENABLED(CONFIG_DRM_I915_DEBUG_RUNTIME_PM)
 void intel_display_power_put(struct drm_i915_private *dev_priv,
 			     enum intel_display_power_domain domain,
diff --git a/drivers/gpu/drm/i915/display/intel_dp_aux.c b/drivers/gpu/drm/i915/display/intel_dp_aux.c
index 91c93c93e5fc..220aa88c67ee 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_aux.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_aux.c
@@ -680,6 +680,8 @@ void intel_dp_aux_fini(struct intel_dp *intel_dp)
 	if (cpu_latency_qos_request_active(&intel_dp->pm_qos))
 		cpu_latency_qos_remove_request(&intel_dp->pm_qos);
 
+	/* Ensure async work from intel_dp_aux_xfer() is flushed before we clean up */
+	intel_display_power_flush_work_sync(dp_to_i915(intel_dp));
 	kfree(intel_dp->aux.name);
 }
 
-- 
2.37.3


* [RFC PATCH 18/20] drm/i915/display: Remaining changes to make xe compile
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Xe, the new Intel GPU driver, will re-use the i915 display.

At least for now, the plan is to use symbolic links and
adjust the build so we are building the display either for
i915 or for xe.

The display code can be split out if needed.
Also, the compilation is optional at this time.
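
Most hunks below apply one recurring shape: i915-only dependencies (GEM/GT
services such as i915_sw_fence, DPT VMAs, RPS boosting) are fenced behind
the I915 define that the i915 Makefile gained earlier in this series, so the
shared display sources also compile in the xe build. Schematically (many
hunks simply omit the #else):

	#ifdef I915
		/* i915 build: keep the GEM/GT-coupled path */
		i915_sw_fence_fini(&state->commit_ready);
	#else
		/* xe build: the dependency is stubbed out or
		 * handled by xe-specific code */
	#endif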

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Rodrigo changed i915_reg_defs.h, commit msg, and rebased]
---
 drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
 .../gpu/drm/i915/display/intel_atomic_plane.c |  25 ++-
 .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
 drivers/gpu/drm/i915/display/intel_bw.c       |   2 +-
 drivers/gpu/drm/i915/display/intel_cdclk.c    |  23 ++-
 drivers/gpu/drm/i915/display/intel_color.c    |   1 +
 drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
 drivers/gpu/drm/i915/display/intel_cursor.c   |   8 +-
 drivers/gpu/drm/i915/display/intel_display.c  | 150 ++++++++++++++++--
 drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
 .../gpu/drm/i915/display/intel_display_core.h |   5 +-
 .../drm/i915/display/intel_display_debugfs.c  |   8 +
 .../drm/i915/display/intel_display_power.c    |  35 ++--
 .../drm/i915/display/intel_display_power.h    |   5 +
 .../i915/display/intel_display_power_map.c    |   7 +
 .../i915/display/intel_display_power_well.c   |  17 +-
 .../drm/i915/display/intel_display_trace.h    |   6 +
 .../drm/i915/display/intel_display_types.h    |  24 ++-
 drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
 drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
 drivers/gpu/drm/i915/display/intel_dp_aux.c   |   4 +
 drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 ++
 drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
 drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
 drivers/gpu/drm/i915/display/intel_dsb.c      | 124 ++++++++++++---
 drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 ++-
 drivers/gpu/drm/i915/display/intel_fb.c       |  96 +++++++++--
 drivers/gpu/drm/i915/display/intel_fbc.c      |  49 +++++-
 drivers/gpu/drm/i915/display/intel_fbdev.c    | 101 ++++++++++--
 drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
 .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
 .../drm/i915/display/intel_modeset_setup.c    |  11 +-
 drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
 .../gpu/drm/i915/display/intel_pch_display.h  |  16 ++
 .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
 drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
 drivers/gpu/drm/i915/display/intel_sprite.c   |  21 +++
 drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
 drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
 drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
 .../drm/i915/display/skl_universal_plane.c    |  51 +++++-
 drivers/gpu/drm/i915/display/skl_watermark.c  |   2 +-
 drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
 drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
 44 files changed, 811 insertions(+), 129 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_atomic.c b/drivers/gpu/drm/i915/display/intel_atomic.c
index 6621aa245caf..56875afa592f 100644
--- a/drivers/gpu/drm/i915/display/intel_atomic.c
+++ b/drivers/gpu/drm/i915/display/intel_atomic.c
@@ -522,7 +522,9 @@ void intel_atomic_state_free(struct drm_atomic_state *_state)
 	drm_atomic_state_default_release(&state->base);
 	kfree(state->global_objs);
 
+#ifdef I915
 	i915_sw_fence_fini(&state->commit_ready);
+#endif
 
 	kfree(state);
 }
diff --git a/drivers/gpu/drm/i915/display/intel_atomic_plane.c b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
index 10e1fc9d0698..acb32396e73c 100644
--- a/drivers/gpu/drm/i915/display/intel_atomic_plane.c
+++ b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
@@ -34,7 +34,9 @@
 #include <drm/drm_atomic_helper.h>
 #include <drm/drm_fourcc.h>
 
+#ifdef I915
 #include "gt/intel_rps.h"
+#endif
 
 #include "intel_atomic_plane.h"
 #include "intel_cdclk.h"
@@ -107,7 +109,9 @@ intel_plane_duplicate_state(struct drm_plane *plane)
 	__drm_atomic_helper_plane_duplicate_state(plane, &intel_state->uapi);
 
 	intel_state->ggtt_vma = NULL;
+#ifdef I915
 	intel_state->dpt_vma = NULL;
+#endif
 	intel_state->flags = 0;
 
 	/* add reference to fb */
@@ -132,7 +136,9 @@ intel_plane_destroy_state(struct drm_plane *plane,
 	struct intel_plane_state *plane_state = to_intel_plane_state(state);
 
 	drm_WARN_ON(plane->dev, plane_state->ggtt_vma);
+#ifdef I915
 	drm_WARN_ON(plane->dev, plane_state->dpt_vma);
+#endif
 
 	__drm_atomic_helper_plane_destroy_state(&plane_state->uapi);
 	if (plane_state->hw.fb)
@@ -937,6 +943,7 @@ int intel_atomic_plane_check_clipping(struct intel_plane_state *plane_state,
 	return 0;
 }
 
+#ifdef I915
 struct wait_rps_boost {
 	struct wait_queue_entry wait;
 
@@ -994,6 +1001,7 @@ static void add_rps_boost_after_vblank(struct drm_crtc *crtc,
 
 	add_wait_queue(drm_crtc_vblank_waitqueue(crtc), &wait->wait);
 }
+#endif
 
 /**
  * intel_prepare_plane_fb - Prepare fb for usage on plane
@@ -1011,10 +1019,11 @@ static int
 intel_prepare_plane_fb(struct drm_plane *_plane,
 		       struct drm_plane_state *_new_plane_state)
 {
-	struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
-	struct intel_plane *plane = to_intel_plane(_plane);
 	struct intel_plane_state *new_plane_state =
 		to_intel_plane_state(_new_plane_state);
+#ifdef I915
+	struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
+	struct intel_plane *plane = to_intel_plane(_plane);
 	struct intel_atomic_state *state =
 		to_intel_atomic_state(new_plane_state->uapi.state);
 	struct drm_i915_private *dev_priv = to_i915(plane->base.dev);
@@ -1113,6 +1122,12 @@ intel_prepare_plane_fb(struct drm_plane *_plane,
 	intel_plane_unpin_fb(new_plane_state);
 
 	return ret;
+#else
+	if (!intel_fb_obj(new_plane_state->hw.fb))
+		return 0;
+
+	return intel_plane_pin_fb(new_plane_state);
+#endif
 }
 
 /**
@@ -1128,18 +1143,20 @@ intel_cleanup_plane_fb(struct drm_plane *plane,
 {
 	struct intel_plane_state *old_plane_state =
 		to_intel_plane_state(_old_plane_state);
-	struct intel_atomic_state *state =
+	__maybe_unused struct intel_atomic_state *state =
 		to_intel_atomic_state(old_plane_state->uapi.state);
-	struct drm_i915_private *dev_priv = to_i915(plane->dev);
+	__maybe_unused struct drm_i915_private *dev_priv = to_i915(plane->dev);
 	struct drm_i915_gem_object *obj = intel_fb_obj(old_plane_state->hw.fb);
 
 	if (!obj)
 		return;
 
+#ifdef I915
 	if (state->rps_interactive) {
 		intel_rps_mark_interactive(&to_gt(dev_priv)->rps, false);
 		state->rps_interactive = false;
 	}
+#endif
 
 	/* Should only be called after a successful intel_prepare_plane_fb()! */
 	intel_plane_unpin_fb(old_plane_state);
diff --git a/drivers/gpu/drm/i915/display/intel_backlight.c b/drivers/gpu/drm/i915/display/intel_backlight.c
index 5b7da72c95b8..e63eb43622e0 100644
--- a/drivers/gpu/drm/i915/display/intel_backlight.c
+++ b/drivers/gpu/drm/i915/display/intel_backlight.c
@@ -19,7 +19,7 @@
 #include "intel_dp_aux_backlight.h"
 #include "intel_dsi_dcs_backlight.h"
 #include "intel_panel.h"
-#include "intel_pci_config.h"
+#include "../i915/intel_pci_config.h"
 #include "intel_pps.h"
 #include "intel_quirks.h"
 
diff --git a/drivers/gpu/drm/i915/display/intel_bw.c b/drivers/gpu/drm/i915/display/intel_bw.c
index 54e03a3eaa0f..67b4e947589c 100644
--- a/drivers/gpu/drm/i915/display/intel_bw.c
+++ b/drivers/gpu/drm/i915/display/intel_bw.c
@@ -15,7 +15,7 @@
 #include "intel_display_core.h"
 #include "intel_display_types.h"
 #include "skl_watermark.h"
-#include "intel_mchbar_regs.h"
+#include "../i915/intel_mchbar_regs.h"
 
 /* Parameters for Qclk Geyserville (QGV) */
 struct intel_qgv_point {
diff --git a/drivers/gpu/drm/i915/display/intel_cdclk.c b/drivers/gpu/drm/i915/display/intel_cdclk.c
index 80e2db6b5ea4..3b6a37403f25 100644
--- a/drivers/gpu/drm/i915/display/intel_cdclk.c
+++ b/drivers/gpu/drm/i915/display/intel_cdclk.c
@@ -23,7 +23,6 @@
 
 #include <linux/time.h>
 
-#include "hsw_ips.h"
 #include "i915_reg.h"
 #include "intel_atomic.h"
 #include "intel_atomic_plane.h"
@@ -33,10 +32,14 @@
 #include "intel_crtc.h"
 #include "intel_de.h"
 #include "intel_display_types.h"
-#include "intel_mchbar_regs.h"
-#include "intel_pci_config.h"
+#include "../i915/intel_mchbar_regs.h"
+#include "../i915/intel_pci_config.h"
 #include "intel_psr.h"
+
+#ifdef I915
+#include "hsw_ips.h"
 #include "vlv_sideband.h"
+#endif
 
 /**
  * DOC: CDCLK / RAWCLK
@@ -474,6 +477,7 @@ static void hsw_get_cdclk(struct drm_i915_private *dev_priv,
 		cdclk_config->cdclk = 540000;
 }
 
+#ifdef I915
 static int vlv_calc_cdclk(struct drm_i915_private *dev_priv, int min_cdclk)
 {
 	int freq_320 = (dev_priv->hpll_freq <<  1) % 320000 != 0 ?
@@ -712,6 +716,7 @@ static void chv_set_cdclk(struct drm_i915_private *dev_priv,
 
 	intel_display_power_put(dev_priv, POWER_DOMAIN_DISPLAY_CORE, wakeref);
 }
+#endif
 
 static int bdw_calc_cdclk(int min_cdclk)
 {
@@ -2375,9 +2380,11 @@ int intel_crtc_compute_min_cdclk(const struct intel_crtc_state *crtc_state)
 
 	min_cdclk = intel_pixel_rate_to_cdclk(crtc_state);
 
+#ifdef I915
 	/* pixel rate mustn't exceed 95% of cdclk with IPS on BDW */
 	if (IS_BROADWELL(dev_priv) && hsw_crtc_state_ips_capable(crtc_state))
 		min_cdclk = DIV_ROUND_UP(min_cdclk * 100, 95);
+#endif
 
 	/* BSpec says "Do not use DisplayPort with CDCLK less than 432 MHz,
 	 * audio enabled, port width x4, and link rate HBR2 (5.4 GHz), or else
@@ -2571,6 +2578,7 @@ static int bxt_compute_min_voltage_level(struct intel_cdclk_state *cdclk_state)
 	return min_voltage_level;
 }
 
+#ifdef I915
 static int vlv_modeset_calc_cdclk(struct intel_cdclk_state *cdclk_state)
 {
 	struct intel_atomic_state *state = cdclk_state->base.state;
@@ -2599,6 +2607,7 @@ static int vlv_modeset_calc_cdclk(struct intel_cdclk_state *cdclk_state)
 
 	return 0;
 }
+#endif
 
 static int bdw_modeset_calc_cdclk(struct intel_cdclk_state *cdclk_state)
 {
@@ -3101,12 +3110,14 @@ static int pch_rawclk(struct drm_i915_private *dev_priv)
 	return (intel_de_read(dev_priv, PCH_RAWCLK_FREQ) & RAWCLK_FREQ_MASK) * 1000;
 }
 
+#ifdef I915
 static int vlv_hrawclk(struct drm_i915_private *dev_priv)
 {
 	/* RAWCLK_FREQ_VLV register updated from power well code */
 	return vlv_get_cck_clock_hpll(dev_priv, "hrawclk",
 				      CCK_DISPLAY_REF_CLOCK_CONTROL);
 }
+#endif
 
 static int i9xx_hrawclk(struct drm_i915_private *dev_priv)
 {
@@ -3188,8 +3199,10 @@ u32 intel_read_rawclk(struct drm_i915_private *dev_priv)
 		freq = cnp_rawclk(dev_priv);
 	else if (HAS_PCH_SPLIT(dev_priv))
 		freq = pch_rawclk(dev_priv);
+#ifdef I915
 	else if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv))
 		freq = vlv_hrawclk(dev_priv);
+#endif
 	else if (DISPLAY_VER(dev_priv) >= 3)
 		freq = i9xx_hrawclk(dev_priv);
 	else
@@ -3246,6 +3259,7 @@ static const struct intel_cdclk_funcs bdw_cdclk_funcs = {
 	.modeset_calc_cdclk = bdw_modeset_calc_cdclk,
 };
 
+#ifdef I915
 static const struct intel_cdclk_funcs chv_cdclk_funcs = {
 	.get_cdclk = vlv_get_cdclk,
 	.set_cdclk = chv_set_cdclk,
@@ -3257,6 +3271,7 @@ static const struct intel_cdclk_funcs vlv_cdclk_funcs = {
 	.set_cdclk = vlv_set_cdclk,
 	.modeset_calc_cdclk = vlv_modeset_calc_cdclk,
 };
+#endif
 
 static const struct intel_cdclk_funcs hsw_cdclk_funcs = {
 	.get_cdclk = hsw_get_cdclk,
@@ -3378,10 +3393,12 @@ void intel_init_cdclk_hooks(struct drm_i915_private *dev_priv)
 		dev_priv->display.funcs.cdclk = &bdw_cdclk_funcs;
 	} else if (IS_HASWELL(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &hsw_cdclk_funcs;
+#ifdef I915
 	} else if (IS_CHERRYVIEW(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &chv_cdclk_funcs;
 	} else if (IS_VALLEYVIEW(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &vlv_cdclk_funcs;
+#endif
 	} else if (IS_SANDYBRIDGE(dev_priv) || IS_IVYBRIDGE(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &fixed_400mhz_cdclk_funcs;
 	} else if (IS_IRONLAKE(dev_priv)) {
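
A note on the "../i915/" includes above: because the display sources
are symlinked into the xe tree, headers that exist only under
drivers/gpu/drm/i915 need an explicit relative path, while shared
display headers keep resolving locally. Illustrative of the
convention only:

#include "../i915/intel_mchbar_regs.h"	/* i915-only register header */
#include "intel_de.h"			/* shared display header */
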
diff --git a/drivers/gpu/drm/i915/display/intel_color.c b/drivers/gpu/drm/i915/display/intel_color.c
index d57631b0bb9a..22f42ec3ee03 100644
--- a/drivers/gpu/drm/i915/display/intel_color.c
+++ b/drivers/gpu/drm/i915/display/intel_color.c
@@ -26,6 +26,7 @@
 #include "intel_color.h"
 #include "intel_de.h"
 #include "intel_display_types.h"
+#include "intel_dpll.h"
 #include "intel_dsb.h"
 
 struct intel_color_funcs {
diff --git a/drivers/gpu/drm/i915/display/intel_crtc.c b/drivers/gpu/drm/i915/display/intel_crtc.c
index 037fc140b585..5214bfe86a13 100644
--- a/drivers/gpu/drm/i915/display/intel_crtc.c
+++ b/drivers/gpu/drm/i915/display/intel_crtc.c
@@ -12,8 +12,10 @@
 #include <drm/drm_vblank_work.h>
 
 #include "i915_irq.h"
+#ifdef I915
 #include "i915_vgpu.h"
 #include "i9xx_plane.h"
+#endif
 #include "icl_dsi.h"
 #include "intel_atomic.h"
 #include "intel_atomic_plane.h"
@@ -306,7 +308,11 @@ int intel_crtc_init(struct drm_i915_private *dev_priv, enum pipe pipe)
 		primary = skl_universal_plane_create(dev_priv, pipe,
 						     PLANE_PRIMARY);
 	else
+#ifdef I915
 		primary = intel_primary_plane_create(dev_priv, pipe);
+#else
+		BUG_ON(1);
+#endif
 	if (IS_ERR(primary)) {
 		ret = PTR_ERR(primary);
 		goto fail;
@@ -655,13 +661,15 @@ void intel_pipe_update_end(struct intel_crtc_state *new_crtc_state)
 					 drm_crtc_accurate_vblank_count(&crtc->base) + 1,
 					 false);
 	} else if (new_crtc_state->uapi.event) {
+		unsigned long flags;
+
 		drm_WARN_ON(&dev_priv->drm,
 			    drm_crtc_vblank_get(&crtc->base) != 0);
 
-		spin_lock(&crtc->base.dev->event_lock);
+		spin_lock_irqsave(&crtc->base.dev->event_lock, flags);
 		drm_crtc_arm_vblank_event(&crtc->base,
 					  new_crtc_state->uapi.event);
-		spin_unlock(&crtc->base.dev->event_lock);
+		spin_unlock_irqrestore(&crtc->base.dev->event_lock, flags);
 
 		new_crtc_state->uapi.event = NULL;
 	}
@@ -684,8 +692,10 @@ void intel_pipe_update_end(struct intel_crtc_state *new_crtc_state)
 
 	local_irq_enable();
 
+#ifdef I915
 	if (intel_vgpu_active(dev_priv))
 		return;
+#endif
 
 	if (crtc->debug.start_vbl_count &&
 	    crtc->debug.start_vbl_count != end_vbl_count) {
diff --git a/drivers/gpu/drm/i915/display/intel_cursor.c b/drivers/gpu/drm/i915/display/intel_cursor.c
index 371009f8e194..5bdd66e66202 100644
--- a/drivers/gpu/drm/i915/display/intel_cursor.c
+++ b/drivers/gpu/drm/i915/display/intel_cursor.c
@@ -31,15 +31,15 @@ static const u32 intel_cursor_formats[] = {
 
 static u32 intel_cursor_base(const struct intel_plane_state *plane_state)
 {
-	struct drm_i915_private *dev_priv =
+	__maybe_unused struct drm_i915_private *dev_priv =
 		to_i915(plane_state->uapi.plane->dev);
-	const struct drm_framebuffer *fb = plane_state->hw.fb;
-	const struct drm_i915_gem_object *obj = intel_fb_obj(fb);
 	u32 base;
 
+#ifdef I915
 	if (INTEL_INFO(dev_priv)->display.cursor_needs_physical)
-		base = sg_dma_address(obj->mm.pages->sgl);
+		base = sg_dma_address(intel_fb_obj(plane_state->hw.fb)->mm.pages->sgl);
 	else
+#endif
 		base = intel_plane_ggtt_offset(plane_state);
 
 	return base + plane_state->view.color_plane[0].offset;
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index ef9bab4043ee..5a0a8179b0dc 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -46,7 +46,7 @@
 #include <drm/drm_rect.h>
 
 #include "display/intel_audio.h"
-#include "display/intel_crt.h"
+#include "display/intel_backlight.h"
 #include "display/intel_ddi.h"
 #include "display/intel_display_debugfs.h"
 #include "display/intel_display_power.h"
@@ -55,24 +55,36 @@
 #include "display/intel_dpll.h"
 #include "display/intel_dpll_mgr.h"
 #include "display/intel_drrs.h"
+#include "display/intel_dsb.h"
 #include "display/intel_dsi.h"
-#include "display/intel_dvo.h"
 #include "display/intel_fb.h"
 #include "display/intel_gmbus.h"
 #include "display/intel_hdmi.h"
 #include "display/intel_lvds.h"
-#include "display/intel_sdvo.h"
 #include "display/intel_snps_phy.h"
-#include "display/intel_tv.h"
 #include "display/intel_vdsc.h"
 #include "display/intel_vrr.h"
 
+#ifdef I915
+#include "display/intel_crt.h"
+#include "display/intel_dvo.h"
+#include "display/intel_overlay.h"
+#include "display/intel_sdvo.h"
+#include "display/intel_tv.h"
+
 #include "gem/i915_gem_lmem.h"
 #include "gem/i915_gem_object.h"
 
 #include "g4x_dp.h"
 #include "g4x_hdmi.h"
 #include "hsw_ips.h"
+#include "i9xx_plane.h"
+#include "vlv_dsi.h"
+#include "vlv_dsi_pll.h"
+#include "vlv_dsi_regs.h"
+#include "vlv_sideband.h"
+#endif
+
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "i915_utils.h"
@@ -101,7 +113,6 @@
 #include "intel_hti.h"
 #include "intel_modeset_verify.h"
 #include "intel_modeset_setup.h"
-#include "intel_overlay.h"
 #include "intel_panel.h"
 #include "intel_pch_display.h"
 #include "intel_pch_refclk.h"
@@ -114,14 +125,16 @@
 #include "intel_sprite.h"
 #include "intel_tc.h"
 #include "intel_vga.h"
-#include "i9xx_plane.h"
 #include "skl_scaler.h"
 #include "skl_universal_plane.h"
 #include "skl_watermark.h"
+
+#ifdef I915
 #include "vlv_dsi.h"
 #include "vlv_dsi_pll.h"
 #include "vlv_dsi_regs.h"
 #include "vlv_sideband.h"
+#endif
 
 static void intel_set_transcoder_timings(const struct intel_crtc_state *crtc_state);
 static void intel_set_pipe_src_size(const struct intel_crtc_state *crtc_state);
@@ -224,6 +237,7 @@ static int intel_compute_global_watermarks(struct intel_atomic_state *state)
 	return 0;
 }
 
+#ifdef I915
 /* returns HPLL frequency in kHz */
 int vlv_get_hpll_vco(struct drm_i915_private *dev_priv)
 {
@@ -280,6 +294,7 @@ static void intel_update_czclk(struct drm_i915_private *dev_priv)
 	drm_dbg(&dev_priv->drm, "CZ clock rate: %d kHz\n",
 		dev_priv->czclk_freq);
 }
+#endif
 
 static bool is_hdr_mode(const struct intel_crtc_state *crtc_state)
 {
@@ -879,14 +894,17 @@ __intel_display_resume(struct drm_i915_private *i915,
 	return intel_display_commit_duplicated_state(to_intel_atomic_state(state), ctx);
 }
 
+#ifdef I915
 static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
 {
 	return (INTEL_INFO(dev_priv)->gpu_reset_clobbers_display &&
 		intel_has_gpu_reset(to_gt(dev_priv)));
 }
+#endif
 
 void intel_display_prepare_reset(struct drm_i915_private *dev_priv)
 {
+#ifdef I915
 	struct drm_modeset_acquire_ctx *ctx = &dev_priv->display.restore.reset_ctx;
 	struct drm_atomic_state *state;
 	int ret;
@@ -945,10 +963,12 @@ void intel_display_prepare_reset(struct drm_i915_private *dev_priv)
 
 	dev_priv->display.restore.modeset_state = state;
 	state->acquire_ctx = ctx;
+#endif
 }
 
 void intel_display_finish_reset(struct drm_i915_private *i915)
 {
+#ifdef I915
 	struct drm_modeset_acquire_ctx *ctx = &i915->display.restore.reset_ctx;
 	struct drm_atomic_state *state;
 	int ret;
@@ -996,6 +1016,7 @@ void intel_display_finish_reset(struct drm_i915_private *i915)
 	mutex_unlock(&i915->drm.mode_config.mutex);
 
 	clear_bit_unlock(I915_RESET_MODESET, &to_gt(i915)->reset.flags);
+#endif
 }
 
 static void icl_set_pipe_chicken(const struct intel_crtc_state *crtc_state)
@@ -3123,6 +3144,7 @@ static void i9xx_get_pfit_config(struct intel_crtc_state *crtc_state)
 		intel_de_read(dev_priv, PFIT_PGM_RATIOS);
 }
 
+#ifdef I915
 static void vlv_crtc_clock_get(struct intel_crtc *crtc,
 			       struct intel_crtc_state *pipe_config)
 {
@@ -3183,6 +3205,7 @@ static void chv_crtc_clock_get(struct intel_crtc *crtc,
 
 	pipe_config->port_clock = chv_calc_dpll_params(refclk, &clock);
 }
+#endif
 
 static enum intel_output_format
 bdw_get_pipemisc_output_format(struct intel_crtc *crtc)
@@ -3287,7 +3310,7 @@ static bool i9xx_get_pipe_config(struct intel_crtc *crtc,
 	intel_get_pipe_src_size(crtc, pipe_config);
 
 	i9xx_get_pfit_config(pipe_config);
-
+#ifdef I915
 	if (DISPLAY_VER(dev_priv) >= 4) {
 		/* No way to read it out on pipes B and C */
 		if (IS_CHERRYVIEW(dev_priv) && crtc->pipe != PIPE_A)
@@ -3329,6 +3352,7 @@ static bool i9xx_get_pipe_config(struct intel_crtc *crtc,
 	else if (IS_VALLEYVIEW(dev_priv))
 		vlv_crtc_clock_get(crtc, pipe_config);
 	else
+#endif
 		i9xx_crtc_clock_get(crtc, pipe_config);
 
 	/*
@@ -3987,6 +4011,7 @@ static bool bxt_get_dsi_transcoder_state(struct intel_crtc *crtc,
 					 struct intel_crtc_state *pipe_config,
 					 struct intel_display_power_domain_set *power_domain_set)
 {
+#ifdef I915
 	struct drm_device *dev = crtc->base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	enum transcoder cpu_transcoder;
@@ -4025,6 +4050,7 @@ static bool bxt_get_dsi_transcoder_state(struct intel_crtc *crtc,
 		pipe_config->cpu_transcoder = cpu_transcoder;
 		break;
 	}
+#endif
 
 	return transcoder_is_dsi(pipe_config->cpu_transcoder);
 }
@@ -4129,7 +4155,9 @@ static bool hsw_get_pipe_config(struct intel_crtc *crtc,
 			ilk_get_pfit_config(pipe_config);
 	}
 
+#ifdef I915
 	hsw_ips_get_config(pipe_config);
+#endif
 
 	if (pipe_config->cpu_transcoder != TRANSCODER_EDP &&
 	    !transcoder_is_dsi(pipe_config->cpu_transcoder)) {
@@ -4762,8 +4790,8 @@ static u16 hsw_linetime_wm(const struct intel_crtc_state *crtc_state)
 	return min(linetime_wm, 0x1ff);
 }
 
-static u16 hsw_ips_linetime_wm(const struct intel_crtc_state *crtc_state,
-			       const struct intel_cdclk_state *cdclk_state)
+static inline u16 hsw_ips_linetime_wm(const struct intel_crtc_state *crtc_state,
+				      const struct intel_cdclk_state *cdclk_state)
 {
 	const struct drm_display_mode *pipe_mode =
 		&crtc_state->hw.pipe_mode;
@@ -4806,13 +4834,14 @@ static int hsw_compute_linetime_wm(struct intel_atomic_state *state,
 	struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);
 	struct intel_crtc_state *crtc_state =
 		intel_atomic_get_new_crtc_state(state, crtc);
-	const struct intel_cdclk_state *cdclk_state;
+	__maybe_unused const struct intel_cdclk_state *cdclk_state;
 
 	if (DISPLAY_VER(dev_priv) >= 9)
 		crtc_state->linetime = skl_linetime_wm(crtc_state);
 	else
 		crtc_state->linetime = hsw_linetime_wm(crtc_state);
 
+#ifdef I915
 	if (!hsw_crtc_supports_ips(crtc))
 		return 0;
 
@@ -4822,6 +4851,7 @@ static int hsw_compute_linetime_wm(struct intel_atomic_state *state,
 
 	crtc_state->ips_linetime = hsw_ips_linetime_wm(crtc_state,
 						       cdclk_state);
+#endif
 
 	return 0;
 }
@@ -4890,11 +4920,13 @@ static int intel_crtc_atomic_check(struct intel_atomic_state *state,
 			return ret;
 	}
 
+#ifdef I915
 	if (HAS_IPS(dev_priv)) {
 		ret = hsw_ips_compute_config(state, crtc);
 		if (ret)
 			return ret;
 	}
+#endif
 
 	if (DISPLAY_VER(dev_priv) >= 9 ||
 	    IS_BROADWELL(dev_priv) || IS_HASWELL(dev_priv)) {
@@ -5503,6 +5535,7 @@ pipe_config_mismatch(bool fastset, const struct intel_crtc *crtc,
 
 static bool fastboot_enabled(struct drm_i915_private *dev_priv)
 {
+#ifdef I915
 	if (dev_priv->params.fastboot != -1)
 		return dev_priv->params.fastboot;
 
@@ -5516,6 +5549,9 @@ static bool fastboot_enabled(struct drm_i915_private *dev_priv)
 
 	/* Disabled by default on all others */
 	return false;
+#else
+	return true;
+#endif
 }
 
 bool
@@ -7333,6 +7369,7 @@ static void skl_commit_modeset_enables(struct intel_atomic_state *state)
 	drm_WARN_ON(&dev_priv->drm, update_pipes);
 }
 
+#ifdef I915
 static void intel_atomic_helper_free_state(struct drm_i915_private *dev_priv)
 {
 	struct intel_atomic_state *state, *next;
@@ -7350,9 +7387,11 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 
 	intel_atomic_helper_free_state(dev_priv);
 }
+#endif
 
 static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_state)
 {
+#ifdef I915
 	struct wait_queue_entry wait_fence, wait_reset;
 	struct drm_i915_private *dev_priv = to_i915(intel_state->base.dev);
 
@@ -7376,6 +7415,24 @@ static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_stat
 	finish_wait(bit_waitqueue(&to_gt(dev_priv)->reset.flags,
 				  I915_RESET_MODESET),
 		    &wait_reset);
+#else
+	struct intel_plane_state *plane_state;
+	struct intel_plane *plane;
+	int i;
+
+	for_each_new_intel_plane_in_state(intel_state, plane, plane_state, i) {
+		struct xe_bo *bo;
+
+		if (plane_state->uapi.fence)
+			dma_fence_wait(plane_state->uapi.fence, false);
+		bo = intel_fb_obj(plane_state->hw.fb);
+		if (!bo)
+			continue;
+
+		/* TODO: May deadlock, need to grab all fences in prepare_plane_fb */
+		dma_resv_wait_timeout(bo->ttm.base.resv, DMA_RESV_USAGE_KERNEL, false, MAX_SCHEDULE_TIMEOUT);
+	}
+#endif
 }
 
 static void intel_atomic_cleanup_work(struct work_struct *work)
@@ -7394,9 +7451,45 @@ static void intel_atomic_cleanup_work(struct work_struct *work)
 	drm_atomic_helper_commit_cleanup_done(&state->base);
 	drm_atomic_state_put(&state->base);
 
+#ifdef I915
 	intel_atomic_helper_free_state(i915);
+#endif
 }
 
+#ifndef I915
+static int i915_gem_object_read_from_page(struct xe_bo *bo,
+					  u32 ofs, u64 *ptr, u32 size)
+{
+	struct ttm_bo_kmap_obj map;
+	void *virtual;
+	bool is_iomem;
+	int ret;
+	struct ww_acquire_ctx ww;
+
+	XE_BUG_ON(size != 8);
+
+	ret = xe_bo_lock(bo, &ww, 0, true);
+	if (ret)
+		return ret;
+
+	ret = ttm_bo_kmap(&bo->ttm, ofs >> PAGE_SHIFT, 1, &map);
+	if (ret)
+		goto out_unlock;
+
+	ofs &= ~PAGE_MASK;
+	virtual = ttm_kmap_obj_virtual(&map, &is_iomem);
+	if (is_iomem)
+		*ptr = readq((void __iomem *)(virtual + ofs));
+	else
+		*ptr = *(u64 *)(virtual + ofs);
+
+	ttm_bo_kunmap(&map);
+out_unlock:
+	xe_bo_unlock(bo, &ww);
+	return ret;
+}
+#endif
+
 static void intel_atomic_prepare_plane_clear_colors(struct intel_atomic_state *state)
 {
 	struct drm_i915_private *i915 = to_i915(state->base.dev);
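
The xe-only read_from_page() helper above exists to service the clear
color readback in intel_atomic_prepare_plane_clear_colors(); a
hypothetical call site (the offset variable is made up) would be:

	u64 ccval = 0;
	int ret;

	/* Read back the 8-byte clear color the GPU wrote into the BO. */
	ret = i915_gem_object_read_from_page(bo, cc_offset, &ccval,
					     sizeof(ccval));
	if (!ret)
		plane_state->ccval = ccval;
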
@@ -7629,6 +7722,7 @@ static void intel_atomic_commit_work(struct work_struct *work)
 	intel_atomic_commit_tail(state);
 }
 
+#ifdef I915
 static int
 intel_atomic_commit_ready(struct i915_sw_fence *fence,
 			  enum i915_sw_fence_notify notify)
@@ -7653,6 +7747,7 @@ intel_atomic_commit_ready(struct i915_sw_fence *fence,
 
 	return NOTIFY_DONE;
 }
+#endif
 
 static void intel_atomic_track_fbs(struct intel_atomic_state *state)
 {
@@ -7677,9 +7772,11 @@ static int intel_atomic_commit(struct drm_device *dev,
 
 	state->wakeref = intel_runtime_pm_get(&dev_priv->runtime_pm);
 
+#ifdef I915
 	drm_atomic_state_get(&state->base);
 	i915_sw_fence_init(&state->commit_ready,
 			   intel_atomic_commit_ready);
+#endif
 
 	/*
 	 * The intel_legacy_cursor_update() fast path takes care
@@ -7783,7 +7880,7 @@ static void intel_plane_possible_crtcs_init(struct drm_i915_private *dev_priv)
 	}
 }
 
-
+#ifdef I915
 int intel_get_pipe_from_crtc_id_ioctl(struct drm_device *dev, void *data,
 				      struct drm_file *file)
 {
@@ -7800,6 +7897,7 @@ int intel_get_pipe_from_crtc_id_ioctl(struct drm_device *dev, void *data,
 
 	return 0;
 }
+#endif
 
 static u32 intel_encoder_possible_clones(struct intel_encoder *encoder)
 {
@@ -7827,7 +7925,7 @@ static u32 intel_encoder_possible_crtcs(struct intel_encoder *encoder)
 	return possible_crtcs;
 }
 
-static bool ilk_has_edp_a(struct drm_i915_private *dev_priv)
+static inline bool ilk_has_edp_a(struct drm_i915_private *dev_priv)
 {
 	if (!IS_MOBILE(dev_priv))
 		return false;
@@ -7841,7 +7939,7 @@ static bool ilk_has_edp_a(struct drm_i915_private *dev_priv)
 	return true;
 }
 
-static bool intel_ddi_crt_present(struct drm_i915_private *dev_priv)
+static inline bool intel_ddi_crt_present(struct drm_i915_private *dev_priv)
 {
 	if (DISPLAY_VER(dev_priv) >= 9)
 		return false;
@@ -7866,7 +7964,6 @@ static bool intel_ddi_crt_present(struct drm_i915_private *dev_priv)
 static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 {
 	struct intel_encoder *encoder;
-	bool dpd_is_edp = false;
 
 	intel_pps_unlock_regs_wa(dev_priv);
 
@@ -7926,7 +8023,9 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 		intel_ddi_init(dev_priv, PORT_A);
 		intel_ddi_init(dev_priv, PORT_B);
 		intel_ddi_init(dev_priv, PORT_C);
+#ifdef I915
 		vlv_dsi_init(dev_priv);
+#endif
 	} else if (DISPLAY_VER(dev_priv) >= 9) {
 		intel_ddi_init(dev_priv, PORT_A);
 		intel_ddi_init(dev_priv, PORT_B);
@@ -7935,9 +8034,10 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 		intel_ddi_init(dev_priv, PORT_E);
 	} else if (HAS_DDI(dev_priv)) {
 		u32 found;
-
+#ifdef I915
 		if (intel_ddi_crt_present(dev_priv))
 			intel_crt_init(dev_priv);
+#endif
 
 		/* Haswell uses DDI functions to detect digital outputs. */
 		found = intel_de_read(dev_priv, DDI_BUF_CTL(PORT_A)) & DDI_INIT_DISPLAY_DETECTED;
@@ -7953,7 +8053,9 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 			intel_ddi_init(dev_priv, PORT_D);
 		if (found & SFUSE_STRAP_DDIF_DETECTED)
 			intel_ddi_init(dev_priv, PORT_F);
+#ifdef I915
 	} else if (HAS_PCH_SPLIT(dev_priv)) {
+		bool dpd_is_edp = false;
 		int found;
 
 		/*
@@ -8090,6 +8192,7 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 
 		intel_crt_init(dev_priv);
 		intel_dvo_init(dev_priv);
+#endif
 	}
 
 	for_each_intel_encoder(&dev_priv->drm, encoder) {
@@ -8277,6 +8380,10 @@ static const struct intel_display_funcs skl_display_funcs = {
 	.get_initial_plane_config = skl_get_initial_plane_config,
 };
 
+#ifndef I915
+#define i9xx_get_initial_plane_config skl_get_initial_plane_config
+#endif
+
 static const struct intel_display_funcs ddi_display_funcs = {
 	.get_pipe_config = hsw_get_pipe_config,
 	.crtc_enable = hsw_crtc_enable,
@@ -8661,9 +8768,11 @@ int intel_modeset_init_noirq(struct drm_i915_private *i915)
 	if (ret)
 		goto cleanup_vga_client_pw_domain_dmc;
 
+#ifdef I915
 	init_llist_head(&i915->display.atomic_helper.free_list);
 	INIT_WORK(&i915->display.atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
+#endif
 
 	intel_init_quirks(i915);
 
@@ -8716,7 +8825,9 @@ int intel_modeset_init_nogem(struct drm_i915_private *i915)
 	intel_shared_dpll_init(i915);
 	intel_fdi_pll_freq_update(i915);
 
+#ifdef I915
 	intel_update_czclk(i915);
+#endif
 	intel_modeset_init_hw(i915);
 	intel_dpll_update_ref_clks(i915);
 
@@ -8923,11 +9034,14 @@ void intel_display_resume(struct drm_device *dev)
 		drm_atomic_state_put(state);
 }
 
-static void intel_hpd_poll_fini(struct drm_i915_private *i915)
+void intel_hpd_poll_fini(struct drm_i915_private *i915)
 {
 	struct intel_connector *connector;
 	struct drm_connector_list_iter conn_iter;
 
+	if (!HAS_DISPLAY(i915))
+		return;
+
 	/* Kill all the work that may have been queued by hpd. */
 	drm_connector_list_iter_begin(&i915->drm, &conn_iter);
 	for_each_intel_connector_iter(connector, &conn_iter) {
@@ -8950,8 +9064,10 @@ void intel_modeset_driver_remove(struct drm_i915_private *i915)
 	flush_workqueue(i915->display.wq.flip);
 	flush_workqueue(i915->display.wq.modeset);
 
+#ifdef I915
 	flush_work(&i915->display.atomic_helper.free_work);
 	drm_WARN_ON(&i915->drm, !llist_empty(&i915->display.atomic_helper.free_list));
+#endif
 
 	/*
 	 * MST topology needs to be suspended so we don't have any calls to
@@ -9011,12 +9127,14 @@ bool intel_modeset_probe_defer(struct pci_dev *pdev)
 {
 	struct drm_privacy_screen *privacy_screen;
 
+#ifdef I915
 	/*
 	 * apple-gmux is needed on dual GPU MacBook Pro
 	 * to probe the panel if we're the inactive GPU.
 	 */
 	if (vga_switcheroo_client_probe_defer(pdev))
 		return true;
+#endif
 
 	/* If the LCD panel has a privacy-screen, wait for it */
 	privacy_screen = drm_privacy_screen_get(&pdev->dev, NULL);
diff --git a/drivers/gpu/drm/i915/display/intel_display.h b/drivers/gpu/drm/i915/display/intel_display.h
index ef73730f32b0..b063d16f4767 100644
--- a/drivers/gpu/drm/i915/display/intel_display.h
+++ b/drivers/gpu/drm/i915/display/intel_display.h
@@ -545,6 +545,7 @@ int vlv_get_cck_clock(struct drm_i915_private *dev_priv,
 		      const char *name, u32 reg, int ref_freq);
 int vlv_get_cck_clock_hpll(struct drm_i915_private *dev_priv,
 			   const char *name, u32 reg);
+void intel_hpd_poll_fini(struct drm_i915_private *i915);
 void intel_init_display_hooks(struct drm_i915_private *dev_priv);
 unsigned int intel_fb_xy_to_linear(int x, int y,
 				   const struct intel_plane_state *state,
@@ -670,10 +671,16 @@ void assert_transcoder(struct drm_i915_private *dev_priv,
  * enable distros and users to tailor their preferred amount of i915 abrt
  * spam.
  */
+#ifdef I915
+#define i915_display_verbose_check (i915_modparams.verbose_state_checks)
+#else
+#define i915_display_verbose_check 1
+#endif
+
 #define I915_STATE_WARN(condition, format...) ({			\
 	int __ret_warn_on = !!(condition);				\
 	if (unlikely(__ret_warn_on))					\
-		if (!WARN(i915_modparams.verbose_state_checks, format))	\
+		if (!WARN(i915_display_verbose_check, format))	\
 			DRM_ERROR(format);				\
 	unlikely(__ret_warn_on);					\
 })
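
The i915_display_verbose_check indirection keeps I915_STATE_WARN()
usable from shared code: i915 still honours the verbose_state_checks
modparam, while the xe build has no such parameter yet and pins the
check to 1. A hypothetical use (the condition is made up):

	I915_STATE_WARN(crtc_state->hw.active != hw_active,
			"pipe active state mismatch\n");

On xe this always WARNs when the condition trips; on i915 it can fall
back to DRM_ERROR when verbose_state_checks is off.
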
diff --git a/drivers/gpu/drm/i915/display/intel_display_core.h b/drivers/gpu/drm/i915/display/intel_display_core.h
index 57ddce3ba02b..1c65b5b2893e 100644
--- a/drivers/gpu/drm/i915/display/intel_display_core.h
+++ b/drivers/gpu/drm/i915/display/intel_display_core.h
@@ -227,12 +227,13 @@ struct intel_wm {
 	u16 skl_latency[8];
 
 	/* current hardware state */
+#ifdef I915
 	union {
 		struct ilk_wm_values hw;
 		struct vlv_wm_values vlv;
 		struct g4x_wm_values g4x;
 	};
-
+#endif
 	u8 max_level;
 
 	/*
@@ -274,10 +275,12 @@ struct intel_display {
 	} funcs;
 
 	/* Grouping using anonymous structs. Keep sorted. */
+#ifdef I915
 	struct intel_atomic_helper {
 		struct llist_head free_list;
 		struct work_struct free_work;
 	} atomic_helper;
+#endif
 
 	struct {
 		/* backlight registers and fields in struct intel_panel */
diff --git a/drivers/gpu/drm/i915/display/intel_display_debugfs.c b/drivers/gpu/drm/i915/display/intel_display_debugfs.c
index 7bcd90384a46..6c40ca8a709f 100644
--- a/drivers/gpu/drm/i915/display/intel_display_debugfs.c
+++ b/drivers/gpu/drm/i915/display/intel_display_debugfs.c
@@ -8,7 +8,11 @@
 #include <drm/drm_debugfs.h>
 #include <drm/drm_fourcc.h>
 
+#ifdef I915
 #include "i915_debugfs.h"
+#else
+#define i915_debugfs_describe_obj(a, b) do { } while (0)
+#endif
 #include "i915_irq.h"
 #include "i915_reg.h"
 #include "intel_de.h"
@@ -51,6 +55,7 @@ static int i915_frontbuffer_tracking(struct seq_file *m, void *unused)
 
 static int i915_ips_status(struct seq_file *m, void *unused)
 {
+#ifdef I915
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
 	intel_wakeref_t wakeref;
 
@@ -74,6 +79,9 @@ static int i915_ips_status(struct seq_file *m, void *unused)
 	intel_runtime_pm_put(&dev_priv->runtime_pm, wakeref);
 
 	return 0;
+#else
+	return -ENODEV;
+#endif
 }
 
 static int i915_sr_status(struct seq_file *m, void *unused)
diff --git a/drivers/gpu/drm/i915/display/intel_display_power.c b/drivers/gpu/drm/i915/display/intel_display_power.c
index cdb36e3f96cd..c3a57ec0d2f3 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power.c
@@ -16,11 +16,17 @@
 #include "intel_display_power_well.h"
 #include "intel_display_types.h"
 #include "intel_dmc.h"
-#include "intel_mchbar_regs.h"
+#include "../i915/intel_mchbar_regs.h"
 #include "intel_pch_refclk.h"
 #include "intel_snps_phy.h"
 #include "skl_watermark.h"
+
+#ifdef I915
 #include "vlv_sideband.h"
+#else
+#define PUNIT_REG_ISPSSPM0 0
+#define PUNIT_REG_VEDSSPM0 0
+#endif
 
 #define for_each_power_domain_well(__dev_priv, __power_well, __domain)	\
 	for_each_power_well(__dev_priv, __power_well)				\
@@ -212,8 +218,10 @@ bool __intel_display_power_is_enabled(struct drm_i915_private *dev_priv,
 	struct i915_power_well *power_well;
 	bool is_enabled;
 
+#ifdef I915
 	if (dev_priv->runtime_pm.suspended)
 		return false;
+#endif
 
 	is_enabled = true;
 
@@ -621,7 +629,6 @@ release_async_put_domains(struct i915_power_domains *power_domains,
 	struct drm_i915_private *dev_priv =
 		container_of(power_domains, struct drm_i915_private,
 			     display.power.domains);
-	struct intel_runtime_pm *rpm = &dev_priv->runtime_pm;
 	enum intel_display_power_domain domain;
 	intel_wakeref_t wakeref;
 
@@ -630,8 +637,8 @@ release_async_put_domains(struct i915_power_domains *power_domains,
 	 * wakeref to make the state checker happy about the HW access during
 	 * power well disabling.
 	 */
-	assert_rpm_raw_wakeref_held(rpm);
-	wakeref = intel_runtime_pm_get(rpm);
+	assert_rpm_raw_wakeref_held(&dev_priv->runtime_pm);
+	wakeref = intel_runtime_pm_get(&dev_priv->runtime_pm);
 
 	for_each_power_domain(domain, mask) {
 		/* Clear before put, so put's sanity check is happy. */
@@ -639,7 +646,7 @@ release_async_put_domains(struct i915_power_domains *power_domains,
 		__intel_display_power_put_domain(dev_priv, domain);
 	}
 
-	intel_runtime_pm_put(rpm, wakeref);
+	intel_runtime_pm_put(&dev_priv->runtime_pm, wakeref);
 }
 
 static void
@@ -649,8 +656,7 @@ intel_display_power_put_async_work(struct work_struct *work)
 		container_of(work, struct drm_i915_private,
 			     display.power.domains.async_put_work.work);
 	struct i915_power_domains *power_domains = &dev_priv->display.power.domains;
-	struct intel_runtime_pm *rpm = &dev_priv->runtime_pm;
-	intel_wakeref_t new_work_wakeref = intel_runtime_pm_get_raw(rpm);
+	intel_wakeref_t new_work_wakeref = intel_runtime_pm_get_raw(&dev_priv->runtime_pm);
 	intel_wakeref_t old_work_wakeref = 0;
 
 	mutex_lock(&power_domains->lock);
@@ -689,9 +695,9 @@ intel_display_power_put_async_work(struct work_struct *work)
 	mutex_unlock(&power_domains->lock);
 
 	if (old_work_wakeref)
-		intel_runtime_pm_put_raw(rpm, old_work_wakeref);
+		intel_runtime_pm_put_raw(&dev_priv->runtime_pm, old_work_wakeref);
 	if (new_work_wakeref)
-		intel_runtime_pm_put_raw(rpm, new_work_wakeref);
+		intel_runtime_pm_put_raw(&dev_priv->runtime_pm, new_work_wakeref);
 }
 
 /**
@@ -709,8 +715,7 @@ void __intel_display_power_put_async(struct drm_i915_private *i915,
 				     intel_wakeref_t wakeref)
 {
 	struct i915_power_domains *power_domains = &i915->display.power.domains;
-	struct intel_runtime_pm *rpm = &i915->runtime_pm;
-	intel_wakeref_t work_wakeref = intel_runtime_pm_get_raw(rpm);
+	intel_wakeref_t work_wakeref = intel_runtime_pm_get_raw(&i915->runtime_pm);
 
 	mutex_lock(&power_domains->lock);
 
@@ -737,9 +742,9 @@ void __intel_display_power_put_async(struct drm_i915_private *i915,
 	mutex_unlock(&power_domains->lock);
 
 	if (work_wakeref)
-		intel_runtime_pm_put_raw(rpm, work_wakeref);
+		intel_runtime_pm_put_raw(&i915->runtime_pm, work_wakeref);
 
-	intel_runtime_pm_put(rpm, wakeref);
+	intel_runtime_pm_put(&i915->runtime_pm, wakeref);
 }
 
 /**
@@ -1830,6 +1835,7 @@ static void vlv_cmnlane_wa(struct drm_i915_private *dev_priv)
 
 static bool vlv_punit_is_power_gated(struct drm_i915_private *dev_priv, u32 reg0)
 {
+#ifdef I915
 	bool ret;
 
 	vlv_punit_get(dev_priv);
@@ -1837,6 +1843,9 @@ static bool vlv_punit_is_power_gated(struct drm_i915_private *dev_priv, u32 reg0
 	vlv_punit_put(dev_priv);
 
 	return ret;
+#else
+	return false;
+#endif
 }
 
 static void assert_ved_power_gated(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/display/intel_display_power.h b/drivers/gpu/drm/i915/display/intel_display_power.h
index d220f6b16e00..3aae045749f7 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.h
+++ b/drivers/gpu/drm/i915/display/intel_display_power.h
@@ -7,6 +7,11 @@
 #define __INTEL_DISPLAY_POWER_H__
 
 #include "intel_wakeref.h"
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include "intel_runtime_pm.h"
 
 enum aux_ch;
 enum dpio_channel;
diff --git a/drivers/gpu/drm/i915/display/intel_display_power_map.c b/drivers/gpu/drm/i915/display/intel_display_power_map.c
index f5d66ca85b19..6e1facc66af3 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power_map.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power_map.c
@@ -6,7 +6,10 @@
 #include "i915_drv.h"
 #include "i915_reg.h"
 
+#ifdef I915
 #include "vlv_sideband_reg.h"
+#endif
 
 #include "intel_display_power_map.h"
 #include "intel_display_power_well.h"
@@ -197,6 +200,7 @@ I915_DECL_PW_DOMAINS(vlv_pwdoms_dpio_tx_bc_lanes,
 	POWER_DOMAIN_INIT);
 
 static const struct i915_power_well_desc vlv_power_wells_main[] = {
+#ifdef I915
 	{
 		.instances = &I915_PW_INSTANCES(
 			I915_PW("display", &vlv_pwdoms_display,
@@ -224,6 +228,7 @@ static const struct i915_power_well_desc vlv_power_wells_main[] = {
 		),
 		.ops = &vlv_dpio_cmn_power_well_ops,
 	},
+#endif
 };
 
 static const struct i915_power_well_desc_list vlv_power_wells[] = {
@@ -274,6 +279,7 @@ I915_DECL_PW_DOMAINS(chv_pwdoms_dpio_cmn_d,
 	POWER_DOMAIN_INIT);
 
 static const struct i915_power_well_desc chv_power_wells_main[] = {
+#ifdef I915
 	{
 		/*
 		 * Pipe A power well is the new disp2d well. Pipe B and C
@@ -295,6 +301,7 @@ static const struct i915_power_well_desc chv_power_wells_main[] = {
 		),
 		.ops = &chv_dpio_cmn_power_well_ops,
 	},
+#endif
 };
 
 static const struct i915_power_well_desc_list chv_power_wells[] = {
diff --git a/drivers/gpu/drm/i915/display/intel_display_power_well.c b/drivers/gpu/drm/i915/display/intel_display_power_well.c
index a1d75956ae97..9683cb661f62 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power_well.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power_well.c
@@ -8,7 +8,6 @@
 #include "intel_backlight_regs.h"
 #include "intel_combo_phy.h"
 #include "intel_combo_phy_regs.h"
-#include "intel_crt.h"
 #include "intel_de.h"
 #include "intel_display_power_well.h"
 #include "intel_display_types.h"
@@ -22,8 +21,12 @@
 #include "intel_tc.h"
 #include "intel_vga.h"
 #include "skl_watermark.h"
+
+#ifdef I915
+#include "intel_crt.h"
 #include "vlv_sideband.h"
 #include "vlv_sideband_reg.h"
+#endif
 
 struct i915_power_well_regs {
 	i915_reg_t bios;
@@ -1061,6 +1064,7 @@ static void i830_pipes_power_well_sync_hw(struct drm_i915_private *dev_priv,
 		i830_pipes_power_well_disable(dev_priv, power_well);
 }
 
+#ifdef I915
 static void vlv_set_power_well(struct drm_i915_private *dev_priv,
 			       struct i915_power_well *power_well, bool enable)
 {
@@ -1719,6 +1723,7 @@ static void chv_pipe_power_well_disable(struct drm_i915_private *dev_priv,
 
 	chv_set_pipe_power_well(dev_priv, power_well, false);
 }
+#endif
 
 static void
 tgl_tc_cold_request(struct drm_i915_private *i915, bool block)
@@ -1843,17 +1848,21 @@ const struct i915_power_well_ops i9xx_always_on_power_well_ops = {
 };
 
 const struct i915_power_well_ops chv_pipe_power_well_ops = {
+#ifdef I915
 	.sync_hw = chv_pipe_power_well_sync_hw,
 	.enable = chv_pipe_power_well_enable,
 	.disable = chv_pipe_power_well_disable,
 	.is_enabled = chv_pipe_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops chv_dpio_cmn_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = chv_dpio_cmn_power_well_enable,
 	.disable = chv_dpio_cmn_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops i830_pipes_power_well_ops = {
@@ -1894,23 +1903,29 @@ const struct i915_power_well_ops bxt_dpio_cmn_power_well_ops = {
 
 const struct i915_power_well_ops vlv_display_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = vlv_display_power_well_enable,
 	.disable = vlv_display_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops vlv_dpio_cmn_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = vlv_dpio_cmn_power_well_enable,
 	.disable = vlv_dpio_cmn_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops vlv_dpio_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = vlv_power_well_enable,
 	.disable = vlv_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 static const struct i915_power_well_regs icl_aux_power_well_regs = {
diff --git a/drivers/gpu/drm/i915/display/intel_display_trace.h b/drivers/gpu/drm/i915/display/intel_display_trace.h
index 725aba3fa531..391ddb94062b 100644
--- a/drivers/gpu/drm/i915/display/intel_display_trace.h
+++ b/drivers/gpu/drm/i915/display/intel_display_trace.h
@@ -185,6 +185,7 @@ TRACE_EVENT(intel_memory_cxsr,
 		      __entry->frame[PIPE_C], __entry->scanline[PIPE_C])
 );
 
+#ifdef I915
 TRACE_EVENT(g4x_wm,
 	    TP_PROTO(struct intel_crtc *crtc, const struct g4x_wm_values *wm),
 	    TP_ARGS(crtc, wm),
@@ -277,6 +278,7 @@ TRACE_EVENT(vlv_wm,
 		      __entry->primary, __entry->sprite0, __entry->sprite1, __entry->cursor,
 		      __entry->sr_plane, __entry->sr_cursor)
 );
+#endif
 
 TRACE_EVENT(vlv_fifo_size,
 	    TP_PROTO(struct intel_crtc *crtc, u32 sprite0_start, u32 sprite1_start, u32 fifo_size),
@@ -648,6 +650,10 @@ TRACE_EVENT(intel_frontbuffer_flush,
 /* This part must be outside protection */
 #undef TRACE_INCLUDE_PATH
 #undef TRACE_INCLUDE_FILE
+#ifdef I915
 #define TRACE_INCLUDE_PATH ../../drivers/gpu/drm/i915/display
+#else
+#define TRACE_INCLUDE_PATH ../../drivers/gpu/drm/xe/display
+#endif
 #define TRACE_INCLUDE_FILE intel_display_trace
 #include <trace/define_trace.h>
diff --git a/drivers/gpu/drm/i915/display/intel_display_types.h b/drivers/gpu/drm/i915/display/intel_display_types.h
index 34250a9cf3e1..3bd391d33e42 100644
--- a/drivers/gpu/drm/i915/display/intel_display_types.h
+++ b/drivers/gpu/drm/i915/display/intel_display_types.h
@@ -46,6 +46,7 @@
 #include <drm/i915_mei_hdcp_interface.h>
 #include <media/cec-notifier.h>
 
+#include "i915_utils.h"
 #include "i915_vma.h"
 #include "i915_vma_types.h"
 #include "intel_bios.h"
@@ -141,7 +142,9 @@ struct intel_framebuffer {
 		struct intel_fb_view remapped_view;
 	};
 
+#ifdef I915
 	struct i915_address_space *dpt_vm;
+#endif
 };
 
 enum intel_hotplug_state {
@@ -653,7 +656,9 @@ struct intel_atomic_state {
 
 	bool rps_interactive;
 
+#ifdef I915
 	struct i915_sw_fence commit_ready;
+#endif
 
 	struct llist_node freed;
 };
@@ -679,7 +684,11 @@ struct intel_plane_state {
 	} hw;
 
 	struct i915_vma *ggtt_vma;
+#ifdef I915
 	struct i915_vma *dpt_vma;
+#else
+	struct i915_vma embed_vma;
+#endif
 	unsigned long flags;
 #define PLANE_HAS_FENCE BIT(0)
 
@@ -739,9 +748,9 @@ struct intel_plane_state {
 	 * this plane. They're calculated by the linked plane's wm code.
 	 */
 	u32 planar_slave;
-
+#ifdef I915
 	struct drm_intel_sprite_colorkey ckey;
-
+#endif
 	struct drm_rect psr2_sel_fetch_area;
 
 	/* Clear Color Value */
@@ -851,6 +860,7 @@ struct skl_pipe_wm {
 	bool use_sagv_wm;
 };
 
+#ifdef I915
 enum vlv_wm_level {
 	VLV_WM_LEVEL_PM2,
 	VLV_WM_LEVEL_PM5,
@@ -884,6 +894,7 @@ struct g4x_wm_state {
 	bool hpll_en;
 	bool fbc_en;
 };
+#endif
 
 struct intel_crtc_wm_state {
 	union {
@@ -927,7 +938,7 @@ struct intel_crtc_wm_state {
 			/* pre-icl: for planar Y */
 			struct skl_ddb_entry plane_ddb_y[I915_MAX_PLANES];
 		} skl;
-
+#ifdef I915
 		struct {
 			struct g4x_pipe_wm raw[NUM_VLV_WM_LEVELS]; /* not inverted */
 			struct vlv_wm_state intermediate; /* inverted */
@@ -940,6 +951,7 @@ struct intel_crtc_wm_state {
 			struct g4x_wm_state intermediate;
 			struct g4x_wm_state optimal;
 		} g4x;
+#endif
 	};
 
 	/*
@@ -1387,6 +1399,7 @@ struct intel_crtc {
 	bool pch_fifo_underrun_disabled;
 
 	/* per-pipe watermark state */
+#ifdef I915
 	struct {
 		/* watermarks currently being used  */
 		union {
@@ -1395,6 +1408,7 @@ struct intel_crtc {
 			struct g4x_wm_state g4x;
 		} active;
 	} wm;
+#endif
 
 	struct {
 		struct mutex mutex;
@@ -2053,7 +2067,11 @@ intel_crtc_needs_color_update(const struct intel_crtc_state *crtc_state)
 
 static inline u32 intel_plane_ggtt_offset(const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	return i915_ggtt_offset(plane_state->ggtt_vma);
+#else
+	return plane_state->ggtt_vma->node.start;
+#endif
 }
 
 #endif /*  __INTEL_DISPLAY_TYPES_H__ */
diff --git a/drivers/gpu/drm/i915/display/intel_dmc.c b/drivers/gpu/drm/i915/display/intel_dmc.c
index 905b5dcdca14..5482ca6ccda7 100644
--- a/drivers/gpu/drm/i915/display/intel_dmc.c
+++ b/drivers/gpu/drm/i915/display/intel_dmc.c
@@ -30,6 +30,18 @@
 #include "intel_dmc.h"
 #include "intel_dmc_regs.h"
 
+#ifndef I915
+#include "xe_uc_fw.h"
+
+#define INTEL_UC_FIRMWARE_URL XE_UC_FIRMWARE_URL
+
+__printf(2, 3)
+static inline void
+i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
+{
+}
+#endif
+
 /**
  * DOC: DMC Firmware Support
  *
@@ -262,8 +274,11 @@ static const struct stepping_info *
 intel_get_stepping_info(struct drm_i915_private *i915,
 			struct stepping_info *si)
 {
+#ifdef I915
 	const char *step_name = intel_step_name(RUNTIME_INFO(i915)->step.display_step);
-
+#else
+	const char *step_name = xe_step_name(i915->info.step.display);
+#endif
 	si->stepping = step_name[0];
 	si->substepping = step_name[1];
 	return si;
diff --git a/drivers/gpu/drm/i915/display/intel_dp.c b/drivers/gpu/drm/i915/display/intel_dp.c
index bf80f296a8fd..55973e9aeca3 100644
--- a/drivers/gpu/drm/i915/display/intel_dp.c
+++ b/drivers/gpu/drm/i915/display/intel_dp.c
@@ -43,8 +43,10 @@
 #include <drm/drm_edid.h>
 #include <drm/drm_probe_helper.h>
 
+#ifdef I915
 #include "g4x_dp.h"
 #include "i915_debugfs.h"
+#endif
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "intel_atomic.h"
@@ -2164,8 +2166,10 @@ intel_dp_compute_config(struct intel_encoder *encoder,
 	if (pipe_config->splitter.enable)
 		pipe_config->dp_m_n.data_m *= pipe_config->splitter.link_count;
 
+#ifdef I915
 	if (!HAS_DDI(dev_priv))
 		g4x_dp_set_clock(encoder, pipe_config);
+#endif
 
 	intel_vrr_compute_config(pipe_config, conn_state);
 	intel_psr_compute_config(intel_dp, pipe_config, conn_state);
@@ -5209,9 +5213,11 @@ intel_edp_add_properties(struct intel_dp *intel_dp)
 static void intel_edp_backlight_setup(struct intel_dp *intel_dp,
 				      struct intel_connector *connector)
 {
-	struct drm_i915_private *i915 = dp_to_i915(intel_dp);
 	enum pipe pipe = INVALID_PIPE;
 
+#ifdef I915
+	struct drm_i915_private *i915 = dp_to_i915(intel_dp);
+
 	if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915)) {
 		/*
 		 * Figure out the current pipe for the initial backlight setup.
@@ -5231,6 +5237,7 @@ static void intel_edp_backlight_setup(struct intel_dp *intel_dp,
 			    connector->base.base.id, connector->base.name,
 			    pipe_name(pipe));
 	}
+#endif
 
 	intel_backlight_setup(connector, pipe);
 }
@@ -5427,8 +5434,10 @@ intel_dp_init_connector(struct intel_digital_port *dig_port,
 	intel_dp_set_default_sink_rates(intel_dp);
 	intel_dp_set_default_max_sink_lane_count(intel_dp);
 
+#ifdef I915
 	if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv))
 		intel_dp->pps.active_pipe = vlv_active_pipe(intel_dp);
+#endif
 
 	drm_dbg_kms(&dev_priv->drm,
 		    "Adding %s connector on [ENCODER:%d:%s]\n",
diff --git a/drivers/gpu/drm/i915/display/intel_dp_aux.c b/drivers/gpu/drm/i915/display/intel_dp_aux.c
index 220aa88c67ee..b4b9d2e1fec7 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_aux.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_aux.c
@@ -5,7 +5,11 @@
 
 #include "i915_drv.h"
 #include "i915_reg.h"
+#ifdef I915
 #include "i915_trace.h"
+#else
+#define trace_i915_reg_rw(a...) do { } while (0)
+#endif
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_dp_aux.h"
diff --git a/drivers/gpu/drm/i915/display/intel_dpio_phy.h b/drivers/gpu/drm/i915/display/intel_dpio_phy.h
index 9c7725dacb47..952e8d446425 100644
--- a/drivers/gpu/drm/i915/display/intel_dpio_phy.h
+++ b/drivers/gpu/drm/i915/display/intel_dpio_phy.h
@@ -7,6 +7,7 @@
 #define __INTEL_DPIO_PHY_H__
 
 #include <linux/types.h>
+#include "intel_display.h"
 
 enum pipe;
 enum port;
@@ -26,6 +27,7 @@ enum dpio_phy {
 	DPIO_PHY2,
 };
 
+#ifdef I915
 void bxt_port_to_phy_channel(struct drm_i915_private *dev_priv, enum port port,
 			     enum dpio_phy *phy, enum dpio_channel *ch);
 void bxt_ddi_phy_set_signal_levels(struct intel_encoder *encoder,
@@ -71,4 +73,17 @@ void vlv_phy_pre_encoder_enable(struct intel_encoder *encoder,
 void vlv_phy_reset_lanes(struct intel_encoder *encoder,
 			 const struct intel_crtc_state *old_crtc_state);
 
+#else
+#define bxt_port_to_phy_channel(xe, port, phy, ch) do { *phy = 0; *ch = 0; } while (xe && port && 0)
+static inline void bxt_ddi_phy_set_signal_levels(struct intel_encoder *x,
+						 const struct intel_crtc_state *y) {}
+#define bxt_ddi_phy_init(xe, phy) do { } while (xe && phy && 0)
+#define bxt_ddi_phy_uninit(xe, phy) do { } while (xe && phy && 0)
+#define bxt_ddi_phy_is_enabled(xe, phy) (xe && phy && 0)
+static inline bool bxt_ddi_phy_verify_state(struct xe_device *xe, enum dpio_phy phy) { return false; }
+#define bxt_ddi_phy_calc_lane_lat_optim_mask(x) (x && 0)
+#define bxt_ddi_phy_set_lane_optim_mask(x, y) do { } while (x && y && 0)
+#define bxt_ddi_phy_get_lane_lat_optim_mask(x) (x && 0)
+#endif
+
 #endif /* __INTEL_DPIO_PHY_H__ */
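
For the record, the do { } while (x && y && 0) shape of these stubs
is deliberate: the body (where present) still runs once, the while
condition references every macro argument so call sites do not trip
unused-variable warnings in the xe build, and the trailing && 0
guarantees no second iteration, letting the compiler drop the whole
thing for side-effect-free arguments. A minimal, hypothetical
illustration:

/* 'dev' and 'phy' stay referenced, but no code is generated. */
#define example_phy_uninit(dev, phy) do { } while ((dev) && (phy) && 0)
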
diff --git a/drivers/gpu/drm/i915/display/intel_dpll.c b/drivers/gpu/drm/i915/display/intel_dpll.c
index c236aafe9be0..bfc214b36585 100644
--- a/drivers/gpu/drm/i915/display/intel_dpll.c
+++ b/drivers/gpu/drm/i915/display/intel_dpll.c
@@ -17,7 +17,10 @@
 #include "intel_panel.h"
 #include "intel_pps.h"
 #include "intel_snps_phy.h"
+
+#ifdef I915
 #include "vlv_sideband.h"
+#endif
 
 struct intel_dpll_funcs {
 	int (*crtc_compute_clock)(struct intel_atomic_state *state,
@@ -1594,6 +1597,7 @@ void i9xx_enable_pll(const struct intel_crtc_state *crtc_state)
 	}
 }
 
+#ifdef I915
 static void vlv_pllb_recal_opamp(struct drm_i915_private *dev_priv,
 				 enum pipe pipe)
 {
@@ -2005,6 +2009,7 @@ void chv_disable_pll(struct drm_i915_private *dev_priv, enum pipe pipe)
 
 	vlv_dpio_put(dev_priv);
 }
+#endif
 
 void i9xx_disable_pll(const struct intel_crtc_state *crtc_state)
 {
@@ -2023,7 +2028,7 @@ void i9xx_disable_pll(const struct intel_crtc_state *crtc_state)
 	intel_de_posting_read(dev_priv, DPLL(pipe));
 }
 
-
+#ifdef I915
 /**
  * vlv_force_pll_off - forcibly disable just the PLL
  * @dev_priv: i915 private structure
@@ -2039,6 +2044,7 @@ void vlv_force_pll_off(struct drm_i915_private *dev_priv, enum pipe pipe)
 	else
 		vlv_disable_pll(dev_priv, pipe);
 }
+#endif
 
 /* Only for pre-ILK configs */
 static void assert_pll(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/display/intel_dpll_mgr.c b/drivers/gpu/drm/i915/display/intel_dpll_mgr.c
index 1974eb580ed1..56b4055c9ef4 100644
--- a/drivers/gpu/drm/i915/display/intel_dpll_mgr.c
+++ b/drivers/gpu/drm/i915/display/intel_dpll_mgr.c
@@ -607,6 +607,7 @@ static void hsw_ddi_spll_enable(struct drm_i915_private *dev_priv,
 static void hsw_ddi_wrpll_disable(struct drm_i915_private *dev_priv,
 				  struct intel_shared_dpll *pll)
 {
+#ifdef I915
 	const enum intel_dpll_id id = pll->info->id;
 	u32 val;
 
@@ -620,11 +621,13 @@ static void hsw_ddi_wrpll_disable(struct drm_i915_private *dev_priv,
 	 */
 	if (dev_priv->pch_ssc_use & BIT(id))
 		intel_init_pch_refclk(dev_priv);
+#endif
 }
 
 static void hsw_ddi_spll_disable(struct drm_i915_private *dev_priv,
 				 struct intel_shared_dpll *pll)
 {
+#ifdef I915
 	enum intel_dpll_id id = pll->info->id;
 	u32 val;
 
@@ -638,6 +641,7 @@ static void hsw_ddi_spll_disable(struct drm_i915_private *dev_priv,
 	 */
 	if (dev_priv->pch_ssc_use & BIT(id))
 		intel_init_pch_refclk(dev_priv);
+#endif
 }
 
 static bool hsw_ddi_wrpll_get_hw_state(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/display/intel_dsb.c b/drivers/gpu/drm/i915/display/intel_dsb.c
index 3d63c1bf1e4f..0295348df562 100644
--- a/drivers/gpu/drm/i915/display/intel_dsb.c
+++ b/drivers/gpu/drm/i915/display/intel_dsb.c
@@ -4,11 +4,18 @@
  *
  */
 
+/* As with intel_dpt, this depends on some gem internals; fortunately easier to fix. */
+#ifdef I915
 #include "gem/i915_gem_internal.h"
+#else
+#include "xe_bo.h"
+#include "xe_gt.h"
+#endif
 
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "intel_de.h"
+#include "intel_dsb.h"
 #include "intel_display_types.h"
 #include "intel_dsb.h"
 
@@ -26,8 +33,12 @@ struct intel_dsb {
 	enum dsb_id id;
 
 	u32 *cmd_buf;
-	struct i915_vma *vma;
 	struct intel_crtc *crtc;
+#ifdef I915
+	struct i915_vma *vma;
+#else
+	struct xe_bo *obj;
+#endif
 
 	/*
 	 * free_pos will point the first free entry position
@@ -70,6 +81,43 @@ struct intel_dsb {
 #define DSB_BYTE_EN_SHIFT		20
 #define DSB_REG_VALUE_MASK		0xfffff
 
+static u32 dsb_ggtt_offset(struct intel_dsb *dsb)
+{
+#ifdef I915
+	return i915_ggtt_offset(dsb->vma);
+#else
+	return xe_bo_ggtt_addr(dsb->obj);
+#endif
+}
+
+static void dsb_write(struct intel_dsb *dsb, u32 idx, u32 val)
+{
+#ifdef I915
+	dsb->cmd_buf[idx] = val;
+#else
+	iosys_map_wr(&dsb->obj->vmap, idx * 4, u32, val);
+#endif
+}
+
+static u32 dsb_read(struct intel_dsb *dsb, u32 idx)
+{
+#ifdef I915
+	return dsb->cmd_buf[idx];
+#else
+	return iosys_map_rd(&dsb->obj->vmap, idx * 4, u32);
+#endif
+}
+
+static void dsb_memset(struct intel_dsb *dsb, u32 idx, u32 val, u32 sz)
+{
+#ifdef I915
+	memset(&dsb->cmd_buf[idx], val, sz);
+#else
+	iosys_map_memset(&dsb->obj->vmap, idx * 4, val, sz);
+#endif
+}
+
 static bool is_dsb_busy(struct drm_i915_private *i915, enum pipe pipe,
 			enum dsb_id id)
 {
@@ -130,8 +178,12 @@ void intel_dsb_indexed_reg_write(struct intel_dsb *dsb,
 {
 	struct intel_crtc *crtc = dsb->crtc;
 	struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);
-	u32 *buf = dsb->cmd_buf;
-	u32 reg_val;
+	u32 reg_val, old_val;
+
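+	/*
+	 * XXX: dsb and crtc are already dereferenced above, so a NULL dsb
+	 * cannot actually reach this fallback yet; the intent is "no DSB
+	 * available, write the register directly".
+	 */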
+	if (!dsb) {
+		intel_de_write_fw(dev_priv, reg, val);
+		return;
+	}
 
 	if (drm_WARN_ON(&dev_priv->drm, dsb->free_pos >= DSB_BUF_SIZE)) {
 		drm_dbg_kms(&dev_priv->drm, "DSB buffer overflow\n");
@@ -154,7 +206,7 @@ void intel_dsb_indexed_reg_write(struct intel_dsb *dsb,
 	 * we are writing odd no of dwords, Zeros will be added in the end for
 	 * padding.
 	 */
-	reg_val = buf[dsb->ins_start_offset + 1] & DSB_REG_VALUE_MASK;
+	reg_val = dsb_read(dsb, dsb->ins_start_offset + 1) & DSB_REG_VALUE_MASK;
 	if (reg_val != i915_mmio_reg_offset(reg)) {
 		/* Every instruction should be 8 byte aligned. */
 		dsb->free_pos = ALIGN(dsb->free_pos, 2);
@@ -162,26 +214,27 @@ void intel_dsb_indexed_reg_write(struct intel_dsb *dsb,
 		dsb->ins_start_offset = dsb->free_pos;
 
 		/* Update the size. */
-		buf[dsb->free_pos++] = 1;
+		dsb_write(dsb, dsb->free_pos++, 1);
 
 		/* Update the opcode and reg. */
-		buf[dsb->free_pos++] = (DSB_OPCODE_INDEXED_WRITE  <<
-					DSB_OPCODE_SHIFT) |
-					i915_mmio_reg_offset(reg);
+		dsb_write(dsb, dsb->free_pos++,
+			  (DSB_OPCODE_INDEXED_WRITE << DSB_OPCODE_SHIFT) |
+			  i915_mmio_reg_offset(reg));
 
 		/* Update the value. */
-		buf[dsb->free_pos++] = val;
+		dsb_write(dsb, dsb->free_pos++, val);
 	} else {
 		/* Update the new value. */
-		buf[dsb->free_pos++] = val;
+		dsb_write(dsb, dsb->free_pos++, val);
 
 		/* Update the size. */
-		buf[dsb->ins_start_offset]++;
+		old_val = dsb_read(dsb, dsb->ins_start_offset);
+		dsb_write(dsb, dsb->ins_start_offset, old_val + 1);
 	}
 
 	/* if number of data words is odd, then the last dword should be 0.*/
 	if (dsb->free_pos & 0x1)
-		buf[dsb->free_pos] = 0;
+		dsb_write(dsb, dsb->free_pos, 0);
 }
 
 /**
@@ -201,7 +254,11 @@ void intel_dsb_reg_write(struct intel_dsb *dsb,
 {
 	struct intel_crtc *crtc = dsb->crtc;
 	struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);
-	u32 *buf = dsb->cmd_buf;
+
+	if (!dsb) {
+		intel_de_write_fw(dev_priv, reg, val);
+		return;
+	}
 
 	if (drm_WARN_ON(&dev_priv->drm, dsb->free_pos >= DSB_BUF_SIZE)) {
 		drm_dbg_kms(&dev_priv->drm, "DSB buffer overflow\n");
@@ -209,10 +266,11 @@ void intel_dsb_reg_write(struct intel_dsb *dsb,
 	}
 
 	dsb->ins_start_offset = dsb->free_pos;
-	buf[dsb->free_pos++] = val;
-	buf[dsb->free_pos++] = (DSB_OPCODE_MMIO_WRITE  << DSB_OPCODE_SHIFT) |
-			       (DSB_BYTE_EN << DSB_BYTE_EN_SHIFT) |
-			       i915_mmio_reg_offset(reg);
+	dsb_write(dsb, dsb->free_pos++, val);
+	dsb_write(dsb, dsb->free_pos++,
+		  (DSB_OPCODE_MMIO_WRITE  << DSB_OPCODE_SHIFT) |
+		  (DSB_BYTE_EN << DSB_BYTE_EN_SHIFT) |
+		  i915_mmio_reg_offset(reg));
 }
 
 /**
@@ -240,12 +298,11 @@ void intel_dsb_commit(struct intel_dsb *dsb)
 		goto reset;
 	}
 	intel_de_write(dev_priv, DSB_HEAD(pipe, dsb->id),
-		       i915_ggtt_offset(dsb->vma));
+		       dsb_ggtt_offset(dsb));
 
-	tail = ALIGN(dsb->free_pos * 4, CACHELINE_BYTES);
+	tail = ALIGN(dsb->free_pos * 4, 64);
 	if (tail > dsb->free_pos * 4)
-		memset(&dsb->cmd_buf[dsb->free_pos], 0,
-		       (tail - dsb->free_pos * 4));
+		dsb_memset(dsb, dsb->free_pos, 0, (tail - dsb->free_pos * 4));
 
 	if (is_dsb_busy(dev_priv, pipe, dsb->id)) {
 		drm_err(&dev_priv->drm,
@@ -254,9 +311,9 @@ void intel_dsb_commit(struct intel_dsb *dsb)
 	}
 	drm_dbg_kms(&dev_priv->drm,
 		    "DSB execution started - head 0x%x, tail 0x%x\n",
-		    i915_ggtt_offset(dsb->vma), tail);
+		    dsb_ggtt_offset(dsb), tail);
 	intel_de_write(dev_priv, DSB_TAIL(pipe, dsb->id),
-		       i915_ggtt_offset(dsb->vma) + tail);
+		       dsb_ggtt_offset(dsb) + tail);
 	if (wait_for(!is_dsb_busy(dev_priv, pipe, dsb->id), 1)) {
 		drm_err(&dev_priv->drm,
 			"Timed out waiting for DSB workload completion.\n");
@@ -284,9 +341,9 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
 	struct drm_i915_private *i915 = to_i915(crtc->base.dev);
 	struct intel_dsb *dsb;
 	struct drm_i915_gem_object *obj;
-	struct i915_vma *vma;
-	u32 *buf;
+	__maybe_unused struct i915_vma *vma;
 	intel_wakeref_t wakeref;
+	__maybe_unused u32 *buf;
 
 	if (!HAS_DSB(i915))
 		return NULL;
@@ -297,6 +354,7 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
 
 	wakeref = intel_runtime_pm_get(&i915->runtime_pm);
 
+#ifdef I915
 	obj = i915_gem_object_create_internal(i915, DSB_BUF_SIZE);
 	if (IS_ERR(obj))
 		goto out_put_rpm;
@@ -319,6 +377,18 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
 	dsb->vma = vma;
 	dsb->crtc = crtc;
 	dsb->cmd_buf = buf;
+#else
+	obj = xe_bo_create_pin_map(i915, to_gt(i915), NULL, DSB_BUF_SIZE,
+				   ttm_bo_type_kernel,
+				   XE_BO_CREATE_VRAM_IF_DGFX(to_gt(i915)) |
+				   XE_BO_CREATE_GGTT_BIT);
+	if (IS_ERR(obj)) {
+		kfree(dsb);
+		goto out_put_rpm;
+	}
+	dsb->obj = obj;
+#endif
+	dsb->id = DSB1;
 	dsb->free_pos = 0;
 	dsb->ins_start_offset = 0;
 
@@ -343,6 +413,10 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
  */
 void intel_dsb_cleanup(struct intel_dsb *dsb)
 {
+#ifdef I915
 	i915_vma_unpin_and_release(&dsb->vma, I915_VMA_RELEASE_MAP);
+#else
+	xe_bo_unpin_map_no_vm(dsb->obj);
+#endif
 	kfree(dsb);
 }
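
[Note: with i915 the DSB command buffer is a plain CPU pointer
(cmd_buf), while with xe it is an iosys_map that may point at
write-combined VRAM, hence the dsb_write()/dsb_read()/dsb_memset()
accessors above instead of direct dereferences. A minimal sketch of the
same idiom; the function names here are illustrative, not from this
patch:

	#include <linux/iosys-map.h>

	/* Works whether the backing store is system memory or an
	 * ioremapped BAR; iosys_map_wr()/iosys_map_rd() select the
	 * plain or the _io copy variants under the hood. */
	static void cmd_buf_write_dw(struct iosys_map *map, u32 idx, u32 val)
	{
		iosys_map_wr(map, idx * sizeof(u32), u32, val);
	}

	static u32 cmd_buf_read_dw(struct iosys_map *map, u32 idx)
	{
		return iosys_map_rd(map, idx * sizeof(u32), u32);
	}
]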
diff --git a/drivers/gpu/drm/i915/display/intel_dsi_vbt.c b/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
index 2cbc1292ab38..b45552d96c0c 100644
--- a/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
+++ b/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
@@ -46,9 +46,11 @@
 #include "intel_dsi.h"
 #include "intel_dsi_vbt.h"
 #include "intel_gmbus_regs.h"
+#ifdef I915
 #include "vlv_dsi.h"
 #include "vlv_dsi_regs.h"
 #include "vlv_sideband.h"
+#endif
 
 #define MIPI_TRANSFER_MODE_SHIFT	0
 #define MIPI_VIRTUAL_CHANNEL_SHIFT	1
@@ -76,6 +78,7 @@ struct gpio_map {
 	bool init;
 };
 
+#ifdef I915
 static struct gpio_map vlv_gpio_table[] = {
 	{ VLV_GPIO_NC_0_HV_DDI0_HPD },
 	{ VLV_GPIO_NC_1_HV_DDI0_DDC_SDA },
@@ -90,6 +93,7 @@ static struct gpio_map vlv_gpio_table[] = {
 	{ VLV_GPIO_NC_10_PANEL1_BKLTEN },
 	{ VLV_GPIO_NC_11_PANEL1_BKLTCTL },
 };
+#endif
 
 struct i2c_adapter_lookup {
 	u16 slave_addr;
@@ -219,10 +223,10 @@ static const u8 *mipi_exec_send_packet(struct intel_dsi *intel_dsi,
 		mipi_dsi_dcs_write_buffer(dsi_device, data, len);
 		break;
 	}
-
+#ifdef I915
 	if (DISPLAY_VER(dev_priv) < 11)
 		vlv_dsi_wait_for_fifo_empty(intel_dsi, port);
-
+#endif
 out:
 	data += len;
 
@@ -242,6 +246,7 @@ static const u8 *mipi_exec_delay(struct intel_dsi *intel_dsi, const u8 *data)
 	return data;
 }
 
+#ifdef I915
 static void vlv_exec_gpio(struct intel_connector *connector,
 			  u8 gpio_source, u8 gpio_index, bool value)
 {
@@ -370,6 +375,7 @@ static void bxt_exec_gpio(struct intel_connector *connector,
 
 	gpiod_set_value(gpio_desc, value);
 }
+#endif
 
 static void icl_exec_gpio(struct intel_connector *connector,
 			  u8 gpio_source, u8 gpio_index, bool value)
@@ -491,12 +497,14 @@ static const u8 *mipi_exec_gpio(struct intel_dsi *intel_dsi, const u8 *data)
 		icl_native_gpio_set_value(dev_priv, gpio_number, value);
 	else if (DISPLAY_VER(dev_priv) >= 11)
 		icl_exec_gpio(connector, gpio_source, gpio_index, value);
+#ifdef I915
 	else if (IS_VALLEYVIEW(dev_priv))
 		vlv_exec_gpio(connector, gpio_source, gpio_number, value);
 	else if (IS_CHERRYVIEW(dev_priv))
 		chv_exec_gpio(connector, gpio_source, gpio_number, value);
 	else
 		bxt_exec_gpio(connector, gpio_source, gpio_index, value);
+#endif
 
 	return data;
 }
@@ -821,8 +829,10 @@ void intel_dsi_log_params(struct intel_dsi *intel_dsi)
 		    intel_dsi->clk_lp_to_hs_count);
 	drm_dbg_kms(&i915->drm, "HS to LP Clock Count 0x%x\n",
 		    intel_dsi->clk_hs_to_lp_count);
+#ifdef I915
 	drm_dbg_kms(&i915->drm, "BTA %s\n",
 		    str_enabled_disabled(!(intel_dsi->video_frmt_cfg_bits & DISABLE_VIDEO_BTA)));
+#endif
 }
 
 bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
@@ -841,9 +851,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
 	intel_dsi->eotp_pkt = mipi_config->eot_pkt_disabled ? 0 : 1;
 	intel_dsi->clock_stop = mipi_config->enable_clk_stop ? 1 : 0;
 	intel_dsi->lane_count = mipi_config->lane_cnt + 1;
-	intel_dsi->pixel_format =
-			pixel_format_from_register_bits(
-				mipi_config->videomode_color_format << 7);
+	intel_dsi->pixel_format = mipi_config->videomode_color_format << 7;
 
 	intel_dsi->dual_link = mipi_config->dual_link;
 	intel_dsi->pixel_overlap = mipi_config->pixel_overlap;
@@ -857,7 +865,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
 	intel_dsi->init_count = mipi_config->master_init_timer;
 	intel_dsi->bw_timer = mipi_config->dbi_bw_timer;
 	intel_dsi->video_frmt_cfg_bits =
-		mipi_config->bta_enabled ? DISABLE_VIDEO_BTA : 0;
+		mipi_config->bta_enabled ? BIT(3) /* DISABLE_VIDEO_BTA */ : 0;
 	intel_dsi->bgr_enabled = mipi_config->rgb_flip;
 
 	/* Starting point, adjusted depending on dual link and burst mode */
@@ -940,6 +948,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
  * If the GOP did not initialize the panel (HDMI inserted) we may need to also
  * change the pinmux for the SoC's PWM0 pin from GPIO to PWM.
  */
+#ifdef I915
 static struct gpiod_lookup_table pmic_panel_gpio_table = {
 	/* Intel GFX is consumer */
 	.dev_id = "0000:00:02.0",
@@ -963,9 +972,11 @@ static const struct pinctrl_map soc_pwm_pinctrl_map[] = {
 	PIN_MAP_MUX_GROUP("0000:00:02.0", "soc_pwm0", "INT33FC:00",
 			  "pwm0_grp", "pwm"),
 };
+#endif
 
 void intel_dsi_vbt_gpio_init(struct intel_dsi *intel_dsi, bool panel_is_on)
 {
+#ifdef I915
 	struct drm_device *dev = intel_dsi->base.base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct intel_connector *connector = intel_dsi->attached_connector;
@@ -1018,10 +1029,12 @@ void intel_dsi_vbt_gpio_init(struct intel_dsi *intel_dsi, bool panel_is_on)
 			intel_dsi->gpio_backlight = NULL;
 		}
 	}
+#endif
 }
 
 void intel_dsi_vbt_gpio_cleanup(struct intel_dsi *intel_dsi)
 {
+#ifdef I915
 	struct drm_device *dev = intel_dsi->base.base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct intel_connector *connector = intel_dsi->attached_connector;
@@ -1045,4 +1058,5 @@ void intel_dsi_vbt_gpio_cleanup(struct intel_dsi *intel_dsi)
 		pinctrl_unregister_mappings(soc_pwm_pinctrl_map);
 		gpiod_remove_lookup_table(&soc_panel_gpio_table);
 	}
+#endif
 }
diff --git a/drivers/gpu/drm/i915/display/intel_fb.c b/drivers/gpu/drm/i915/display/intel_fb.c
index 56cdacf33db2..e0a8d9e9df9a 100644
--- a/drivers/gpu/drm/i915/display/intel_fb.c
+++ b/drivers/gpu/drm/i915/display/intel_fb.c
@@ -4,6 +4,7 @@
  */
 
 #include <drm/drm_blend.h>
+#include <drm/drm_damage_helper.h>
 #include <drm/drm_framebuffer.h>
 #include <drm/drm_modeset_helper.h>
 
@@ -14,6 +15,16 @@
 #include "intel_fb.h"
 #include "intel_frontbuffer.h"
 
+#ifdef I915
+/*
+ * i915 requires obj->__do_not_access.base,
+ * xe uses obj->ttm.base
+ */
+#define ttm __do_not_access
+#else
+#include <drm/ttm/ttm_bo.h>
+#endif
+
 #define check_array_bounds(i915, a, i) drm_WARN_ON(&(i915)->drm, (i) >= ARRAY_SIZE(a))
 
 /*
@@ -697,6 +708,7 @@ intel_fb_align_height(const struct drm_framebuffer *fb,
 	return ALIGN(height, tile_height);
 }
 
+#ifdef I915
 static unsigned int intel_fb_modifier_to_tiling(u64 fb_modifier)
 {
 	u8 tiling_caps = lookup_modifier(fb_modifier)->plane_caps &
@@ -716,6 +728,7 @@ static unsigned int intel_fb_modifier_to_tiling(u64 fb_modifier)
 		return I915_TILING_NONE;
 	}
 }
+#endif
 
 static bool intel_modifier_uses_dpt(struct drm_i915_private *i915, u64 modifier)
 {
@@ -1234,7 +1247,6 @@ static bool intel_plane_needs_remap(const struct intel_plane_state *plane_state)
 static int convert_plane_offset_to_xy(const struct intel_framebuffer *fb, int color_plane,
 				      int plane_width, int *x, int *y)
 {
-	struct drm_i915_gem_object *obj = intel_fb_obj(&fb->base);
 	int ret;
 
 	ret = intel_fb_offset_to_xy(x, y, &fb->base, color_plane);
@@ -1258,13 +1270,15 @@ static int convert_plane_offset_to_xy(const struct intel_framebuffer *fb, int co
 	 * fb layout agrees with the fence layout. We already check that the
 	 * fb stride matches the fence stride elsewhere.
 	 */
-	if (color_plane == 0 && i915_gem_object_is_tiled(obj) &&
+#ifdef I915
+	if (color_plane == 0 && i915_gem_object_is_tiled(intel_fb_obj(&fb->base)) &&
 	    (*x + plane_width) * fb->base.format->cpp[color_plane] > fb->base.pitches[color_plane]) {
 		drm_dbg_kms(fb->base.dev,
 			    "bad fb plane %d offset: 0x%x\n",
 			    color_plane, fb->base.offsets[color_plane]);
 		return -EINVAL;
 	}
+#endif
 
 	return 0;
 }
@@ -1611,10 +1625,10 @@ int intel_fill_fb_info(struct drm_i915_private *i915, struct intel_framebuffer *
 		max_size = max(max_size, offset + size);
 	}
 
-	if (mul_u32_u32(max_size, tile_size) > obj->base.size) {
+	if (mul_u32_u32(max_size, tile_size) > obj->ttm.base.size) {
 		drm_dbg_kms(&i915->drm,
 			    "fb too big for bo (need %llu bytes, have %zu bytes)\n",
-			    mul_u32_u32(max_size, tile_size), obj->base.size);
+			    mul_u32_u32(max_size, tile_size), obj->ttm.base.size);
 		return -EINVAL;
 	}
 
@@ -1830,8 +1844,10 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
 
 	drm_framebuffer_cleanup(fb);
 
+#ifdef I915
 	if (intel_fb_uses_dpt(fb))
 		intel_dpt_destroy(intel_fb->dpt_vm);
+#endif
 
 	drm_gem_object_put(fb->obj[0]);
 	kfree(intel_fb);
@@ -1842,47 +1858,53 @@ static int intel_user_framebuffer_create_handle(struct drm_framebuffer *fb,
 						unsigned int *handle)
 {
 	struct drm_i915_gem_object *obj = intel_fb_obj(fb);
-	struct drm_i915_private *i915 = to_i915(obj->base.dev);
 
+#ifdef I915
 	if (i915_gem_object_is_userptr(obj)) {
-		drm_dbg(&i915->drm,
+		drm_dbg(fb->dev,
 			"attempting to use a userptr for a framebuffer, denied\n");
 		return -EINVAL;
 	}
+#endif
 
-	return drm_gem_handle_create(file, &obj->base, handle);
+	return drm_gem_handle_create(file, &obj->ttm.base, handle);
 }
 
+#ifdef I915
 static int intel_user_framebuffer_dirty(struct drm_framebuffer *fb,
 					struct drm_file *file,
 					unsigned int flags, unsigned int color,
 					struct drm_clip_rect *clips,
 					unsigned int num_clips)
 {
-	struct drm_i915_gem_object *obj = intel_fb_obj(fb);
-
-	i915_gem_object_flush_if_display(obj);
+	i915_gem_object_flush_if_display(intel_fb_obj(fb));
 	intel_frontbuffer_flush(to_intel_framebuffer(fb), ORIGIN_DIRTYFB);
 
 	return 0;
 }
+#endif
 
 static const struct drm_framebuffer_funcs intel_fb_funcs = {
 	.destroy = intel_user_framebuffer_destroy,
 	.create_handle = intel_user_framebuffer_create_handle,
+#ifdef I915
 	.dirty = intel_user_framebuffer_dirty,
+#else
+	.dirty = drm_atomic_helper_dirtyfb,
+#endif
 };
 
 int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			   struct drm_i915_gem_object *obj,
 			   struct drm_mode_fb_cmd2 *mode_cmd)
 {
-	struct drm_i915_private *dev_priv = to_i915(obj->base.dev);
+	struct drm_i915_private *dev_priv = to_i915(obj->ttm.base.dev);
 	struct drm_framebuffer *fb = &intel_fb->base;
 	u32 max_stride;
-	unsigned int tiling, stride;
 	int ret = -EINVAL;
 	int i;
+#ifdef I915
+	unsigned int tiling, stride;
 
 	i915_gem_object_lock(obj, NULL);
 	tiling = i915_gem_object_get_tiling(obj);
@@ -1909,6 +1931,29 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			goto err;
 		}
 	}
+#else
+	ret = ttm_bo_reserve(&obj->ttm, true, false, NULL);
+	if (ret)
+		goto err;
+	ret = -EINVAL;
+
+	if (!(obj->flags & XE_BO_SCANOUT_BIT)) {
+		/*
+		 * XE_BO_SCANOUT_BIT should ideally be set at creation, or is
+		 * automatically set when creating FB. We cannot change caching
+		 * mode when the object is VM_BINDed, so we can only set
+		 * coherency with display when unbound.
+		 */
+		if (XE_IOCTL_ERR(dev_priv, !list_empty(&obj->vmas))) {
+			ttm_bo_unreserve(&obj->ttm);
+			goto err;
+		}
+		obj->flags |= XE_BO_SCANOUT_BIT;
+	}
+	ttm_bo_unreserve(&obj->ttm);
+#endif
+
+	atomic_set(&intel_fb->bits, 0);
 
 	if (!drm_any_plane_has_format(&dev_priv->drm,
 				      mode_cmd->pixel_format,
@@ -1919,6 +1964,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 		goto err;
 	}
 
+#ifdef I915
 	/*
 	 * gen2/3 display engine uses the fence if present,
 	 * so the tiling mode must match the fb modifier exactly.
@@ -1929,6 +1975,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			    "tiling_mode must match fb modifier exactly on gen2/3\n");
 		goto err;
 	}
+#endif
 
 	max_stride = intel_fb_max_stride(dev_priv, mode_cmd->pixel_format,
 					 mode_cmd->modifier[0]);
@@ -1941,6 +1988,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 		goto err;
 	}
 
+#ifdef I915
 	/*
 	 * If there's a fence, enforce that
 	 * the fb pitch and fence stride match.
@@ -1951,6 +1999,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			    mode_cmd->pitches[0], stride);
 		goto err;
 	}
+#endif
 
 	/* FIXME need to adjust LINOFF/TILEOFF accordingly. */
 	if (mode_cmd->offsets[0] != 0) {
@@ -1991,13 +2040,14 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			}
 		}
 
-		fb->obj[i] = &obj->base;
+		fb->obj[i] = &obj->ttm.base;
 	}
 
 	ret = intel_fill_fb_info(dev_priv, intel_fb);
 	if (ret)
 		goto err;
 
+#ifdef I915
 	if (intel_fb_uses_dpt(fb)) {
 		struct i915_address_space *vm;
 
@@ -2009,6 +2059,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 
 		intel_fb->dpt_vm = vm;
 	}
+#endif
 
 	ret = drm_framebuffer_init(&dev_priv->drm, fb, &intel_fb_funcs);
 	if (ret) {
@@ -2031,22 +2082,35 @@ intel_user_framebuffer_create(struct drm_device *dev,
 	struct drm_framebuffer *fb;
 	struct drm_i915_gem_object *obj;
 	struct drm_mode_fb_cmd2 mode_cmd = *user_mode_cmd;
-	struct drm_i915_private *i915;
+	struct drm_i915_private *i915 = to_i915(dev);
 
+#ifdef I915
 	obj = i915_gem_object_lookup(filp, mode_cmd.handles[0]);
 	if (!obj)
 		return ERR_PTR(-ENOENT);
 
 	/* object is backed with LMEM for discrete */
-	i915 = to_i915(obj->base.dev);
 	if (HAS_LMEM(i915) && !i915_gem_object_can_migrate(obj, INTEL_REGION_LMEM_0)) {
 		/* object is "remote", not in local memory */
 		i915_gem_object_put(obj);
 		return ERR_PTR(-EREMOTE);
 	}
+#else
+	struct drm_gem_object *gem = drm_gem_object_lookup(filp, mode_cmd.handles[0]);
+	if (!gem)
+		return ERR_PTR(-ENOENT);
+
+	obj = gem_to_xe_bo(gem);
+	/* Require vram exclusive objects, but allow dma-buf imports */
+	if (IS_DGFX(i915) && obj->flags & XE_BO_CREATE_SYSTEM_BIT &&
+	    obj->ttm.type != ttm_bo_type_sg) {
+		drm_gem_object_put(gem);
+		return ERR_PTR(-EREMOTE);
+	}
+#endif
 
 	fb = intel_framebuffer_create(obj, &mode_cmd);
-	i915_gem_object_put(obj);
+	drm_gem_object_put(&obj->ttm.base);
 
 	return fb;
 }
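
[Note: the "#define ttm __do_not_access" near the top of this file is
what lets shared code spell the embedded GEM object the same way for
both drivers: xe_bo embeds a struct ttm_buffer_object, whose .base
member is the struct drm_gem_object, while i915 wraps its GEM object in
a member display code is not supposed to touch directly. A simplified
illustration (not the real struct layouts):

	struct xe_bo {
		struct ttm_buffer_object ttm; /* ttm.base is the drm_gem_object */
		/* ... */
	};

With the #define in place, obj->ttm.base resolves to
obj->__do_not_access.base when building for i915, so expressions like
obj->ttm.base.size compile identically for both drivers.]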
diff --git a/drivers/gpu/drm/i915/display/intel_fbc.c b/drivers/gpu/drm/i915/display/intel_fbc.c
index 5e69d3c11d21..77c848b5b7ae 100644
--- a/drivers/gpu/drm/i915/display/intel_fbc.c
+++ b/drivers/gpu/drm/i915/display/intel_fbc.c
@@ -45,7 +45,9 @@
 
 #include "i915_drv.h"
 #include "i915_utils.h"
+#ifdef I915
 #include "i915_vgpu.h"
+#endif
 #include "intel_cdclk.h"
 #include "intel_de.h"
 #include "intel_display_trace.h"
@@ -53,6 +55,32 @@
 #include "intel_fbc.h"
 #include "intel_frontbuffer.h"
 
+#ifdef I915
+
+#define i915_gem_stolen_initialized(i915) (drm_mm_initialized(&(i915)->mm.stolen))
+
+#else
+
+/* No stolen memory support in xe yet */
+static int i915_gem_stolen_insert_node_in_range(struct xe_device *xe, void *ptr, u32 size, u32 align, u32 start, u32 end)
+{
+	return -ENODEV;
+}
+
+static int i915_gem_stolen_insert_node(struct xe_device *xe, void *ptr, u32 size, u32 align)
+{
+	XE_WARN_ON(1);
+	return -ENODEV;
+}
+
+static void i915_gem_stolen_remove_node(struct xe_device *xe, void *ptr)
+{
+}
+
+#define i915_gem_stolen_initialized(xe) ((xe) && 0)
+
+#endif
+
 #define for_each_fbc_id(__dev_priv, __fbc_id) \
 	for ((__fbc_id) = INTEL_FBC_A; (__fbc_id) < I915_MAX_FBCS; (__fbc_id)++) \
 		for_each_if(RUNTIME_INFO(__dev_priv)->fbc_mask & BIT(__fbc_id))
@@ -329,6 +357,7 @@ static void i8xx_fbc_nuke(struct intel_fbc *fbc)
 
 static void i8xx_fbc_program_cfb(struct intel_fbc *fbc)
 {
+#ifdef I915
 	struct drm_i915_private *i915 = fbc->i915;
 
 	GEM_BUG_ON(range_overflows_end_t(u64, i915->dsm.start,
@@ -340,6 +369,7 @@ static void i8xx_fbc_program_cfb(struct intel_fbc *fbc)
 		       i915->dsm.start + fbc->compressed_fb.start);
 	intel_de_write(i915, FBC_LL_BASE,
 		       i915->dsm.start + fbc->compressed_llb.start);
+#endif
 }
 
 static const struct intel_fbc_funcs i8xx_fbc_funcs = {
@@ -604,8 +634,10 @@ static void ivb_fbc_activate(struct intel_fbc *fbc)
 	else if (DISPLAY_VER(i915) == 9)
 		skl_fbc_program_cfb_stride(fbc);
 
+#ifdef I915
 	if (to_gt(i915)->ggtt->num_fences)
 		snb_fbc_program_fence(fbc);
+#endif
 
 	intel_de_write(i915, ILK_DPFC_CONTROL(fbc->id),
 		       DPFC_CTL_EN | ivb_dpfc_ctl(fbc));
@@ -710,10 +742,14 @@ static u64 intel_fbc_stolen_end(struct drm_i915_private *i915)
 	 * reserved range size, so it always assumes the maximum (8mb) is used.
 	 * If we enable FBC using a CFB on that memory range we'll get FIFO
 	 * underruns, even if that range is not reserved by the BIOS. */
+#ifdef I915
 	if (IS_BROADWELL(i915) ||
 	    (DISPLAY_VER(i915) == 9 && !IS_BROXTON(i915)))
 		end = resource_size(&i915->dsm) - 8 * 1024 * 1024;
 	else
+#else
+	/* TODO */
+#endif
 		end = U64_MAX;
 
 	return min(end, intel_fbc_cfb_base_max(i915));
@@ -799,7 +835,7 @@ static int intel_fbc_alloc_cfb(struct intel_fbc *fbc,
 	if (drm_mm_node_allocated(&fbc->compressed_llb))
 		i915_gem_stolen_remove_node(i915, &fbc->compressed_llb);
 err:
-	if (drm_mm_initialized(&i915->mm.stolen))
+	if (i915_gem_stolen_initialized(i915))
 		drm_info_once(&i915->drm, "not enough stolen space for compressed buffer (need %d more bytes), disabling. Hint: you may be able to increase stolen memory size in the BIOS to avoid this.\n", size);
 	return -ENOSPC;
 }
@@ -970,7 +1006,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 				   struct intel_crtc *crtc,
 				   struct intel_plane *plane)
 {
-	struct drm_i915_private *i915 = to_i915(state->base.dev);
+	__maybe_unused struct drm_i915_private *i915 = to_i915(state->base.dev);
 	const struct intel_crtc_state *crtc_state =
 		intel_atomic_get_new_crtc_state(state, crtc);
 	const struct intel_plane_state *plane_state =
@@ -985,7 +1021,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 
 	/* FBC1 compression interval: arbitrary choice of 1 second */
 	fbc_state->interval = drm_mode_vrefresh(&crtc_state->hw.adjusted_mode);
-
+#ifdef I915
 	fbc_state->fence_y_offset = intel_plane_fence_y_offset(plane_state);
 
 	drm_WARN_ON(&i915->drm, plane_state->flags & PLANE_HAS_FENCE &&
@@ -995,6 +1031,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 	    plane_state->ggtt_vma->fence)
 		fbc_state->fence_id = plane_state->ggtt_vma->fence->id;
 	else
+#endif
 		fbc_state->fence_id = -1;
 
 	fbc_state->cfb_stride = intel_fbc_cfb_stride(plane_state);
@@ -1004,6 +1041,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 
 static bool intel_fbc_is_fence_ok(const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	struct drm_i915_private *i915 = to_i915(plane_state->uapi.plane->dev);
 
 	/*
@@ -1021,6 +1059,9 @@ static bool intel_fbc_is_fence_ok(const struct intel_plane_state *plane_state)
 	return DISPLAY_VER(i915) >= 9 ||
 		(plane_state->flags & PLANE_HAS_FENCE &&
 		 plane_state->ggtt_vma->fence);
+#else
+	return true;
+#endif
 }
 
 static bool intel_fbc_is_cfb_ok(const struct intel_plane_state *plane_state)
@@ -1706,7 +1747,7 @@ void intel_fbc_init(struct drm_i915_private *i915)
 {
 	enum intel_fbc_id fbc_id;
 
-	if (!drm_mm_initialized(&i915->mm.stolen))
+	if (!i915_gem_stolen_initialized(i915))
 		RUNTIME_INFO(i915)->fbc_mask = 0;
 
 	if (need_fbc_vtd_wa(i915))
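
[Note: the stubs at the top of this file make every compressed
framebuffer allocation fail cleanly on xe: i915_gem_stolen_initialized()
evaluates to false, so intel_fbc_init() above clears the FBC mask and
FBC stays off until xe grows stolen memory support. The general stub-out
technique, sketched with a hypothetical feature name:

	/* Probe stubbed to "absent": callers keep their normal error
	 * handling, and no #ifdefs are needed at the call sites. */
	static int frob_alloc_node(struct xe_device *xe, void *node, u32 size)
	{
		return -ENODEV; /* not supported (yet) on this driver */
	}
]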
diff --git a/drivers/gpu/drm/i915/display/intel_fbdev.c b/drivers/gpu/drm/i915/display/intel_fbdev.c
index 8ccdf1a964ff..176e0e44a268 100644
--- a/drivers/gpu/drm/i915/display/intel_fbdev.c
+++ b/drivers/gpu/drm/i915/display/intel_fbdev.c
@@ -41,7 +41,11 @@
 #include <drm/drm_fb_helper.h>
 #include <drm/drm_fourcc.h>
 
+#ifdef I915
 #include "gem/i915_gem_lmem.h"
+#else
+#include "xe_gt.h"
+#endif
 
 #include "i915_drv.h"
 #include "intel_display_types.h"
@@ -50,6 +54,14 @@
 #include "intel_fbdev.h"
 #include "intel_frontbuffer.h"
 
+#ifdef I915
+/*
+ * i915 requires obj->__do_not_access.base,
+ * xe uses obj->ttm.base
+ */
+#define ttm __do_not_access
+#endif
+
 struct intel_fbdev {
 	struct drm_fb_helper helper;
 	struct intel_framebuffer *fb;
@@ -147,14 +159,19 @@ static int intelfb_alloc(struct drm_fb_helper *helper,
 	mode_cmd.width = sizes->surface_width;
 	mode_cmd.height = sizes->surface_height;
 
+#ifdef I915
 	mode_cmd.pitches[0] = ALIGN(mode_cmd.width *
 				    DIV_ROUND_UP(sizes->surface_bpp, 8), 64);
+#else
+	mode_cmd.pitches[0] = ALIGN(mode_cmd.width *
+				    DIV_ROUND_UP(sizes->surface_bpp, 8), GEN8_PAGE_SIZE);
+#endif
 	mode_cmd.pixel_format = drm_mode_legacy_fb_format(sizes->surface_bpp,
 							  sizes->surface_depth);
 
 	size = mode_cmd.pitches[0] * mode_cmd.height;
 	size = PAGE_ALIGN(size);
-
+#ifdef I915
 	obj = ERR_PTR(-ENODEV);
 	if (HAS_LMEM(dev_priv)) {
 		obj = i915_gem_object_create_lmem(dev_priv, size,
@@ -170,6 +187,13 @@ static int intelfb_alloc(struct drm_fb_helper *helper,
 		if (IS_ERR(obj))
 			obj = i915_gem_object_create_shmem(dev_priv, size);
 	}
+#else
+	/* XXX: Care about stolen? */
+	obj = xe_bo_create_pin_map(dev_priv, to_gt(dev_priv), NULL, size,
+				   ttm_bo_type_kernel,
+				   XE_BO_CREATE_VRAM_IF_DGFX(to_gt(dev_priv)) |
+				   XE_BO_CREATE_PINNED_BIT | XE_BO_SCANOUT_BIT);
+#endif
 
 	if (IS_ERR(obj)) {
 		drm_err(&dev_priv->drm, "failed to allocate framebuffer (%pe)\n", obj);
@@ -177,10 +201,16 @@ static int intelfb_alloc(struct drm_fb_helper *helper,
 	}
 
 	fb = intel_framebuffer_create(obj, &mode_cmd);
-	i915_gem_object_put(obj);
-	if (IS_ERR(fb))
+	if (IS_ERR(fb)) {
+#ifdef I915
+		i915_gem_object_put(obj);
+#else
+		xe_bo_unpin_map_no_vm(obj);
+#endif
 		return PTR_ERR(fb);
+	}
 
+	drm_gem_object_put(&obj->ttm.base);
 	ifbdev->fb = to_intel_framebuffer(fb);
 	return 0;
 }
@@ -194,7 +224,6 @@ static int intelfb_create(struct drm_fb_helper *helper,
 	struct drm_device *dev = helper->dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct pci_dev *pdev = to_pci_dev(dev_priv->drm.dev);
-	struct i915_ggtt *ggtt = to_gt(dev_priv)->ggtt;
 	const struct i915_gtt_view view = {
 		.type = I915_GTT_VIEW_NORMAL,
 	};
@@ -264,6 +293,7 @@ static int intelfb_create(struct drm_fb_helper *helper,
 
 	/* setup aperture base/size for vesafb takeover */
 	obj = intel_fb_obj(&intel_fb->base);
+#ifdef I915
 	if (i915_gem_object_is_lmem(obj)) {
 		struct intel_memory_region *mem = obj->mm.region;
 
@@ -276,6 +306,8 @@ static int intelfb_create(struct drm_fb_helper *helper,
 					i915_gem_object_get_dma_address(obj, 0));
 		info->fix.smem_len = obj->base.size;
 	} else {
+		struct i915_ggtt *ggtt = to_gt(dev_priv)->ggtt;
+
 		info->apertures->ranges[0].base = ggtt->gmadr.start;
 		info->apertures->ranges[0].size = ggtt->mappable_end;
 
@@ -284,8 +316,36 @@ static int intelfb_create(struct drm_fb_helper *helper,
 			(unsigned long)(ggtt->gmadr.start + i915_ggtt_offset(vma));
 		info->fix.smem_len = vma->size;
 	}
-
 	vaddr = i915_vma_pin_iomap(vma);
+
+#else
+	/* XXX: Could be pure fiction.. */
+	if (obj->flags & XE_BO_CREATE_VRAM0_BIT) {
+		struct xe_gt *gt = to_gt(dev_priv);
+		bool lmem;
+
+		info->apertures->ranges[0].base = gt->mem.vram.io_start;
+		info->apertures->ranges[0].size = gt->mem.vram.size;
+
+		info->fix.smem_start =
+			(unsigned long)(gt->mem.vram.io_start + xe_bo_addr(obj, 0, 4096, &lmem));
+		info->fix.smem_len = obj->ttm.base.size;
+
+	} else {
+		struct pci_dev *pdev = to_pci_dev(dev_priv->drm.dev);
+
+		info->apertures->ranges[0].base = pci_resource_start(pdev, 2);
+		info->apertures->ranges[0].size =
+			pci_resource_end(pdev, 2) - pci_resource_start(pdev, 2);
+
+		info->fix.smem_start = info->apertures->ranges[0].base + xe_bo_ggtt_addr(obj);
+		info->fix.smem_len = obj->ttm.base.size;
+	}
+
+	/* TODO: ttm_bo_kmap? */
+	vaddr = obj->vmap.vaddr;
+#endif
+
 	if (IS_ERR(vaddr)) {
 		drm_err(&dev_priv->drm,
 			"Failed to remap framebuffer into virtual memory (%pe)\n", vaddr);
@@ -293,7 +353,7 @@ static int intelfb_create(struct drm_fb_helper *helper,
 		goto out_unpin;
 	}
 	info->screen_base = vaddr;
-	info->screen_size = vma->size;
+	info->screen_size = obj->ttm.base.size;
 
 	drm_fb_helper_fill_info(info, &ifbdev->helper, sizes);
 
@@ -301,14 +361,23 @@ static int intelfb_create(struct drm_fb_helper *helper,
 	 * If the object is stolen however, it will be full of whatever
 	 * garbage was left in there.
 	 */
+#ifdef I915
 	if (!i915_gem_object_is_shmem(vma->obj) && !prealloc)
+#else
+	/* XXX: Check stolen bit? */
+	if (!(obj->flags & XE_BO_CREATE_SYSTEM_BIT) && !prealloc)
+#endif
 		memset_io(info->screen_base, 0, info->screen_size);
 
 	/* Use default scratch pixmap (info->pixmap.flags = FB_PIXMAP_SYSTEM) */
 
 	drm_dbg_kms(&dev_priv->drm, "allocated %dx%d fb: 0x%08x\n",
 		    ifbdev->fb->base.width, ifbdev->fb->base.height,
+#ifdef I915
 		    i915_ggtt_offset(vma));
+#else
+		    (u32)vma->node.start);
+#endif
 	ifbdev->vma = vma;
 	ifbdev->vma_flags = flags;
 
@@ -339,8 +408,17 @@ static void intel_fbdev_destroy(struct intel_fbdev *ifbdev)
 	if (ifbdev->vma)
 		intel_unpin_fb_vma(ifbdev->vma, ifbdev->vma_flags);
 
-	if (ifbdev->fb)
+	if (ifbdev->fb) {
+#ifndef I915
+		struct xe_bo *bo = intel_fb_obj(&ifbdev->fb->base);
+
+		/* Unpin our kernel fb first */
+		xe_bo_lock_no_vm(bo, NULL);
+		xe_bo_unpin(bo);
+		xe_bo_unlock_no_vm(bo);
+#endif
 		drm_framebuffer_remove(&ifbdev->fb->base);
+	}
 
 	kfree(ifbdev);
 }
@@ -387,12 +465,12 @@ static bool intel_fbdev_init_bios(struct drm_device *dev,
 			continue;
 		}
 
-		if (obj->base.size > max_size) {
+		if (obj->ttm.base.size > max_size) {
 			drm_dbg_kms(&i915->drm,
 				    "found possible fb from [PLANE:%d:%s]\n",
 				    plane->base.base.id, plane->base.name);
 			fb = to_intel_framebuffer(plane_state->uapi.fb);
-			max_size = obj->base.size;
+			max_size = obj->ttm.base.size;
 		}
 	}
 
@@ -658,8 +736,13 @@ void intel_fbdev_set_suspend(struct drm_device *dev, int state, bool synchronous
 	 * been restored from swap. If the object is stolen however, it will be
 	 * full of whatever garbage was left in there.
 	 */
+#ifdef I915
 	if (state == FBINFO_STATE_RUNNING &&
 	    !i915_gem_object_is_shmem(intel_fb_obj(&ifbdev->fb->base)))
+#else
+	if (state == FBINFO_STATE_RUNNING &&
+	    !(intel_fb_obj(&ifbdev->fb->base)->flags & XE_BO_CREATE_SYSTEM_BIT))
+#endif
 		memset_io(info->screen_base, 0, info->screen_size);
 
 	drm_fb_helper_set_suspend(&ifbdev->helper, state);
diff --git a/drivers/gpu/drm/i915/display/intel_gmbus.c b/drivers/gpu/drm/i915/display/intel_gmbus.c
index 0bc4f6b48e80..2d099f4c52cd 100644
--- a/drivers/gpu/drm/i915/display/intel_gmbus.c
+++ b/drivers/gpu/drm/i915/display/intel_gmbus.c
@@ -39,7 +39,7 @@
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_gmbus.h"
-#include "intel_gmbus_regs.h"
+#include "../i915/display/intel_gmbus_regs.h"
 
 struct intel_gmbus {
 	struct i2c_adapter adapter;
diff --git a/drivers/gpu/drm/i915/display/intel_lpe_audio.h b/drivers/gpu/drm/i915/display/intel_lpe_audio.h
index f848c5038714..1e236df9273b 100644
--- a/drivers/gpu/drm/i915/display/intel_lpe_audio.h
+++ b/drivers/gpu/drm/i915/display/intel_lpe_audio.h
@@ -12,11 +12,19 @@ enum pipe;
 enum port;
 struct drm_i915_private;
 
+#ifdef I915
 int  intel_lpe_audio_init(struct drm_i915_private *dev_priv);
 void intel_lpe_audio_teardown(struct drm_i915_private *dev_priv);
 void intel_lpe_audio_irq_handler(struct drm_i915_private *dev_priv);
 void intel_lpe_audio_notify(struct drm_i915_private *dev_priv,
 			    enum pipe pipe, enum port port,
 			    const void *eld, int ls_clock, bool dp_output);
+#else
+#define intel_lpe_audio_init(xe) (-ENODEV)
+#define intel_lpe_audio_teardown(xe) BUG_ON(1)
+#define intel_lpe_audio_irq_handler(xe) do { } while (0)
+#define intel_lpe_audio_notify(xe, a, b, c, d, e) do { } while (0)
+
+#endif
 
 #endif /* __INTEL_LPE_AUDIO_H__ */
diff --git a/drivers/gpu/drm/i915/display/intel_modeset_setup.c b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
index 96395bfbd41d..6f705acaf225 100644
--- a/drivers/gpu/drm/i915/display/intel_modeset_setup.c
+++ b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
@@ -721,18 +721,21 @@ void intel_modeset_setup_hw_state(struct drm_i915_private *i915,
 
 	intel_dpll_sanitize_state(i915);
 
-	if (IS_G4X(i915)) {
+	if (DISPLAY_VER(i915) >= 9) {
+		skl_wm_get_hw_state(i915);
+		skl_wm_sanitize(i915);
+	}
+#ifdef I915
+	else if (IS_G4X(i915)) {
 		g4x_wm_get_hw_state(i915);
 		g4x_wm_sanitize(i915);
 	} else if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915)) {
 		vlv_wm_get_hw_state(i915);
 		vlv_wm_sanitize(i915);
-	} else if (DISPLAY_VER(i915) >= 9) {
-		skl_wm_get_hw_state(i915);
-		skl_wm_sanitize(i915);
 	} else if (HAS_PCH_SPLIT(i915)) {
 		ilk_wm_get_hw_state(i915);
 	}
+#endif
 
 	for_each_intel_crtc(&i915->drm, crtc) {
 		struct intel_crtc_state *crtc_state =
diff --git a/drivers/gpu/drm/i915/display/intel_opregion.c b/drivers/gpu/drm/i915/display/intel_opregion.c
index e0184745632c..057a68237efe 100644
--- a/drivers/gpu/drm/i915/display/intel_opregion.c
+++ b/drivers/gpu/drm/i915/display/intel_opregion.c
@@ -37,7 +37,7 @@
 #include "intel_backlight.h"
 #include "intel_display_types.h"
 #include "intel_opregion.h"
-#include "intel_pci_config.h"
+#include "../i915/intel_pci_config.h"
 
 #define OPREGION_HEADER_OFFSET 0
 #define OPREGION_ACPI_OFFSET   0x100
diff --git a/drivers/gpu/drm/i915/display/intel_pch_display.h b/drivers/gpu/drm/i915/display/intel_pch_display.h
index 41a63413cb3d..e8b50e9a4969 100644
--- a/drivers/gpu/drm/i915/display/intel_pch_display.h
+++ b/drivers/gpu/drm/i915/display/intel_pch_display.h
@@ -15,6 +15,7 @@ struct intel_crtc;
 struct intel_crtc_state;
 struct intel_link_m_n;
 
+#ifdef I915
 bool intel_has_pch_trancoder(struct drm_i915_private *i915,
 			     enum pipe pch_transcoder);
 enum pipe intel_crtc_pch_transcoder(struct intel_crtc *crtc);
@@ -41,5 +42,20 @@ void intel_pch_transcoder_get_m2_n2(struct intel_crtc *crtc,
 				    struct intel_link_m_n *m_n);
 
 void intel_pch_sanitize(struct drm_i915_private *i915);
+#else
+#define intel_has_pch_trancoder(xe, pipe) (xe && pipe && 0)
+#define intel_crtc_pch_transcoder(crtc) ((crtc)->pipe)
+#define ilk_pch_pre_enable(state, crtc) do { } while (0)
+#define ilk_pch_enable(state, crtc) do { } while (0)
+#define ilk_pch_disable(state, crtc) do { } while (0)
+#define ilk_pch_post_disable(state, crtc) do { } while (0)
+#define ilk_pch_get_config(crtc) do { } while (0)
+#define lpt_pch_enable(state, crtc) do { } while (0)
+#define lpt_pch_disable(state, crtc) do { } while (0)
+#define lpt_pch_get_config(crtc) do { } while (0)
+#define intel_pch_transcoder_get_m1_n1(crtc, m_n) memset((m_n), 0, sizeof(*m_n))
+#define intel_pch_transcoder_get_m2_n2(crtc, m_n) memset((m_n), 0, sizeof(*m_n))
+#define intel_pch_sanitize(xe) do { } while (0)
+#endif
 
 #endif
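
[Note: for hardware that xe platforms never ship with (PCH display here,
LPE audio and PCH reference clocks in the neighboring headers), the
pattern is to keep the declarations for i915 and replace them with no-op
macro stubs for xe, so the call sites compile unchanged. The shape of
the idiom, with a hypothetical function name:

	#ifdef I915
	void frob_pch_widget(struct drm_i915_private *i915);
	#else
	/* no PCH on xe platforms: compile the call away */
	#define frob_pch_widget(xe) do { } while (0)
	#endif

The do { } while (0) body keeps the stub behaving like a single
statement inside unbraced if/else arms.]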
diff --git a/drivers/gpu/drm/i915/display/intel_pch_refclk.h b/drivers/gpu/drm/i915/display/intel_pch_refclk.h
index 9bcf56629f24..aa4f6e0b1127 100644
--- a/drivers/gpu/drm/i915/display/intel_pch_refclk.h
+++ b/drivers/gpu/drm/i915/display/intel_pch_refclk.h
@@ -11,6 +11,7 @@
 struct drm_i915_private;
 struct intel_crtc_state;
 
+#ifdef I915
 void lpt_program_iclkip(const struct intel_crtc_state *crtc_state);
 void lpt_disable_iclkip(struct drm_i915_private *dev_priv);
 int lpt_get_iclkip(struct drm_i915_private *dev_priv);
@@ -18,5 +19,12 @@ int lpt_iclkip(const struct intel_crtc_state *crtc_state);
 
 void intel_init_pch_refclk(struct drm_i915_private *dev_priv);
 void lpt_disable_clkout_dp(struct drm_i915_private *dev_priv);
+#else
+#define lpt_program_iclkip(cstate) do { } while (0)
+#define lpt_disable_iclkip(xe) do { } while (0)
+#define lpt_get_iclkip(xe) (WARN_ON(-ENODEV))
+#define intel_init_pch_refclk(xe) do { } while (0)
+#define lpt_disable_clkout_dp(xe) do { } while (0)
+#endif
 
 #endif
diff --git a/drivers/gpu/drm/i915/display/intel_pipe_crc.c b/drivers/gpu/drm/i915/display/intel_pipe_crc.c
index e9774670e3f6..a4b7a8ec3720 100644
--- a/drivers/gpu/drm/i915/display/intel_pipe_crc.c
+++ b/drivers/gpu/drm/i915/display/intel_pipe_crc.c
@@ -34,6 +34,7 @@
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_pipe_crc.h"
+#include "i915_irq.h"
 
 static const char * const pipe_crc_sources[] = {
 	[INTEL_PIPE_CRC_SOURCE_NONE] = "none",
diff --git a/drivers/gpu/drm/i915/display/intel_sprite.c b/drivers/gpu/drm/i915/display/intel_sprite.c
index e6b4d24b9cd0..000561eadcc1 100644
--- a/drivers/gpu/drm/i915/display/intel_sprite.c
+++ b/drivers/gpu/drm/i915/display/intel_sprite.c
@@ -43,8 +43,10 @@
 
 #include "i915_drv.h"
 #include "i915_reg.h"
+#ifdef I915
 #include "i915_vgpu.h"
 #include "i9xx_plane.h"
+#endif
 #include "intel_atomic_plane.h"
 #include "intel_crtc.h"
 #include "intel_de.h"
@@ -112,6 +114,7 @@ int intel_plane_check_src_coordinates(struct intel_plane_state *plane_state)
 	return 0;
 }
 
+#ifdef I915
 static void i9xx_plane_linear_gamma(u16 gamma[8])
 {
 	/* The points are not evenly spaced. */
@@ -325,6 +328,7 @@ static u32 vlv_sprite_ctl_crtc(const struct intel_crtc_state *crtc_state)
 static u32 vlv_sprite_ctl(const struct intel_crtc_state *crtc_state,
 			  const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	const struct drm_framebuffer *fb = plane_state->hw.fb;
 	unsigned int rotation = plane_state->hw.rotation;
 	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
@@ -396,6 +400,9 @@ static u32 vlv_sprite_ctl(const struct intel_crtc_state *crtc_state,
 		sprctl |= SP_SOURCE_KEY;
 
 	return sprctl;
+#else
+	return 0;
+#endif
 }
 
 static void vlv_sprite_update_gamma(const struct intel_plane_state *plane_state)
@@ -447,6 +454,7 @@ vlv_sprite_update_arm(struct intel_plane *plane,
 		      const struct intel_crtc_state *crtc_state,
 		      const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	struct drm_i915_private *dev_priv = to_i915(plane->base.dev);
 	enum pipe pipe = plane->pipe;
 	enum plane_id plane_id = plane->id;
@@ -486,6 +494,7 @@ vlv_sprite_update_arm(struct intel_plane *plane,
 	intel_de_write_fw(dev_priv, SPCNTR(pipe, plane_id), sprctl);
 	intel_de_write_fw(dev_priv, SPSURF(pipe, plane_id),
 			  intel_plane_ggtt_offset(plane_state) + sprsurf_offset);
+#endif
 
 	vlv_sprite_update_clrc(plane_state);
 	vlv_sprite_update_gamma(plane_state);
@@ -711,6 +720,7 @@ static bool ivb_need_sprite_gamma(const struct intel_plane_state *plane_state)
 static u32 ivb_sprite_ctl(const struct intel_crtc_state *crtc_state,
 			  const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	struct drm_i915_private *dev_priv =
 		to_i915(plane_state->uapi.plane->dev);
 	const struct drm_framebuffer *fb = plane_state->hw.fb;
@@ -780,6 +790,9 @@ static u32 ivb_sprite_ctl(const struct intel_crtc_state *crtc_state,
 		sprctl |= SPRITE_SOURCE_KEY;
 
 	return sprctl;
+#else
+	return 0;
+#endif
 }
 
 static void ivb_sprite_linear_gamma(const struct intel_plane_state *plane_state,
@@ -1723,10 +1736,13 @@ static const struct drm_plane_funcs vlv_sprite_funcs = {
 	.format_mod_supported = vlv_sprite_format_mod_supported,
 };
 
+#endif
+
 struct intel_plane *
 intel_sprite_plane_create(struct drm_i915_private *dev_priv,
 			  enum pipe pipe, int sprite)
 {
+#ifdef I915
 	struct intel_plane *plane;
 	const struct drm_plane_funcs *plane_funcs;
 	unsigned int supported_rotations;
@@ -1846,4 +1862,8 @@ intel_sprite_plane_create(struct drm_i915_private *dev_priv,
 	intel_plane_free(plane);
 
 	return ERR_PTR(ret);
+#else
+	BUG_ON(1);
+	return ERR_PTR(-ENODEV);
+#endif
 }
diff --git a/drivers/gpu/drm/i915/display/intel_vbt_defs.h b/drivers/gpu/drm/i915/display/intel_vbt_defs.h
index a9f44abfc9fc..001de8fe4e64 100644
--- a/drivers/gpu/drm/i915/display/intel_vbt_defs.h
+++ b/drivers/gpu/drm/i915/display/intel_vbt_defs.h
@@ -30,7 +30,7 @@
  *
  * Please do NOT include anywhere else.
  */
-#ifndef _INTEL_BIOS_PRIVATE
+#if !defined(_INTEL_BIOS_PRIVATE) && !defined(HDRTEST)
 #error "intel_vbt_defs.h is private to intel_bios.c"
 #endif
 
diff --git a/drivers/gpu/drm/i915/display/intel_vga.c b/drivers/gpu/drm/i915/display/intel_vga.c
index a69bfcac9a94..b15dcc84ae8c 100644
--- a/drivers/gpu/drm/i915/display/intel_vga.c
+++ b/drivers/gpu/drm/i915/display/intel_vga.c
@@ -101,6 +101,7 @@ void intel_vga_reset_io_mem(struct drm_i915_private *i915)
 static int
 intel_vga_set_state(struct drm_i915_private *i915, bool enable_decode)
 {
+#ifdef I915
 	unsigned int reg = DISPLAY_VER(i915) >= 6 ? SNB_GMCH_CTRL : INTEL_GMCH_CTRL;
 	u16 gmch_ctrl;
 
@@ -123,6 +124,10 @@ intel_vga_set_state(struct drm_i915_private *i915, bool enable_decode)
 	}
 
 	return 0;
+#else
+	/* Only works on some machines because bios forgets to lock the reg. */
+	return -EIO;
+#endif
 }
 
 static unsigned int
diff --git a/drivers/gpu/drm/i915/display/skl_scaler.c b/drivers/gpu/drm/i915/display/skl_scaler.c
index d7390067b7d4..8c918734416d 100644
--- a/drivers/gpu/drm/i915/display/skl_scaler.c
+++ b/drivers/gpu/drm/i915/display/skl_scaler.c
@@ -244,6 +244,7 @@ int skl_update_scaler_plane(struct intel_crtc_state *crtc_state,
 	if (ret || plane_state->scaler_id < 0)
 		return ret;
 
+#ifdef I915
 	/* check colorkey */
 	if (plane_state->ckey.flags) {
 		drm_dbg_kms(&dev_priv->drm,
@@ -252,6 +253,7 @@ int skl_update_scaler_plane(struct intel_crtc_state *crtc_state,
 			    intel_plane->base.name);
 		return -EINVAL;
 	}
+#endif
 
 	/* Check src format */
 	switch (fb->format->format) {
diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c b/drivers/gpu/drm/i915/display/skl_universal_plane.c
index 2f5524f380b0..b08aa1a06784 100644
--- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
+++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
@@ -22,7 +22,11 @@
 #include "skl_scaler.h"
 #include "skl_universal_plane.h"
 #include "skl_watermark.h"
+#ifdef I915
 #include "pxp/intel_pxp.h"
+#else
+// TODO: pxp?
+#endif
 
 static const u32 skl_plane_formats[] = {
 	DRM_FORMAT_C8,
@@ -895,7 +899,9 @@ static u32 skl_plane_ctl(const struct intel_crtc_state *crtc_state,
 		to_i915(plane_state->uapi.plane->dev);
 	const struct drm_framebuffer *fb = plane_state->hw.fb;
 	unsigned int rotation = plane_state->hw.rotation;
+#ifdef I915
 	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
+#endif
 	u32 plane_ctl;
 
 	plane_ctl = PLANE_CTL_ENABLE;
@@ -919,10 +925,12 @@ static u32 skl_plane_ctl(const struct intel_crtc_state *crtc_state,
 		plane_ctl |= icl_plane_ctl_flip(rotation &
 						DRM_MODE_REFLECT_MASK);
 
+#ifdef I915
 	if (key->flags & I915_SET_COLORKEY_DESTINATION)
 		plane_ctl |= PLANE_CTL_KEY_ENABLE_DESTINATION;
 	else if (key->flags & I915_SET_COLORKEY_SOURCE)
 		plane_ctl |= PLANE_CTL_KEY_ENABLE_SOURCE;
+#endif
 
 	/* Wa_22012358565:adl-p */
 	if (DISPLAY_VER(dev_priv) == 13)
@@ -999,9 +1007,13 @@ static u32 skl_surf_address(const struct intel_plane_state *plane_state,
 		 * The DPT object contains only one vma, so the VMA's offset
 		 * within the DPT is always 0.
 		 */
-		drm_WARN_ON(&i915->drm, plane_state->dpt_vma->node.start);
 		drm_WARN_ON(&i915->drm, offset & 0x1fffff);
+#ifdef I915
+		drm_WARN_ON(&i915->drm, plane_state->dpt_vma->node.start);
 		return offset >> 9;
+#else
+		return 0;
+#endif
 	} else {
 		drm_WARN_ON(&i915->drm, offset & 0xfff);
 		return offset;
@@ -1044,26 +1056,35 @@ static u32 skl_plane_aux_dist(const struct intel_plane_state *plane_state,
 
 static u32 skl_plane_keyval(const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 
 	return key->min_value;
+#else
+	return 0;
+#endif
 }
 
 static u32 skl_plane_keymax(const struct intel_plane_state *plane_state)
 {
-	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 	u8 alpha = plane_state->hw.alpha >> 8;
+#ifdef I915
+	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 
 	return (key->max_value & 0xffffff) | PLANE_KEYMAX_ALPHA(alpha);
+#else
+	return PLANE_KEYMAX_ALPHA(alpha);
+#endif
 }
 
 static u32 skl_plane_keymsk(const struct intel_plane_state *plane_state)
 {
-	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 	u8 alpha = plane_state->hw.alpha >> 8;
-	u32 keymsk;
-
+	u32 keymsk = 0;
+#ifdef I915
+	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 	keymsk = key->channel_mask & 0x7ffffff;
+#endif
 	if (alpha < 0xff)
 		keymsk |= PLANE_KEYMSK_ALPHA_ENABLE;
 
@@ -1319,7 +1340,7 @@ skl_plane_async_flip(struct intel_plane *plane,
 			  skl_plane_surf(plane_state, 0));
 }
 
-static bool intel_format_is_p01x(u32 format)
+static inline bool intel_format_is_p01x(u32 format)
 {
 	switch (format) {
 	case DRM_FORMAT_P010:
@@ -1402,6 +1423,7 @@ static int skl_plane_check_fb(const struct intel_crtc_state *crtc_state,
 		return -EINVAL;
 	}
 
+#ifdef I915
 	/* Wa_1606054188:tgl,adl-s */
 	if ((IS_ALDERLAKE_S(dev_priv) || IS_TIGERLAKE(dev_priv)) &&
 	    plane_state->ckey.flags & I915_SET_COLORKEY_SOURCE &&
@@ -1410,6 +1432,7 @@ static int skl_plane_check_fb(const struct intel_crtc_state *crtc_state,
 			    "Source color keying not supported with P01x formats\n");
 		return -EINVAL;
 	}
+#endif
 
 	return 0;
 }
@@ -1847,9 +1870,14 @@ static bool skl_fb_scalable(const struct drm_framebuffer *fb)
 
 static bool bo_has_valid_encryption(struct drm_i915_gem_object *obj)
 {
+#ifdef I915
 	struct drm_i915_private *i915 = to_i915(obj->base.dev);
 
 	return intel_pxp_key_check(i915->pxp, obj, false) == 0;
+#else
+#define i915_gem_object_is_protected(x) ((x) && 0)
+	return false;
+#endif
 }
 
 static bool pxp_is_borked(struct drm_i915_gem_object *obj)
@@ -1872,7 +1900,12 @@ static int skl_plane_check(struct intel_crtc_state *crtc_state,
 		return ret;
 
 	/* use scaler when colorkey is not required */
-	if (!plane_state->ckey.flags && skl_fb_scalable(fb)) {
+#ifdef I915
+	if (!plane_state->ckey.flags && skl_fb_scalable(fb))
+#else
+	if (skl_fb_scalable(fb))
+#endif
+	{
 		min_scale = 1;
 		max_scale = skl_plane_max_scale(dev_priv, fb);
 	}
@@ -2435,11 +2468,15 @@ skl_get_initial_plane_config(struct intel_crtc *crtc,
 		fb->modifier = DRM_FORMAT_MOD_LINEAR;
 		break;
 	case PLANE_CTL_TILED_X:
+#ifdef I915
 		plane_config->tiling = I915_TILING_X;
+#endif
 		fb->modifier = I915_FORMAT_MOD_X_TILED;
 		break;
 	case PLANE_CTL_TILED_Y:
+#ifdef I915
 		plane_config->tiling = I915_TILING_Y;
+#endif
 		if (val & PLANE_CTL_RENDER_DECOMPRESSION_ENABLE)
 			if (DISPLAY_VER(dev_priv) >= 12)
 				fb->modifier = I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS;
diff --git a/drivers/gpu/drm/i915/display/skl_watermark.c b/drivers/gpu/drm/i915/display/skl_watermark.c
index e254fb21b47f..381d4f75e7c8 100644
--- a/drivers/gpu/drm/i915/display/skl_watermark.c
+++ b/drivers/gpu/drm/i915/display/skl_watermark.c
@@ -16,7 +16,7 @@
 #include "skl_watermark.h"
 
 #include "i915_drv.h"
-#include "i915_fixed.h"
+#include "../i915/i915_fixed.h"
 #include "i915_reg.h"
 #include "intel_pm.h"
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index f8eb807b56f9..3b9e20dd6039 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -7,8 +7,9 @@
 #define __INTEL_GT_REGS__
 
 #include "i915_reg_defs.h"
+#ifdef I915
 #include "display/intel_display_reg_defs.h"	/* VLV_DISPLAY_BASE */
-
+#endif
 #define MCR_REG(offset)	((const i915_mcr_reg_t){ .reg = (offset) })
 
 /*
diff --git a/drivers/gpu/drm/i915/i915_reg_defs.h b/drivers/gpu/drm/i915/i915_reg_defs.h
index be43580a6979..1e3966609844 100644
--- a/drivers/gpu/drm/i915/i915_reg_defs.h
+++ b/drivers/gpu/drm/i915/i915_reg_defs.h
@@ -132,9 +132,13 @@ typedef struct {
 
 #define _MMIO(r) ((const i915_reg_t){ .reg = (r) })
 
+#ifdef I915
 typedef struct {
 	u32 reg;
 } i915_mcr_reg_t;
+#else
+#define i915_mcr_reg_t i915_reg_t
+#endif
 
 #define INVALID_MMIO_REG _MMIO(0)
 
@@ -143,8 +147,12 @@ typedef struct {
  * simply operations on the register's offset and don't care about the MCR vs
  * non-MCR nature of the register.
  */
+#ifdef I915
 #define i915_mmio_reg_offset(r) \
 	_Generic((r), i915_reg_t: (r).reg, i915_mcr_reg_t: (r).reg)
+#else
+#define i915_mmio_reg_offset(r) ((r).reg)
+#endif
 #define i915_mmio_reg_equal(a, b) (i915_mmio_reg_offset(a) == i915_mmio_reg_offset(b))
 #define i915_mmio_reg_valid(r) (!i915_mmio_reg_equal(r, INVALID_MMIO_REG))
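
[Note: both _Generic arms expand to the same (r).reg; the selection only
exists so that i915, which has two distinct register struct types,
rejects any other argument type at compile time. With xe aliasing
i915_mcr_reg_t to i915_reg_t, a plain member access is enough. For
reference, C11 _Generic selects by the static type of its operand, and
without a default arm anything else is a compile error (hypothetical
types for illustration):

	typedef struct { u32 reg; } reg_a_t;
	typedef struct { u32 reg; } reg_b_t;

	/* accepts exactly reg_a_t or reg_b_t, nothing else */
	#define reg_offset(r) _Generic((r), \
		reg_a_t: (r).reg,           \
		reg_b_t: (r).reg)
]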
 
-- 
2.37.3



* [Intel-gfx] [RFC PATCH 18/20] drm/i915/display: Remaining changes to make xe compile
@ 2022-12-22 22:21   ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Xe, the new Intel GPU driver, will re-use the i915 display.

At least for now, the plan is to use symbolic links and
adjust the build so we are building the display either for
i915 or for xe.

The display code can be split out if needed, and its compilation is
optional at this time.
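
The mechanism, visible throughout the diff below, is a single
compile-time switch: the shared display sources test an I915 macro,
which is (presumably via the build adjustments made elsewhere in this
series) defined when the code is compiled into i915 and left undefined
when it is compiled into xe. Schematically:

	#ifdef I915		/* building as part of i915 */
	#include "gem/i915_gem_internal.h"
	#else			/* building as part of xe */
	#include "xe_bo.h"
	#endif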

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Rodrigo changed i915_reg_defs.h, commit msg, and rebased]
---
 drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
 .../gpu/drm/i915/display/intel_atomic_plane.c |  25 ++-
 .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
 drivers/gpu/drm/i915/display/intel_bw.c       |   2 +-
 drivers/gpu/drm/i915/display/intel_cdclk.c    |  23 ++-
 drivers/gpu/drm/i915/display/intel_color.c    |   1 +
 drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
 drivers/gpu/drm/i915/display/intel_cursor.c   |   8 +-
 drivers/gpu/drm/i915/display/intel_display.c  | 150 ++++++++++++++++--
 drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
 .../gpu/drm/i915/display/intel_display_core.h |   5 +-
 .../drm/i915/display/intel_display_debugfs.c  |   8 +
 .../drm/i915/display/intel_display_power.c    |  35 ++--
 .../drm/i915/display/intel_display_power.h    |   5 +
 .../i915/display/intel_display_power_map.c    |   7 +
 .../i915/display/intel_display_power_well.c   |  17 +-
 .../drm/i915/display/intel_display_trace.h    |   6 +
 .../drm/i915/display/intel_display_types.h    |  24 ++-
 drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
 drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
 drivers/gpu/drm/i915/display/intel_dp_aux.c   |   4 +
 drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 ++
 drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
 drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
 drivers/gpu/drm/i915/display/intel_dsb.c      | 124 ++++++++++++---
 drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 ++-
 drivers/gpu/drm/i915/display/intel_fb.c       |  96 +++++++++--
 drivers/gpu/drm/i915/display/intel_fbc.c      |  49 +++++-
 drivers/gpu/drm/i915/display/intel_fbdev.c    | 101 ++++++++++--
 drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
 .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
 .../drm/i915/display/intel_modeset_setup.c    |  11 +-
 drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
 .../gpu/drm/i915/display/intel_pch_display.h  |  16 ++
 .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
 drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
 drivers/gpu/drm/i915/display/intel_sprite.c   |  21 +++
 drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
 drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
 drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
 .../drm/i915/display/skl_universal_plane.c    |  51 +++++-
 drivers/gpu/drm/i915/display/skl_watermark.c  |   2 +-
 drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
 drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
 44 files changed, 811 insertions(+), 129 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_atomic.c b/drivers/gpu/drm/i915/display/intel_atomic.c
index 6621aa245caf..56875afa592f 100644
--- a/drivers/gpu/drm/i915/display/intel_atomic.c
+++ b/drivers/gpu/drm/i915/display/intel_atomic.c
@@ -522,7 +522,9 @@ void intel_atomic_state_free(struct drm_atomic_state *_state)
 	drm_atomic_state_default_release(&state->base);
 	kfree(state->global_objs);
 
+#ifdef I915
 	i915_sw_fence_fini(&state->commit_ready);
+#endif
 
 	kfree(state);
 }
diff --git a/drivers/gpu/drm/i915/display/intel_atomic_plane.c b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
index 10e1fc9d0698..acb32396e73c 100644
--- a/drivers/gpu/drm/i915/display/intel_atomic_plane.c
+++ b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
@@ -34,7 +34,9 @@
 #include <drm/drm_atomic_helper.h>
 #include <drm/drm_fourcc.h>
 
+#ifdef I915
 #include "gt/intel_rps.h"
+#endif
 
 #include "intel_atomic_plane.h"
 #include "intel_cdclk.h"
@@ -107,7 +109,9 @@ intel_plane_duplicate_state(struct drm_plane *plane)
 	__drm_atomic_helper_plane_duplicate_state(plane, &intel_state->uapi);
 
 	intel_state->ggtt_vma = NULL;
+#ifdef I915
 	intel_state->dpt_vma = NULL;
+#endif
 	intel_state->flags = 0;
 
 	/* add reference to fb */
@@ -132,7 +136,9 @@ intel_plane_destroy_state(struct drm_plane *plane,
 	struct intel_plane_state *plane_state = to_intel_plane_state(state);
 
 	drm_WARN_ON(plane->dev, plane_state->ggtt_vma);
+#ifdef I915
 	drm_WARN_ON(plane->dev, plane_state->dpt_vma);
+#endif
 
 	__drm_atomic_helper_plane_destroy_state(&plane_state->uapi);
 	if (plane_state->hw.fb)
@@ -937,6 +943,7 @@ int intel_atomic_plane_check_clipping(struct intel_plane_state *plane_state,
 	return 0;
 }
 
+#ifdef I915
 struct wait_rps_boost {
 	struct wait_queue_entry wait;
 
@@ -994,6 +1001,7 @@ static void add_rps_boost_after_vblank(struct drm_crtc *crtc,
 
 	add_wait_queue(drm_crtc_vblank_waitqueue(crtc), &wait->wait);
 }
+#endif
 
 /**
  * intel_prepare_plane_fb - Prepare fb for usage on plane
@@ -1011,10 +1019,11 @@ static int
 intel_prepare_plane_fb(struct drm_plane *_plane,
 		       struct drm_plane_state *_new_plane_state)
 {
-	struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
-	struct intel_plane *plane = to_intel_plane(_plane);
 	struct intel_plane_state *new_plane_state =
 		to_intel_plane_state(_new_plane_state);
+#ifdef I915
+	struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
+	struct intel_plane *plane = to_intel_plane(_plane);
 	struct intel_atomic_state *state =
 		to_intel_atomic_state(new_plane_state->uapi.state);
 	struct drm_i915_private *dev_priv = to_i915(plane->base.dev);
@@ -1113,6 +1122,12 @@ intel_prepare_plane_fb(struct drm_plane *_plane,
 	intel_plane_unpin_fb(new_plane_state);
 
 	return ret;
+#else
+	if (!intel_fb_obj(new_plane_state->hw.fb))
+		return 0;
+
+	return intel_plane_pin_fb(new_plane_state);
+#endif
 }
 
 /**
@@ -1128,18 +1143,20 @@ intel_cleanup_plane_fb(struct drm_plane *plane,
 {
 	struct intel_plane_state *old_plane_state =
 		to_intel_plane_state(_old_plane_state);
-	struct intel_atomic_state *state =
+	__maybe_unused struct intel_atomic_state *state =
 		to_intel_atomic_state(old_plane_state->uapi.state);
-	struct drm_i915_private *dev_priv = to_i915(plane->dev);
+	__maybe_unused struct drm_i915_private *dev_priv = to_i915(plane->dev);
 	struct drm_i915_gem_object *obj = intel_fb_obj(old_plane_state->hw.fb);
 
 	if (!obj)
 		return;
 
+#ifdef I915
 	if (state->rps_interactive) {
 		intel_rps_mark_interactive(&to_gt(dev_priv)->rps, false);
 		state->rps_interactive = false;
 	}
+#endif
 
 	/* Should only be called after a successful intel_prepare_plane_fb()! */
 	intel_plane_unpin_fb(old_plane_state);
diff --git a/drivers/gpu/drm/i915/display/intel_backlight.c b/drivers/gpu/drm/i915/display/intel_backlight.c
index 5b7da72c95b8..e63eb43622e0 100644
--- a/drivers/gpu/drm/i915/display/intel_backlight.c
+++ b/drivers/gpu/drm/i915/display/intel_backlight.c
@@ -19,7 +19,7 @@
 #include "intel_dp_aux_backlight.h"
 #include "intel_dsi_dcs_backlight.h"
 #include "intel_panel.h"
-#include "intel_pci_config.h"
+#include "../i915/intel_pci_config.h"
 #include "intel_pps.h"
 #include "intel_quirks.h"
 
diff --git a/drivers/gpu/drm/i915/display/intel_bw.c b/drivers/gpu/drm/i915/display/intel_bw.c
index 54e03a3eaa0f..67b4e947589c 100644
--- a/drivers/gpu/drm/i915/display/intel_bw.c
+++ b/drivers/gpu/drm/i915/display/intel_bw.c
@@ -15,7 +15,7 @@
 #include "intel_display_core.h"
 #include "intel_display_types.h"
 #include "skl_watermark.h"
-#include "intel_mchbar_regs.h"
+#include "../i915/intel_mchbar_regs.h"
 
 /* Parameters for Qclk Geyserville (QGV) */
 struct intel_qgv_point {
diff --git a/drivers/gpu/drm/i915/display/intel_cdclk.c b/drivers/gpu/drm/i915/display/intel_cdclk.c
index 80e2db6b5ea4..3b6a37403f25 100644
--- a/drivers/gpu/drm/i915/display/intel_cdclk.c
+++ b/drivers/gpu/drm/i915/display/intel_cdclk.c
@@ -23,7 +23,6 @@
 
 #include <linux/time.h>
 
-#include "hsw_ips.h"
 #include "i915_reg.h"
 #include "intel_atomic.h"
 #include "intel_atomic_plane.h"
@@ -33,10 +32,14 @@
 #include "intel_crtc.h"
 #include "intel_de.h"
 #include "intel_display_types.h"
-#include "intel_mchbar_regs.h"
-#include "intel_pci_config.h"
+#include "../i915/intel_mchbar_regs.h"
+#include "../i915/intel_pci_config.h"
 #include "intel_psr.h"
+
+#ifdef I915
+#include "hsw_ips.h"
 #include "vlv_sideband.h"
+#endif
 
 /**
  * DOC: CDCLK / RAWCLK
@@ -474,6 +477,7 @@ static void hsw_get_cdclk(struct drm_i915_private *dev_priv,
 		cdclk_config->cdclk = 540000;
 }
 
+#ifdef I915
 static int vlv_calc_cdclk(struct drm_i915_private *dev_priv, int min_cdclk)
 {
 	int freq_320 = (dev_priv->hpll_freq <<  1) % 320000 != 0 ?
@@ -712,6 +716,7 @@ static void chv_set_cdclk(struct drm_i915_private *dev_priv,
 
 	intel_display_power_put(dev_priv, POWER_DOMAIN_DISPLAY_CORE, wakeref);
 }
+#endif
 
 static int bdw_calc_cdclk(int min_cdclk)
 {
@@ -2375,9 +2380,11 @@ int intel_crtc_compute_min_cdclk(const struct intel_crtc_state *crtc_state)
 
 	min_cdclk = intel_pixel_rate_to_cdclk(crtc_state);
 
+#ifdef I915
 	/* pixel rate mustn't exceed 95% of cdclk with IPS on BDW */
 	if (IS_BROADWELL(dev_priv) && hsw_crtc_state_ips_capable(crtc_state))
 		min_cdclk = DIV_ROUND_UP(min_cdclk * 100, 95);
+#endif
 
 	/* BSpec says "Do not use DisplayPort with CDCLK less than 432 MHz,
 	 * audio enabled, port width x4, and link rate HBR2 (5.4 GHz), or else
@@ -2571,6 +2578,7 @@ static int bxt_compute_min_voltage_level(struct intel_cdclk_state *cdclk_state)
 	return min_voltage_level;
 }
 
+#ifdef I915
 static int vlv_modeset_calc_cdclk(struct intel_cdclk_state *cdclk_state)
 {
 	struct intel_atomic_state *state = cdclk_state->base.state;
@@ -2599,6 +2607,7 @@ static int vlv_modeset_calc_cdclk(struct intel_cdclk_state *cdclk_state)
 
 	return 0;
 }
+#endif
 
 static int bdw_modeset_calc_cdclk(struct intel_cdclk_state *cdclk_state)
 {
@@ -3101,12 +3110,14 @@ static int pch_rawclk(struct drm_i915_private *dev_priv)
 	return (intel_de_read(dev_priv, PCH_RAWCLK_FREQ) & RAWCLK_FREQ_MASK) * 1000;
 }
 
+#ifdef I915
 static int vlv_hrawclk(struct drm_i915_private *dev_priv)
 {
 	/* RAWCLK_FREQ_VLV register updated from power well code */
 	return vlv_get_cck_clock_hpll(dev_priv, "hrawclk",
 				      CCK_DISPLAY_REF_CLOCK_CONTROL);
 }
+#endif
 
 static int i9xx_hrawclk(struct drm_i915_private *dev_priv)
 {
@@ -3188,8 +3199,10 @@ u32 intel_read_rawclk(struct drm_i915_private *dev_priv)
 		freq = cnp_rawclk(dev_priv);
 	else if (HAS_PCH_SPLIT(dev_priv))
 		freq = pch_rawclk(dev_priv);
+#ifdef I915
 	else if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv))
 		freq = vlv_hrawclk(dev_priv);
+#endif
 	else if (DISPLAY_VER(dev_priv) >= 3)
 		freq = i9xx_hrawclk(dev_priv);
 	else
@@ -3246,6 +3259,7 @@ static const struct intel_cdclk_funcs bdw_cdclk_funcs = {
 	.modeset_calc_cdclk = bdw_modeset_calc_cdclk,
 };
 
+#ifdef I915
 static const struct intel_cdclk_funcs chv_cdclk_funcs = {
 	.get_cdclk = vlv_get_cdclk,
 	.set_cdclk = chv_set_cdclk,
@@ -3257,6 +3271,7 @@ static const struct intel_cdclk_funcs vlv_cdclk_funcs = {
 	.set_cdclk = vlv_set_cdclk,
 	.modeset_calc_cdclk = vlv_modeset_calc_cdclk,
 };
+#endif
 
 static const struct intel_cdclk_funcs hsw_cdclk_funcs = {
 	.get_cdclk = hsw_get_cdclk,
@@ -3378,10 +3393,12 @@ void intel_init_cdclk_hooks(struct drm_i915_private *dev_priv)
 		dev_priv->display.funcs.cdclk = &bdw_cdclk_funcs;
 	} else if (IS_HASWELL(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &hsw_cdclk_funcs;
+#ifdef I915
 	} else if (IS_CHERRYVIEW(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &chv_cdclk_funcs;
 	} else if (IS_VALLEYVIEW(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &vlv_cdclk_funcs;
+#endif
 	} else if (IS_SANDYBRIDGE(dev_priv) || IS_IVYBRIDGE(dev_priv)) {
 		dev_priv->display.funcs.cdclk = &fixed_400mhz_cdclk_funcs;
 	} else if (IS_IRONLAKE(dev_priv)) {
diff --git a/drivers/gpu/drm/i915/display/intel_color.c b/drivers/gpu/drm/i915/display/intel_color.c
index d57631b0bb9a..22f42ec3ee03 100644
--- a/drivers/gpu/drm/i915/display/intel_color.c
+++ b/drivers/gpu/drm/i915/display/intel_color.c
@@ -26,6 +26,7 @@
 #include "intel_color.h"
 #include "intel_de.h"
 #include "intel_display_types.h"
+#include "intel_dpll.h"
 #include "intel_dsb.h"
 
 struct intel_color_funcs {
diff --git a/drivers/gpu/drm/i915/display/intel_crtc.c b/drivers/gpu/drm/i915/display/intel_crtc.c
index 037fc140b585..5214bfe86a13 100644
--- a/drivers/gpu/drm/i915/display/intel_crtc.c
+++ b/drivers/gpu/drm/i915/display/intel_crtc.c
@@ -12,8 +12,10 @@
 #include <drm/drm_vblank_work.h>
 
 #include "i915_irq.h"
+#ifdef I915
 #include "i915_vgpu.h"
 #include "i9xx_plane.h"
+#endif
 #include "icl_dsi.h"
 #include "intel_atomic.h"
 #include "intel_atomic_plane.h"
@@ -306,7 +308,11 @@ int intel_crtc_init(struct drm_i915_private *dev_priv, enum pipe pipe)
 		primary = skl_universal_plane_create(dev_priv, pipe,
 						     PLANE_PRIMARY);
 	else
+#ifdef I915
 		primary = intel_primary_plane_create(dev_priv, pipe);
+#else
+		BUG();
+#endif
 	if (IS_ERR(primary)) {
 		ret = PTR_ERR(primary);
 		goto fail;
@@ -655,13 +661,15 @@ void intel_pipe_update_end(struct intel_crtc_state *new_crtc_state)
 					 drm_crtc_accurate_vblank_count(&crtc->base) + 1,
 					 false);
 	} else if (new_crtc_state->uapi.event) {
+		unsigned long flags;
+
 		drm_WARN_ON(&dev_priv->drm,
 			    drm_crtc_vblank_get(&crtc->base) != 0);
 
-		spin_lock(&crtc->base.dev->event_lock);
+		spin_lock_irqsave(&crtc->base.dev->event_lock, flags);
 		drm_crtc_arm_vblank_event(&crtc->base,
 					  new_crtc_state->uapi.event);
-		spin_unlock(&crtc->base.dev->event_lock);
+		spin_unlock_irqrestore(&crtc->base.dev->event_lock, flags);
 
 		new_crtc_state->uapi.event = NULL;
 	}
@@ -684,8 +692,10 @@ void intel_pipe_update_end(struct intel_crtc_state *new_crtc_state)
 
 	local_irq_enable();
 
+#ifdef I915
 	if (intel_vgpu_active(dev_priv))
 		return;
+#endif
 
 	if (crtc->debug.start_vbl_count &&
 	    crtc->debug.start_vbl_count != end_vbl_count) {
diff --git a/drivers/gpu/drm/i915/display/intel_cursor.c b/drivers/gpu/drm/i915/display/intel_cursor.c
index 371009f8e194..5bdd66e66202 100644
--- a/drivers/gpu/drm/i915/display/intel_cursor.c
+++ b/drivers/gpu/drm/i915/display/intel_cursor.c
@@ -31,15 +31,15 @@ static const u32 intel_cursor_formats[] = {
 
 static u32 intel_cursor_base(const struct intel_plane_state *plane_state)
 {
-	struct drm_i915_private *dev_priv =
+	__maybe_unused struct drm_i915_private *dev_priv =
 		to_i915(plane_state->uapi.plane->dev);
-	const struct drm_framebuffer *fb = plane_state->hw.fb;
-	const struct drm_i915_gem_object *obj = intel_fb_obj(fb);
 	u32 base;
 
+#ifdef I915
 	if (INTEL_INFO(dev_priv)->display.cursor_needs_physical)
-		base = sg_dma_address(obj->mm.pages->sgl);
+		base = sg_dma_address(intel_fb_obj(plane_state->hw.fb)->mm.pages->sgl);
 	else
+#endif
 		base = intel_plane_ggtt_offset(plane_state);
 
 	return base + plane_state->view.color_plane[0].offset;
diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
index ef9bab4043ee..5a0a8179b0dc 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -46,7 +46,7 @@
 #include <drm/drm_rect.h>
 
 #include "display/intel_audio.h"
-#include "display/intel_crt.h"
+#include "display/intel_backlight.h"
 #include "display/intel_ddi.h"
 #include "display/intel_display_debugfs.h"
 #include "display/intel_display_power.h"
@@ -55,24 +55,36 @@
 #include "display/intel_dpll.h"
 #include "display/intel_dpll_mgr.h"
 #include "display/intel_drrs.h"
+#include "display/intel_dsb.h"
 #include "display/intel_dsi.h"
-#include "display/intel_dvo.h"
 #include "display/intel_fb.h"
 #include "display/intel_gmbus.h"
 #include "display/intel_hdmi.h"
 #include "display/intel_lvds.h"
-#include "display/intel_sdvo.h"
 #include "display/intel_snps_phy.h"
-#include "display/intel_tv.h"
 #include "display/intel_vdsc.h"
 #include "display/intel_vrr.h"
 
+#ifdef I915
+#include "display/intel_crt.h"
+#include "display/intel_dvo.h"
+#include "display/intel_overlay.h"
+#include "display/intel_sdvo.h"
+#include "display/intel_tv.h"
+
 #include "gem/i915_gem_lmem.h"
 #include "gem/i915_gem_object.h"
 
 #include "g4x_dp.h"
 #include "g4x_hdmi.h"
 #include "hsw_ips.h"
+#include "i9xx_plane.h"
+#include "vlv_dsi.h"
+#include "vlv_dsi_pll.h"
+#include "vlv_dsi_regs.h"
+#include "vlv_sideband.h"
+#endif
+
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "i915_utils.h"
@@ -101,7 +113,6 @@
 #include "intel_hti.h"
 #include "intel_modeset_verify.h"
 #include "intel_modeset_setup.h"
-#include "intel_overlay.h"
 #include "intel_panel.h"
 #include "intel_pch_display.h"
 #include "intel_pch_refclk.h"
@@ -114,14 +125,16 @@
 #include "intel_sprite.h"
 #include "intel_tc.h"
 #include "intel_vga.h"
-#include "i9xx_plane.h"
 #include "skl_scaler.h"
 #include "skl_universal_plane.h"
 #include "skl_watermark.h"
+
+#ifdef I915
 #include "vlv_dsi.h"
 #include "vlv_dsi_pll.h"
 #include "vlv_dsi_regs.h"
 #include "vlv_sideband.h"
+#endif
 
 static void intel_set_transcoder_timings(const struct intel_crtc_state *crtc_state);
 static void intel_set_pipe_src_size(const struct intel_crtc_state *crtc_state);
@@ -224,6 +237,7 @@ static int intel_compute_global_watermarks(struct intel_atomic_state *state)
 	return 0;
 }
 
+#ifdef I915
 /* returns HPLL frequency in kHz */
 int vlv_get_hpll_vco(struct drm_i915_private *dev_priv)
 {
@@ -280,6 +294,7 @@ static void intel_update_czclk(struct drm_i915_private *dev_priv)
 	drm_dbg(&dev_priv->drm, "CZ clock rate: %d kHz\n",
 		dev_priv->czclk_freq);
 }
+#endif
 
 static bool is_hdr_mode(const struct intel_crtc_state *crtc_state)
 {
@@ -879,14 +894,17 @@ __intel_display_resume(struct drm_i915_private *i915,
 	return intel_display_commit_duplicated_state(to_intel_atomic_state(state), ctx);
 }
 
+#ifdef I915
 static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
 {
 	return (INTEL_INFO(dev_priv)->gpu_reset_clobbers_display &&
 		intel_has_gpu_reset(to_gt(dev_priv)));
 }
+#endif
 
 void intel_display_prepare_reset(struct drm_i915_private *dev_priv)
 {
+#ifdef I915
 	struct drm_modeset_acquire_ctx *ctx = &dev_priv->display.restore.reset_ctx;
 	struct drm_atomic_state *state;
 	int ret;
@@ -945,10 +963,12 @@ void intel_display_prepare_reset(struct drm_i915_private *dev_priv)
 
 	dev_priv->display.restore.modeset_state = state;
 	state->acquire_ctx = ctx;
+#endif
 }
 
 void intel_display_finish_reset(struct drm_i915_private *i915)
 {
+#ifdef I915
 	struct drm_modeset_acquire_ctx *ctx = &i915->display.restore.reset_ctx;
 	struct drm_atomic_state *state;
 	int ret;
@@ -996,6 +1016,7 @@ void intel_display_finish_reset(struct drm_i915_private *i915)
 	mutex_unlock(&i915->drm.mode_config.mutex);
 
 	clear_bit_unlock(I915_RESET_MODESET, &to_gt(i915)->reset.flags);
+#endif
 }
 
 static void icl_set_pipe_chicken(const struct intel_crtc_state *crtc_state)
@@ -3123,6 +3144,7 @@ static void i9xx_get_pfit_config(struct intel_crtc_state *crtc_state)
 		intel_de_read(dev_priv, PFIT_PGM_RATIOS);
 }
 
+#ifdef I915
 static void vlv_crtc_clock_get(struct intel_crtc *crtc,
 			       struct intel_crtc_state *pipe_config)
 {
@@ -3183,6 +3205,7 @@ static void chv_crtc_clock_get(struct intel_crtc *crtc,
 
 	pipe_config->port_clock = chv_calc_dpll_params(refclk, &clock);
 }
+#endif
 
 static enum intel_output_format
 bdw_get_pipemisc_output_format(struct intel_crtc *crtc)
@@ -3287,7 +3310,7 @@ static bool i9xx_get_pipe_config(struct intel_crtc *crtc,
 	intel_get_pipe_src_size(crtc, pipe_config);
 
 	i9xx_get_pfit_config(pipe_config);
-
+#ifdef I915
 	if (DISPLAY_VER(dev_priv) >= 4) {
 		/* No way to read it out on pipes B and C */
 		if (IS_CHERRYVIEW(dev_priv) && crtc->pipe != PIPE_A)
@@ -3329,6 +3352,7 @@ static bool i9xx_get_pipe_config(struct intel_crtc *crtc,
 	else if (IS_VALLEYVIEW(dev_priv))
 		vlv_crtc_clock_get(crtc, pipe_config);
 	else
+#endif
 		i9xx_crtc_clock_get(crtc, pipe_config);
 
 	/*
@@ -3987,6 +4011,7 @@ static bool bxt_get_dsi_transcoder_state(struct intel_crtc *crtc,
 					 struct intel_crtc_state *pipe_config,
 					 struct intel_display_power_domain_set *power_domain_set)
 {
+#ifdef I915
 	struct drm_device *dev = crtc->base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	enum transcoder cpu_transcoder;
@@ -4025,6 +4050,7 @@ static bool bxt_get_dsi_transcoder_state(struct intel_crtc *crtc,
 		pipe_config->cpu_transcoder = cpu_transcoder;
 		break;
 	}
+#endif
 
 	return transcoder_is_dsi(pipe_config->cpu_transcoder);
 }
@@ -4129,7 +4155,9 @@ static bool hsw_get_pipe_config(struct intel_crtc *crtc,
 			ilk_get_pfit_config(pipe_config);
 	}
 
+#ifdef I915
 	hsw_ips_get_config(pipe_config);
+#endif
 
 	if (pipe_config->cpu_transcoder != TRANSCODER_EDP &&
 	    !transcoder_is_dsi(pipe_config->cpu_transcoder)) {
@@ -4762,8 +4790,8 @@ static u16 hsw_linetime_wm(const struct intel_crtc_state *crtc_state)
 	return min(linetime_wm, 0x1ff);
 }
 
-static u16 hsw_ips_linetime_wm(const struct intel_crtc_state *crtc_state,
-			       const struct intel_cdclk_state *cdclk_state)
+static inline u16 hsw_ips_linetime_wm(const struct intel_crtc_state *crtc_state,
+				      const struct intel_cdclk_state *cdclk_state)
 {
 	const struct drm_display_mode *pipe_mode =
 		&crtc_state->hw.pipe_mode;
@@ -4806,13 +4834,14 @@ static int hsw_compute_linetime_wm(struct intel_atomic_state *state,
 	struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);
 	struct intel_crtc_state *crtc_state =
 		intel_atomic_get_new_crtc_state(state, crtc);
-	const struct intel_cdclk_state *cdclk_state;
+	__maybe_unused const struct intel_cdclk_state *cdclk_state;
 
 	if (DISPLAY_VER(dev_priv) >= 9)
 		crtc_state->linetime = skl_linetime_wm(crtc_state);
 	else
 		crtc_state->linetime = hsw_linetime_wm(crtc_state);
 
+#ifdef I915
 	if (!hsw_crtc_supports_ips(crtc))
 		return 0;
 
@@ -4822,6 +4851,7 @@ static int hsw_compute_linetime_wm(struct intel_atomic_state *state,
 
 	crtc_state->ips_linetime = hsw_ips_linetime_wm(crtc_state,
 						       cdclk_state);
+#endif
 
 	return 0;
 }
@@ -4890,11 +4920,13 @@ static int intel_crtc_atomic_check(struct intel_atomic_state *state,
 			return ret;
 	}
 
+#ifdef I915
 	if (HAS_IPS(dev_priv)) {
 		ret = hsw_ips_compute_config(state, crtc);
 		if (ret)
 			return ret;
 	}
+#endif
 
 	if (DISPLAY_VER(dev_priv) >= 9 ||
 	    IS_BROADWELL(dev_priv) || IS_HASWELL(dev_priv)) {
@@ -5503,6 +5535,7 @@ pipe_config_mismatch(bool fastset, const struct intel_crtc *crtc,
 
 static bool fastboot_enabled(struct drm_i915_private *dev_priv)
 {
+#ifdef I915
 	if (dev_priv->params.fastboot != -1)
 		return dev_priv->params.fastboot;
 
@@ -5516,6 +5549,9 @@ static bool fastboot_enabled(struct drm_i915_private *dev_priv)
 
 	/* Disabled by default on all others */
 	return false;
+#else
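+	/* All xe platforms are DISPLAY_VER >= 12, where fastboot is enabled by default. */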
+	return true;
+#endif
 }
 
 bool
@@ -7333,6 +7369,7 @@ static void skl_commit_modeset_enables(struct intel_atomic_state *state)
 	drm_WARN_ON(&dev_priv->drm, update_pipes);
 }
 
+#ifdef I915
 static void intel_atomic_helper_free_state(struct drm_i915_private *dev_priv)
 {
 	struct intel_atomic_state *state, *next;
@@ -7350,9 +7387,11 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 
 	intel_atomic_helper_free_state(dev_priv);
 }
+#endif
 
 static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_state)
 {
+#ifdef I915
 	struct wait_queue_entry wait_fence, wait_reset;
 	struct drm_i915_private *dev_priv = to_i915(intel_state->base.dev);
 
@@ -7376,6 +7415,24 @@ static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_stat
 	finish_wait(bit_waitqueue(&to_gt(dev_priv)->reset.flags,
 				  I915_RESET_MODESET),
 		    &wait_reset);
+#else
+	struct intel_plane_state *plane_state;
+	struct intel_plane *plane;
+	int i;
+
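+	/*
+	 * Xe has no i915_sw_fence, so wait inline: first on each plane's
+	 * incoming fence, then on any kernel-internal fences (clears,
+	 * migration) still pending on the backing BO.
+	 */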
+	for_each_new_intel_plane_in_state(intel_state, plane, plane_state, i) {
+		struct xe_bo *bo;
+
+		if (plane_state->uapi.fence)
+			dma_fence_wait(plane_state->uapi.fence, false);
+		bo = intel_fb_obj(plane_state->hw.fb);
+		if (!bo)
+			continue;
+
+		/* TODO: May deadlock, need to grab all fences in prepare_plane_fb */
+		dma_resv_wait_timeout(bo->ttm.base.resv, DMA_RESV_USAGE_KERNEL,
+				      false, MAX_SCHEDULE_TIMEOUT);
+	}
+#endif
 }
 
 static void intel_atomic_cleanup_work(struct work_struct *work)
@@ -7394,9 +7451,45 @@ static void intel_atomic_cleanup_work(struct work_struct *work)
 	drm_atomic_helper_commit_cleanup_done(&state->base);
 	drm_atomic_state_put(&state->base);
 
+#ifdef I915
 	intel_atomic_helper_free_state(i915);
+#endif
 }
 
+#ifndef I915
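+/*
+ * TTM-based stand-in for i915's i915_gem_object_read_from_page(): kmap the
+ * page containing @ofs and read back a single u64, which is all the
+ * clear-color code below needs (hence the size == 8 assertion).
+ */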
+static int i915_gem_object_read_from_page(struct xe_bo *bo,
+					  u32 ofs, u64 *ptr, u32 size)
+{
+	struct ttm_bo_kmap_obj map;
+	void *virtual;
+	bool is_iomem;
+	int ret;
+	struct ww_acquire_ctx ww;
+
+	XE_BUG_ON(size != 8);
+
+	ret = xe_bo_lock(bo, &ww, 0, true);
+	if (ret)
+		return ret;
+
+	ret = ttm_bo_kmap(&bo->ttm, ofs >> PAGE_SHIFT, 1, &map);
+	if (ret)
+		goto out_unlock;
+
+	ofs &= ~PAGE_MASK;
+	virtual = ttm_kmap_obj_virtual(&map, &is_iomem);
+	if (is_iomem)
+		*ptr = readq((void __iomem *)(virtual + ofs));
+	else
+		*ptr = *(u64 *)(virtual + ofs);
+
+	ttm_bo_kunmap(&map);
+out_unlock:
+	xe_bo_unlock(bo, &ww);
+	return ret;
+}
+#endif
+
 static void intel_atomic_prepare_plane_clear_colors(struct intel_atomic_state *state)
 {
 	struct drm_i915_private *i915 = to_i915(state->base.dev);
@@ -7629,6 +7722,7 @@ static void intel_atomic_commit_work(struct work_struct *work)
 	intel_atomic_commit_tail(state);
 }
 
+#ifdef I915
 static int
 intel_atomic_commit_ready(struct i915_sw_fence *fence,
 			  enum i915_sw_fence_notify notify)
@@ -7653,6 +7747,7 @@ intel_atomic_commit_ready(struct i915_sw_fence *fence,
 
 	return NOTIFY_DONE;
 }
+#endif
 
 static void intel_atomic_track_fbs(struct intel_atomic_state *state)
 {
@@ -7677,9 +7772,11 @@ static int intel_atomic_commit(struct drm_device *dev,
 
 	state->wakeref = intel_runtime_pm_get(&dev_priv->runtime_pm);
 
+#ifdef I915
 	drm_atomic_state_get(&state->base);
 	i915_sw_fence_init(&state->commit_ready,
 			   intel_atomic_commit_ready);
+#endif
 
 	/*
 	 * The intel_legacy_cursor_update() fast path takes care
@@ -7783,7 +7880,7 @@ static void intel_plane_possible_crtcs_init(struct drm_i915_private *dev_priv)
 	}
 }
 
-
+#ifdef I915
 int intel_get_pipe_from_crtc_id_ioctl(struct drm_device *dev, void *data,
 				      struct drm_file *file)
 {
@@ -7800,6 +7897,7 @@ int intel_get_pipe_from_crtc_id_ioctl(struct drm_device *dev, void *data,
 
 	return 0;
 }
+#endif
 
 static u32 intel_encoder_possible_clones(struct intel_encoder *encoder)
 {
@@ -7827,7 +7925,7 @@ static u32 intel_encoder_possible_crtcs(struct intel_encoder *encoder)
 	return possible_crtcs;
 }
 
-static bool ilk_has_edp_a(struct drm_i915_private *dev_priv)
+static inline bool ilk_has_edp_a(struct drm_i915_private *dev_priv)
 {
 	if (!IS_MOBILE(dev_priv))
 		return false;
@@ -7841,7 +7939,7 @@ static bool ilk_has_edp_a(struct drm_i915_private *dev_priv)
 	return true;
 }
 
-static bool intel_ddi_crt_present(struct drm_i915_private *dev_priv)
+static inline bool intel_ddi_crt_present(struct drm_i915_private *dev_priv)
 {
 	if (DISPLAY_VER(dev_priv) >= 9)
 		return false;
@@ -7866,7 +7964,6 @@ static bool intel_ddi_crt_present(struct drm_i915_private *dev_priv)
 static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 {
 	struct intel_encoder *encoder;
-	bool dpd_is_edp = false;
 
 	intel_pps_unlock_regs_wa(dev_priv);
 
@@ -7926,7 +8023,9 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 		intel_ddi_init(dev_priv, PORT_A);
 		intel_ddi_init(dev_priv, PORT_B);
 		intel_ddi_init(dev_priv, PORT_C);
+#ifdef I915
 		vlv_dsi_init(dev_priv);
+#endif
 	} else if (DISPLAY_VER(dev_priv) >= 9) {
 		intel_ddi_init(dev_priv, PORT_A);
 		intel_ddi_init(dev_priv, PORT_B);
@@ -7935,9 +8034,10 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 		intel_ddi_init(dev_priv, PORT_E);
 	} else if (HAS_DDI(dev_priv)) {
 		u32 found;
-
+#ifdef I915
 		if (intel_ddi_crt_present(dev_priv))
 			intel_crt_init(dev_priv);
+#endif
 
 		/* Haswell uses DDI functions to detect digital outputs. */
 		found = intel_de_read(dev_priv, DDI_BUF_CTL(PORT_A)) & DDI_INIT_DISPLAY_DETECTED;
@@ -7953,7 +8053,9 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 			intel_ddi_init(dev_priv, PORT_D);
 		if (found & SFUSE_STRAP_DDIF_DETECTED)
 			intel_ddi_init(dev_priv, PORT_F);
+#ifdef I915
 	} else if (HAS_PCH_SPLIT(dev_priv)) {
+		bool dpd_is_edp = false;
 		int found;
 
 		/*
@@ -8090,6 +8192,7 @@ static void intel_setup_outputs(struct drm_i915_private *dev_priv)
 
 		intel_crt_init(dev_priv);
 		intel_dvo_init(dev_priv);
+#endif
 	}
 
 	for_each_intel_encoder(&dev_priv->drm, encoder) {
@@ -8277,6 +8380,10 @@ static const struct intel_display_funcs skl_display_funcs = {
 	.get_initial_plane_config = skl_get_initial_plane_config,
 };
 
+#ifndef I915
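+/*
+ * xe never builds the pre-SKL i9xx plane code; alias the hook so the
+ * display function tables below still compile.
+ */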
+#define i9xx_get_initial_plane_config skl_get_initial_plane_config
+#endif
+
 static const struct intel_display_funcs ddi_display_funcs = {
 	.get_pipe_config = hsw_get_pipe_config,
 	.crtc_enable = hsw_crtc_enable,
@@ -8661,9 +8768,11 @@ int intel_modeset_init_noirq(struct drm_i915_private *i915)
 	if (ret)
 		goto cleanup_vga_client_pw_domain_dmc;
 
+#ifdef I915
 	init_llist_head(&i915->display.atomic_helper.free_list);
 	INIT_WORK(&i915->display.atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
+#endif
 
 	intel_init_quirks(i915);
 
@@ -8716,7 +8825,9 @@ int intel_modeset_init_nogem(struct drm_i915_private *i915)
 	intel_shared_dpll_init(i915);
 	intel_fdi_pll_freq_update(i915);
 
+#ifdef I915
 	intel_update_czclk(i915);
+#endif
 	intel_modeset_init_hw(i915);
 	intel_dpll_update_ref_clks(i915);
 
@@ -8923,11 +9034,14 @@ void intel_display_resume(struct drm_device *dev)
 		drm_atomic_state_put(state);
 }
 
-static void intel_hpd_poll_fini(struct drm_i915_private *i915)
+void intel_hpd_poll_fini(struct drm_i915_private *i915)
 {
 	struct intel_connector *connector;
 	struct drm_connector_list_iter conn_iter;
 
+	if (!HAS_DISPLAY(i915))
+		return;
+
 	/* Kill all the work that may have been queued by hpd. */
 	drm_connector_list_iter_begin(&i915->drm, &conn_iter);
 	for_each_intel_connector_iter(connector, &conn_iter) {
@@ -8950,8 +9064,10 @@ void intel_modeset_driver_remove(struct drm_i915_private *i915)
 	flush_workqueue(i915->display.wq.flip);
 	flush_workqueue(i915->display.wq.modeset);
 
+#ifdef I915
 	flush_work(&i915->display.atomic_helper.free_work);
 	drm_WARN_ON(&i915->drm, !llist_empty(&i915->display.atomic_helper.free_list));
+#endif
 
 	/*
 	 * MST topology needs to be suspended so we don't have any calls to
@@ -9011,12 +9127,14 @@ bool intel_modeset_probe_defer(struct pci_dev *pdev)
 {
 	struct drm_privacy_screen *privacy_screen;
 
+#ifdef I915
 	/*
 	 * apple-gmux is needed on dual GPU MacBook Pro
 	 * to probe the panel if we're the inactive GPU.
 	 */
 	if (vga_switcheroo_client_probe_defer(pdev))
 		return true;
+#endif
 
 	/* If the LCD panel has a privacy-screen, wait for it */
 	privacy_screen = drm_privacy_screen_get(&pdev->dev, NULL);
diff --git a/drivers/gpu/drm/i915/display/intel_display.h b/drivers/gpu/drm/i915/display/intel_display.h
index ef73730f32b0..b063d16f4767 100644
--- a/drivers/gpu/drm/i915/display/intel_display.h
+++ b/drivers/gpu/drm/i915/display/intel_display.h
@@ -545,6 +545,7 @@ int vlv_get_cck_clock(struct drm_i915_private *dev_priv,
 		      const char *name, u32 reg, int ref_freq);
 int vlv_get_cck_clock_hpll(struct drm_i915_private *dev_priv,
 			   const char *name, u32 reg);
+void intel_hpd_poll_fini(struct drm_i915_private *i915);
 void intel_init_display_hooks(struct drm_i915_private *dev_priv);
 unsigned int intel_fb_xy_to_linear(int x, int y,
 				   const struct intel_plane_state *state,
@@ -670,10 +671,16 @@ void assert_transcoder(struct drm_i915_private *dev_priv,
  * enable distros and users to tailor their preferred amount of i915 abrt
  * spam.
  */
+#ifdef I915
+#define i915_display_verbose_check (i915_modparams.verbose_state_checks)
+#else
+#define i915_display_verbose_check 1
+#endif
+
 #define I915_STATE_WARN(condition, format...) ({			\
 	int __ret_warn_on = !!(condition);				\
 	if (unlikely(__ret_warn_on))					\
-		if (!WARN(i915_modparams.verbose_state_checks, format))	\
+		if (!WARN(i915_display_verbose_check, format))	\
 			DRM_ERROR(format);				\
 	unlikely(__ret_warn_on);					\
 })
diff --git a/drivers/gpu/drm/i915/display/intel_display_core.h b/drivers/gpu/drm/i915/display/intel_display_core.h
index 57ddce3ba02b..1c65b5b2893e 100644
--- a/drivers/gpu/drm/i915/display/intel_display_core.h
+++ b/drivers/gpu/drm/i915/display/intel_display_core.h
@@ -227,12 +227,13 @@ struct intel_wm {
 	u16 skl_latency[8];
 
 	/* current hardware state */
+#ifdef I915
 	union {
 		struct ilk_wm_values hw;
 		struct vlv_wm_values vlv;
 		struct g4x_wm_values g4x;
 	};
-
+#endif
 	u8 max_level;
 
 	/*
@@ -274,10 +275,12 @@ struct intel_display {
 	} funcs;
 
 	/* Grouping using anonymous structs. Keep sorted. */
+#ifdef I915
 	struct intel_atomic_helper {
 		struct llist_head free_list;
 		struct work_struct free_work;
 	} atomic_helper;
+#endif
 
 	struct {
 		/* backlight registers and fields in struct intel_panel */
diff --git a/drivers/gpu/drm/i915/display/intel_display_debugfs.c b/drivers/gpu/drm/i915/display/intel_display_debugfs.c
index 7bcd90384a46..6c40ca8a709f 100644
--- a/drivers/gpu/drm/i915/display/intel_display_debugfs.c
+++ b/drivers/gpu/drm/i915/display/intel_display_debugfs.c
@@ -8,7 +8,11 @@
 #include <drm/drm_debugfs.h>
 #include <drm/drm_fourcc.h>
 
+#ifdef I915
 #include "i915_debugfs.h"
+#else
+#define i915_debugfs_describe_obj(a, b) do { } while (0)
+#endif
 #include "i915_irq.h"
 #include "i915_reg.h"
 #include "intel_de.h"
@@ -51,6 +55,7 @@ static int i915_frontbuffer_tracking(struct seq_file *m, void *unused)
 
 static int i915_ips_status(struct seq_file *m, void *unused)
 {
+#ifdef I915
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
 	intel_wakeref_t wakeref;
 
@@ -74,6 +79,9 @@ static int i915_ips_status(struct seq_file *m, void *unused)
 	intel_runtime_pm_put(&dev_priv->runtime_pm, wakeref);
 
 	return 0;
+#else
+	return -ENODEV;
+#endif
 }
 
 static int i915_sr_status(struct seq_file *m, void *unused)
diff --git a/drivers/gpu/drm/i915/display/intel_display_power.c b/drivers/gpu/drm/i915/display/intel_display_power.c
index cdb36e3f96cd..c3a57ec0d2f3 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power.c
@@ -16,11 +16,17 @@
 #include "intel_display_power_well.h"
 #include "intel_display_types.h"
 #include "intel_dmc.h"
-#include "intel_mchbar_regs.h"
+#include "../i915/intel_mchbar_regs.h"
 #include "intel_pch_refclk.h"
 #include "intel_snps_phy.h"
 #include "skl_watermark.h"
+
+#ifdef I915
 #include "vlv_sideband.h"
+#else
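+/*
+ * Dummy Punit register offsets: vlv_punit_is_power_gated() is stubbed out
+ * to return false for xe, so these values are never actually read.
+ */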
+#define PUNIT_REG_ISPSSPM0 0
+#define PUNIT_REG_VEDSSPM0 0
+#endif
 
 #define for_each_power_domain_well(__dev_priv, __power_well, __domain)	\
 	for_each_power_well(__dev_priv, __power_well)				\
@@ -212,8 +218,10 @@ bool __intel_display_power_is_enabled(struct drm_i915_private *dev_priv,
 	struct i915_power_well *power_well;
 	bool is_enabled;
 
+#ifdef I915
 	if (dev_priv->runtime_pm.suspended)
 		return false;
+#endif
 
 	is_enabled = true;
 
@@ -621,7 +629,6 @@ release_async_put_domains(struct i915_power_domains *power_domains,
 	struct drm_i915_private *dev_priv =
 		container_of(power_domains, struct drm_i915_private,
 			     display.power.domains);
-	struct intel_runtime_pm *rpm = &dev_priv->runtime_pm;
 	enum intel_display_power_domain domain;
 	intel_wakeref_t wakeref;
 
@@ -630,8 +637,8 @@ release_async_put_domains(struct i915_power_domains *power_domains,
 	 * wakeref to make the state checker happy about the HW access during
 	 * power well disabling.
 	 */
-	assert_rpm_raw_wakeref_held(rpm);
-	wakeref = intel_runtime_pm_get(rpm);
+	assert_rpm_raw_wakeref_held(&dev_priv->runtime_pm);
+	wakeref = intel_runtime_pm_get(&dev_priv->runtime_pm);
 
 	for_each_power_domain(domain, mask) {
 		/* Clear before put, so put's sanity check is happy. */
@@ -639,7 +646,7 @@ release_async_put_domains(struct i915_power_domains *power_domains,
 		__intel_display_power_put_domain(dev_priv, domain);
 	}
 
-	intel_runtime_pm_put(rpm, wakeref);
+	intel_runtime_pm_put(&dev_priv->runtime_pm, wakeref);
 }
 
 static void
@@ -649,8 +656,7 @@ intel_display_power_put_async_work(struct work_struct *work)
 		container_of(work, struct drm_i915_private,
 			     display.power.domains.async_put_work.work);
 	struct i915_power_domains *power_domains = &dev_priv->display.power.domains;
-	struct intel_runtime_pm *rpm = &dev_priv->runtime_pm;
-	intel_wakeref_t new_work_wakeref = intel_runtime_pm_get_raw(rpm);
+	intel_wakeref_t new_work_wakeref = intel_runtime_pm_get_raw(&dev_priv->runtime_pm);
 	intel_wakeref_t old_work_wakeref = 0;
 
 	mutex_lock(&power_domains->lock);
@@ -689,9 +695,9 @@ intel_display_power_put_async_work(struct work_struct *work)
 	mutex_unlock(&power_domains->lock);
 
 	if (old_work_wakeref)
-		intel_runtime_pm_put_raw(rpm, old_work_wakeref);
+		intel_runtime_pm_put_raw(&dev_priv->runtime_pm, old_work_wakeref);
 	if (new_work_wakeref)
-		intel_runtime_pm_put_raw(rpm, new_work_wakeref);
+		intel_runtime_pm_put_raw(&dev_priv->runtime_pm, new_work_wakeref);
 }
 
 /**
@@ -709,8 +715,7 @@ void __intel_display_power_put_async(struct drm_i915_private *i915,
 				     intel_wakeref_t wakeref)
 {
 	struct i915_power_domains *power_domains = &i915->display.power.domains;
-	struct intel_runtime_pm *rpm = &i915->runtime_pm;
-	intel_wakeref_t work_wakeref = intel_runtime_pm_get_raw(rpm);
+	intel_wakeref_t work_wakeref = intel_runtime_pm_get_raw(&i915->runtime_pm);
 
 	mutex_lock(&power_domains->lock);
 
@@ -737,9 +742,9 @@ void __intel_display_power_put_async(struct drm_i915_private *i915,
 	mutex_unlock(&power_domains->lock);
 
 	if (work_wakeref)
-		intel_runtime_pm_put_raw(rpm, work_wakeref);
+		intel_runtime_pm_put_raw(&i915->runtime_pm, work_wakeref);
 
-	intel_runtime_pm_put(rpm, wakeref);
+	intel_runtime_pm_put(&i915->runtime_pm, wakeref);
 }
 
 /**
@@ -1830,6 +1835,7 @@ static void vlv_cmnlane_wa(struct drm_i915_private *dev_priv)
 
 static bool vlv_punit_is_power_gated(struct drm_i915_private *dev_priv, u32 reg0)
 {
+#ifdef I915
 	bool ret;
 
 	vlv_punit_get(dev_priv);
@@ -1837,6 +1843,9 @@ static bool vlv_punit_is_power_gated(struct drm_i915_private *dev_priv, u32 reg0
 	vlv_punit_put(dev_priv);
 
 	return ret;
+#else
+	return false;
+#endif
 }
 
 static void assert_ved_power_gated(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/display/intel_display_power.h b/drivers/gpu/drm/i915/display/intel_display_power.h
index d220f6b16e00..3aae045749f7 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power.h
+++ b/drivers/gpu/drm/i915/display/intel_display_power.h
@@ -7,6 +7,11 @@
 #define __INTEL_DISPLAY_POWER_H__
 
 #include "intel_wakeref.h"
+#include <linux/bitops.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+#include <linux/workqueue.h>
+
+#include "intel_runtime_pm.h"
 
 enum aux_ch;
 enum dpio_channel;
diff --git a/drivers/gpu/drm/i915/display/intel_display_power_map.c b/drivers/gpu/drm/i915/display/intel_display_power_map.c
index f5d66ca85b19..6e1facc66af3 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power_map.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power_map.c
@@ -6,7 +6,10 @@
 #include "i915_drv.h"
 #include "i915_reg.h"
 
+#ifdef I915
 #include "vlv_sideband_reg.h"
+#endif
 
 #include "intel_display_power_map.h"
 #include "intel_display_power_well.h"
@@ -197,6 +200,7 @@ I915_DECL_PW_DOMAINS(vlv_pwdoms_dpio_tx_bc_lanes,
 	POWER_DOMAIN_INIT);
 
 static const struct i915_power_well_desc vlv_power_wells_main[] = {
+#ifdef I915
 	{
 		.instances = &I915_PW_INSTANCES(
 			I915_PW("display", &vlv_pwdoms_display,
@@ -224,6 +228,7 @@ static const struct i915_power_well_desc vlv_power_wells_main[] = {
 		),
 		.ops = &vlv_dpio_cmn_power_well_ops,
 	},
+#endif
 };
 
 static const struct i915_power_well_desc_list vlv_power_wells[] = {
@@ -274,6 +279,7 @@ I915_DECL_PW_DOMAINS(chv_pwdoms_dpio_cmn_d,
 	POWER_DOMAIN_INIT);
 
 static const struct i915_power_well_desc chv_power_wells_main[] = {
+#ifdef I915
 	{
 		/*
 		 * Pipe A power well is the new disp2d well. Pipe B and C
@@ -295,6 +301,7 @@ static const struct i915_power_well_desc chv_power_wells_main[] = {
 		),
 		.ops = &chv_dpio_cmn_power_well_ops,
 	},
+#endif
 };
 
 static const struct i915_power_well_desc_list chv_power_wells[] = {
diff --git a/drivers/gpu/drm/i915/display/intel_display_power_well.c b/drivers/gpu/drm/i915/display/intel_display_power_well.c
index a1d75956ae97..9683cb661f62 100644
--- a/drivers/gpu/drm/i915/display/intel_display_power_well.c
+++ b/drivers/gpu/drm/i915/display/intel_display_power_well.c
@@ -8,7 +8,6 @@
 #include "intel_backlight_regs.h"
 #include "intel_combo_phy.h"
 #include "intel_combo_phy_regs.h"
-#include "intel_crt.h"
 #include "intel_de.h"
 #include "intel_display_power_well.h"
 #include "intel_display_types.h"
@@ -22,8 +21,12 @@
 #include "intel_tc.h"
 #include "intel_vga.h"
 #include "skl_watermark.h"
+
+#ifdef I915
+#include "intel_crt.h"
 #include "vlv_sideband.h"
 #include "vlv_sideband_reg.h"
+#endif
 
 struct i915_power_well_regs {
 	i915_reg_t bios;
@@ -1061,6 +1064,7 @@ static void i830_pipes_power_well_sync_hw(struct drm_i915_private *dev_priv,
 		i830_pipes_power_well_disable(dev_priv, power_well);
 }
 
+#ifdef I915
 static void vlv_set_power_well(struct drm_i915_private *dev_priv,
 			       struct i915_power_well *power_well, bool enable)
 {
@@ -1719,6 +1723,7 @@ static void chv_pipe_power_well_disable(struct drm_i915_private *dev_priv,
 
 	chv_set_pipe_power_well(dev_priv, power_well, false);
 }
+#endif
 
 static void
 tgl_tc_cold_request(struct drm_i915_private *i915, bool block)
@@ -1843,17 +1848,21 @@ const struct i915_power_well_ops i9xx_always_on_power_well_ops = {
 };
 
 const struct i915_power_well_ops chv_pipe_power_well_ops = {
+#ifdef I915
 	.sync_hw = chv_pipe_power_well_sync_hw,
 	.enable = chv_pipe_power_well_enable,
 	.disable = chv_pipe_power_well_disable,
 	.is_enabled = chv_pipe_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops chv_dpio_cmn_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = chv_dpio_cmn_power_well_enable,
 	.disable = chv_dpio_cmn_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops i830_pipes_power_well_ops = {
@@ -1894,23 +1903,29 @@ const struct i915_power_well_ops bxt_dpio_cmn_power_well_ops = {
 
 const struct i915_power_well_ops vlv_display_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = vlv_display_power_well_enable,
 	.disable = vlv_display_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops vlv_dpio_cmn_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = vlv_dpio_cmn_power_well_enable,
 	.disable = vlv_dpio_cmn_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 const struct i915_power_well_ops vlv_dpio_power_well_ops = {
 	.sync_hw = i9xx_power_well_sync_hw_noop,
+#ifdef I915
 	.enable = vlv_power_well_enable,
 	.disable = vlv_power_well_disable,
 	.is_enabled = vlv_power_well_enabled,
+#endif
 };
 
 static const struct i915_power_well_regs icl_aux_power_well_regs = {
diff --git a/drivers/gpu/drm/i915/display/intel_display_trace.h b/drivers/gpu/drm/i915/display/intel_display_trace.h
index 725aba3fa531..391ddb94062b 100644
--- a/drivers/gpu/drm/i915/display/intel_display_trace.h
+++ b/drivers/gpu/drm/i915/display/intel_display_trace.h
@@ -185,6 +185,7 @@ TRACE_EVENT(intel_memory_cxsr,
 		      __entry->frame[PIPE_C], __entry->scanline[PIPE_C])
 );
 
+#ifdef I915
 TRACE_EVENT(g4x_wm,
 	    TP_PROTO(struct intel_crtc *crtc, const struct g4x_wm_values *wm),
 	    TP_ARGS(crtc, wm),
@@ -277,6 +278,7 @@ TRACE_EVENT(vlv_wm,
 		      __entry->primary, __entry->sprite0, __entry->sprite1, __entry->cursor,
 		      __entry->sr_plane, __entry->sr_cursor)
 );
+#endif
 
 TRACE_EVENT(vlv_fifo_size,
 	    TP_PROTO(struct intel_crtc *crtc, u32 sprite0_start, u32 sprite1_start, u32 fifo_size),
@@ -648,6 +650,10 @@ TRACE_EVENT(intel_frontbuffer_flush,
 /* This part must be outside protection */
 #undef TRACE_INCLUDE_PATH
 #undef TRACE_INCLUDE_FILE
+#ifdef I915
 #define TRACE_INCLUDE_PATH ../../drivers/gpu/drm/i915/display
+#else
+#define TRACE_INCLUDE_PATH ../../drivers/gpu/drm/xe/display
+#endif
 #define TRACE_INCLUDE_FILE intel_display_trace
 #include <trace/define_trace.h>
diff --git a/drivers/gpu/drm/i915/display/intel_display_types.h b/drivers/gpu/drm/i915/display/intel_display_types.h
index 34250a9cf3e1..3bd391d33e42 100644
--- a/drivers/gpu/drm/i915/display/intel_display_types.h
+++ b/drivers/gpu/drm/i915/display/intel_display_types.h
@@ -46,6 +46,7 @@
 #include <drm/i915_mei_hdcp_interface.h>
 #include <media/cec-notifier.h>
 
+#include "i915_utils.h"
 #include "i915_vma.h"
 #include "i915_vma_types.h"
 #include "intel_bios.h"
@@ -141,7 +142,9 @@ struct intel_framebuffer {
 		struct intel_fb_view remapped_view;
 	};
 
+#ifdef I915
 	struct i915_address_space *dpt_vm;
+#endif
 };
 
 enum intel_hotplug_state {
@@ -653,7 +656,9 @@ struct intel_atomic_state {
 
 	bool rps_interactive;
 
+#ifdef I915
 	struct i915_sw_fence commit_ready;
+#endif
 
 	struct llist_node freed;
 };
@@ -679,7 +684,11 @@ struct intel_plane_state {
 	} hw;
 
 	struct i915_vma *ggtt_vma;
+#ifdef I915
 	struct i915_vma *dpt_vma;
+#else
+	struct i915_vma embed_vma;
+#endif
 	unsigned long flags;
 #define PLANE_HAS_FENCE BIT(0)
 
@@ -739,9 +748,9 @@ struct intel_plane_state {
 	 * this plane. They're calculated by the linked plane's wm code.
 	 */
 	u32 planar_slave;
-
+#ifdef I915
 	struct drm_intel_sprite_colorkey ckey;
-
+#endif
 	struct drm_rect psr2_sel_fetch_area;
 
 	/* Clear Color Value */
@@ -851,6 +860,7 @@ struct skl_pipe_wm {
 	bool use_sagv_wm;
 };
 
+#ifdef I915
 enum vlv_wm_level {
 	VLV_WM_LEVEL_PM2,
 	VLV_WM_LEVEL_PM5,
@@ -884,6 +894,7 @@ struct g4x_wm_state {
 	bool hpll_en;
 	bool fbc_en;
 };
+#endif
 
 struct intel_crtc_wm_state {
 	union {
@@ -927,7 +938,7 @@ struct intel_crtc_wm_state {
 			/* pre-icl: for planar Y */
 			struct skl_ddb_entry plane_ddb_y[I915_MAX_PLANES];
 		} skl;
-
+#ifdef I915
 		struct {
 			struct g4x_pipe_wm raw[NUM_VLV_WM_LEVELS]; /* not inverted */
 			struct vlv_wm_state intermediate; /* inverted */
@@ -940,6 +951,7 @@ struct intel_crtc_wm_state {
 			struct g4x_wm_state intermediate;
 			struct g4x_wm_state optimal;
 		} g4x;
+#endif
 	};
 
 	/*
@@ -1387,6 +1399,7 @@ struct intel_crtc {
 	bool pch_fifo_underrun_disabled;
 
 	/* per-pipe watermark state */
+#ifdef I915
 	struct {
 		/* watermarks currently being used  */
 		union {
@@ -1395,6 +1408,7 @@ struct intel_crtc {
 			struct g4x_wm_state g4x;
 		} active;
 	} wm;
+#endif
 
 	struct {
 		struct mutex mutex;
@@ -2053,7 +2067,11 @@ intel_crtc_needs_color_update(const struct intel_crtc_state *crtc_state)
 
 static inline u32 intel_plane_ggtt_offset(const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	return i915_ggtt_offset(plane_state->ggtt_vma);
+#else
+	return plane_state->ggtt_vma->node.start; /* GGTT offsets fit in 32 bits */
+#endif
 }
 
 #endif /*  __INTEL_DISPLAY_TYPES_H__ */
diff --git a/drivers/gpu/drm/i915/display/intel_dmc.c b/drivers/gpu/drm/i915/display/intel_dmc.c
index 905b5dcdca14..5482ca6ccda7 100644
--- a/drivers/gpu/drm/i915/display/intel_dmc.c
+++ b/drivers/gpu/drm/i915/display/intel_dmc.c
@@ -30,6 +30,18 @@
 #include "intel_dmc.h"
 #include "intel_dmc_regs.h"
 
+#ifndef I915
+#include "xe_uc_fw.h"
+
+#define INTEL_UC_FIRMWARE_URL XE_UC_FIRMWARE_URL
+
+struct drm_i915_error_state_buf;
+
+__printf(2, 3)
+static inline void
+i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
+{
+}
+#endif
+
 /**
  * DOC: DMC Firmware Support
  *
@@ -262,8 +274,11 @@ static const struct stepping_info *
 intel_get_stepping_info(struct drm_i915_private *i915,
 			struct stepping_info *si)
 {
+#ifdef I915
 	const char *step_name = intel_step_name(RUNTIME_INFO(i915)->step.display_step);
-
+#else
+	const char *step_name = xe_step_name(i915->info.step.display);
+#endif
 	si->stepping = step_name[0];
 	si->substepping = step_name[1];
 	return si;
diff --git a/drivers/gpu/drm/i915/display/intel_dp.c b/drivers/gpu/drm/i915/display/intel_dp.c
index bf80f296a8fd..55973e9aeca3 100644
--- a/drivers/gpu/drm/i915/display/intel_dp.c
+++ b/drivers/gpu/drm/i915/display/intel_dp.c
@@ -43,8 +43,10 @@
 #include <drm/drm_edid.h>
 #include <drm/drm_probe_helper.h>
 
+#ifdef I915
 #include "g4x_dp.h"
 #include "i915_debugfs.h"
+#endif
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "intel_atomic.h"
@@ -2164,8 +2166,10 @@ intel_dp_compute_config(struct intel_encoder *encoder,
 	if (pipe_config->splitter.enable)
 		pipe_config->dp_m_n.data_m *= pipe_config->splitter.link_count;
 
+#ifdef I915
 	if (!HAS_DDI(dev_priv))
 		g4x_dp_set_clock(encoder, pipe_config);
+#endif
 
 	intel_vrr_compute_config(pipe_config, conn_state);
 	intel_psr_compute_config(intel_dp, pipe_config, conn_state);
@@ -5209,9 +5213,11 @@ intel_edp_add_properties(struct intel_dp *intel_dp)
 static void intel_edp_backlight_setup(struct intel_dp *intel_dp,
 				      struct intel_connector *connector)
 {
-	struct drm_i915_private *i915 = dp_to_i915(intel_dp);
 	enum pipe pipe = INVALID_PIPE;
 
+#ifdef I915
+	struct drm_i915_private *i915 = dp_to_i915(intel_dp);
+
 	if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915)) {
 		/*
 		 * Figure out the current pipe for the initial backlight setup.
@@ -5231,6 +5237,7 @@ static void intel_edp_backlight_setup(struct intel_dp *intel_dp,
 			    connector->base.base.id, connector->base.name,
 			    pipe_name(pipe));
 	}
+#endif
 
 	intel_backlight_setup(connector, pipe);
 }
@@ -5427,8 +5434,10 @@ intel_dp_init_connector(struct intel_digital_port *dig_port,
 	intel_dp_set_default_sink_rates(intel_dp);
 	intel_dp_set_default_max_sink_lane_count(intel_dp);
 
+#ifdef I915
 	if (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv))
 		intel_dp->pps.active_pipe = vlv_active_pipe(intel_dp);
+#endif
 
 	drm_dbg_kms(&dev_priv->drm,
 		    "Adding %s connector on [ENCODER:%d:%s]\n",
diff --git a/drivers/gpu/drm/i915/display/intel_dp_aux.c b/drivers/gpu/drm/i915/display/intel_dp_aux.c
index 220aa88c67ee..b4b9d2e1fec7 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_aux.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_aux.c
@@ -5,7 +5,11 @@
 
 #include "i915_drv.h"
 #include "i915_reg.h"
+#ifdef I915
 #include "i915_trace.h"
+#else
+#define trace_i915_reg_rw(a...) do { } while (0)
+#endif
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_dp_aux.h"
diff --git a/drivers/gpu/drm/i915/display/intel_dpio_phy.h b/drivers/gpu/drm/i915/display/intel_dpio_phy.h
index 9c7725dacb47..952e8d446425 100644
--- a/drivers/gpu/drm/i915/display/intel_dpio_phy.h
+++ b/drivers/gpu/drm/i915/display/intel_dpio_phy.h
@@ -7,6 +7,7 @@
 #define __INTEL_DPIO_PHY_H__
 
 #include <linux/types.h>
+#include "intel_display.h"
 
 enum pipe;
 enum port;
@@ -26,6 +27,7 @@ enum dpio_phy {
 	DPIO_PHY2,
 };
 
+#ifdef I915
 void bxt_port_to_phy_channel(struct drm_i915_private *dev_priv, enum port port,
 			     enum dpio_phy *phy, enum dpio_channel *ch);
 void bxt_ddi_phy_set_signal_levels(struct intel_encoder *encoder,
@@ -71,4 +73,17 @@ void vlv_phy_pre_encoder_enable(struct intel_encoder *encoder,
 void vlv_phy_reset_lanes(struct intel_encoder *encoder,
 			 const struct intel_crtc_state *old_crtc_state);
 
+#else
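+/*
+ * No-op stubs for xe. The odd "&& 0" conditions exist only to evaluate the
+ * macro arguments, avoiding unused-variable warnings in callers.
+ */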
+#define bxt_port_to_phy_channel(xe, port, phy, ch) do { *phy = 0; *ch = 0; } while (xe && port && 0)
+static inline void bxt_ddi_phy_set_signal_levels(struct intel_encoder *x,
+						 const struct intel_crtc_state *y) {}
+#define bxt_ddi_phy_init(xe, phy) do { } while (xe && phy && 0)
+#define bxt_ddi_phy_uninit(xe, phy) do { } while (xe && phy && 0)
+#define bxt_ddi_phy_is_enabled(xe, phy) (xe && phy && 0)
+static inline bool bxt_ddi_phy_verify_state(struct xe_device *xe,
+					    enum dpio_phy phy) { return false; }
+#define bxt_ddi_phy_calc_lane_lat_optim_mask(x) (x && 0)
+#define bxt_ddi_phy_set_lane_optim_mask(x, y) do { } while (x && y && 0)
+#define bxt_ddi_phy_get_lane_lat_optim_mask(x) (x && 0)
+#endif
+
 #endif /* __INTEL_DPIO_PHY_H__ */
diff --git a/drivers/gpu/drm/i915/display/intel_dpll.c b/drivers/gpu/drm/i915/display/intel_dpll.c
index c236aafe9be0..bfc214b36585 100644
--- a/drivers/gpu/drm/i915/display/intel_dpll.c
+++ b/drivers/gpu/drm/i915/display/intel_dpll.c
@@ -17,7 +17,10 @@
 #include "intel_panel.h"
 #include "intel_pps.h"
 #include "intel_snps_phy.h"
+
+#ifdef I915
 #include "vlv_sideband.h"
+#endif
 
 struct intel_dpll_funcs {
 	int (*crtc_compute_clock)(struct intel_atomic_state *state,
@@ -1594,6 +1597,7 @@ void i9xx_enable_pll(const struct intel_crtc_state *crtc_state)
 	}
 }
 
+#ifdef I915
 static void vlv_pllb_recal_opamp(struct drm_i915_private *dev_priv,
 				 enum pipe pipe)
 {
@@ -2005,6 +2009,7 @@ void chv_disable_pll(struct drm_i915_private *dev_priv, enum pipe pipe)
 
 	vlv_dpio_put(dev_priv);
 }
+#endif
 
 void i9xx_disable_pll(const struct intel_crtc_state *crtc_state)
 {
@@ -2023,7 +2028,7 @@ void i9xx_disable_pll(const struct intel_crtc_state *crtc_state)
 	intel_de_posting_read(dev_priv, DPLL(pipe));
 }
 
-
+#ifdef I915
 /**
  * vlv_force_pll_off - forcibly disable just the PLL
  * @dev_priv: i915 private structure
@@ -2039,6 +2044,7 @@ void vlv_force_pll_off(struct drm_i915_private *dev_priv, enum pipe pipe)
 	else
 		vlv_disable_pll(dev_priv, pipe);
 }
+#endif
 
 /* Only for pre-ILK configs */
 static void assert_pll(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/display/intel_dpll_mgr.c b/drivers/gpu/drm/i915/display/intel_dpll_mgr.c
index 1974eb580ed1..56b4055c9ef4 100644
--- a/drivers/gpu/drm/i915/display/intel_dpll_mgr.c
+++ b/drivers/gpu/drm/i915/display/intel_dpll_mgr.c
@@ -607,6 +607,7 @@ static void hsw_ddi_spll_enable(struct drm_i915_private *dev_priv,
 static void hsw_ddi_wrpll_disable(struct drm_i915_private *dev_priv,
 				  struct intel_shared_dpll *pll)
 {
+#ifdef I915
 	const enum intel_dpll_id id = pll->info->id;
 	u32 val;
 
@@ -620,11 +621,13 @@ static void hsw_ddi_wrpll_disable(struct drm_i915_private *dev_priv,
 	 */
 	if (dev_priv->pch_ssc_use & BIT(id))
 		intel_init_pch_refclk(dev_priv);
+#endif
 }
 
 static void hsw_ddi_spll_disable(struct drm_i915_private *dev_priv,
 				 struct intel_shared_dpll *pll)
 {
+#ifdef I915
 	enum intel_dpll_id id = pll->info->id;
 	u32 val;
 
@@ -638,6 +641,7 @@ static void hsw_ddi_spll_disable(struct drm_i915_private *dev_priv,
 	 */
 	if (dev_priv->pch_ssc_use & BIT(id))
 		intel_init_pch_refclk(dev_priv);
+#endif
 }
 
 static bool hsw_ddi_wrpll_get_hw_state(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/display/intel_dsb.c b/drivers/gpu/drm/i915/display/intel_dsb.c
index 3d63c1bf1e4f..0295348df562 100644
--- a/drivers/gpu/drm/i915/display/intel_dsb.c
+++ b/drivers/gpu/drm/i915/display/intel_dsb.c
@@ -4,11 +4,18 @@
  *
  */
 
+/* As with intel_dpt, this depends on some GEM internals; fortunately easier to fix. */
+#ifdef I915
 #include "gem/i915_gem_internal.h"
+#else
+#include "xe_bo.h"
+#include "xe_gt.h"
+#endif
 
 #include "i915_drv.h"
 #include "i915_reg.h"
 #include "intel_de.h"
+#include "intel_dsb.h"
 #include "intel_display_types.h"
 #include "intel_dsb.h"
 
@@ -26,8 +33,12 @@ struct intel_dsb {
 	enum dsb_id id;
 
 	u32 *cmd_buf;
-	struct i915_vma *vma;
 	struct intel_crtc *crtc;
+#ifdef I915
+	struct i915_vma *vma;
+#else
+	struct xe_bo *obj;
+#endif
 
 	/*
 	 * free_pos will point the first free entry position
@@ -70,6 +81,43 @@ struct intel_dsb {
 #define DSB_BYTE_EN_SHIFT		20
 #define DSB_REG_VALUE_MASK		0xfffff
 
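+/*
+ * Small accessors hiding the backing-store difference: i915 writes the DSB
+ * command buffer through a CPU pointer (cmd_buf), xe through the pinned
+ * BO's iosys_map.
+ */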
+static u32 dsb_ggtt_offset(struct intel_dsb *dsb)
+{
+#ifdef I915
+	return i915_ggtt_offset(dsb->vma);
+#else
+	return xe_bo_ggtt_addr(dsb->obj);
+#endif
+}
+
+static void dsb_write(struct intel_dsb *dsb, u32 idx, u32 val)
+{
+#ifdef I915
+	dsb->cmd_buf[idx] = val;
+#else
+	iosys_map_wr(&dsb->obj->vmap, idx * 4, u32, val);
+#endif
+}
+
+static u32 dsb_read(struct intel_dsb *dsb, u32 idx)
+{
+#ifdef I915
+	return dsb->cmd_buf[idx];
+#else
+	return iosys_map_rd(&dsb->obj->vmap, idx * 4, u32);
+#endif
+}
+
+static void dsb_memset(struct intel_dsb *dsb, u32 idx, u32 val, u32 sz)
+{
+#ifdef I915
+	memset(&dsb->cmd_buf[idx], val, sz);
+#else
+	iosys_map_memset(&dsb->obj->vmap, idx * 4, val, sz);
+#endif
+}
+
 static bool is_dsb_busy(struct drm_i915_private *i915, enum pipe pipe,
 			enum dsb_id id)
 {
@@ -130,8 +178,12 @@ void intel_dsb_indexed_reg_write(struct intel_dsb *dsb,
 {
 	struct intel_crtc *crtc = dsb->crtc;
 	struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);
-	u32 *buf = dsb->cmd_buf;
-	u32 reg_val;
+	u32 reg_val, old_val;
 
 	if (drm_WARN_ON(&dev_priv->drm, dsb->free_pos >= DSB_BUF_SIZE)) {
 		drm_dbg_kms(&dev_priv->drm, "DSB buffer overflow\n");
@@ -154,7 +206,7 @@ void intel_dsb_indexed_reg_write(struct intel_dsb *dsb,
 	 * we are writing odd no of dwords, Zeros will be added in the end for
 	 * padding.
 	 */
-	reg_val = buf[dsb->ins_start_offset + 1] & DSB_REG_VALUE_MASK;
+	reg_val = dsb_read(dsb, dsb->ins_start_offset + 1) & DSB_REG_VALUE_MASK;
 	if (reg_val != i915_mmio_reg_offset(reg)) {
 		/* Every instruction should be 8 byte aligned. */
 		dsb->free_pos = ALIGN(dsb->free_pos, 2);
@@ -162,26 +214,27 @@ void intel_dsb_indexed_reg_write(struct intel_dsb *dsb,
 		dsb->ins_start_offset = dsb->free_pos;
 
 		/* Update the size. */
-		buf[dsb->free_pos++] = 1;
+		dsb_write(dsb, dsb->free_pos++, 1);
 
 		/* Update the opcode and reg. */
-		buf[dsb->free_pos++] = (DSB_OPCODE_INDEXED_WRITE  <<
-					DSB_OPCODE_SHIFT) |
-					i915_mmio_reg_offset(reg);
+		dsb_write(dsb, dsb->free_pos++,
+			  (DSB_OPCODE_INDEXED_WRITE << DSB_OPCODE_SHIFT) |
+			  i915_mmio_reg_offset(reg));
 
 		/* Update the value. */
-		buf[dsb->free_pos++] = val;
+		dsb_write(dsb, dsb->free_pos++, val);
 	} else {
 		/* Update the new value. */
-		buf[dsb->free_pos++] = val;
+		dsb_write(dsb, dsb->free_pos++, val);
 
 		/* Update the size. */
-		buf[dsb->ins_start_offset]++;
+		old_val = dsb_read(dsb, dsb->ins_start_offset);
+		dsb_write(dsb, dsb->ins_start_offset, old_val + 1);
 	}
 
 	/* if number of data words is odd, then the last dword should be 0.*/
 	if (dsb->free_pos & 0x1)
-		buf[dsb->free_pos] = 0;
+		dsb_write(dsb, dsb->free_pos, 0);
 }
 
 /**
@@ -201,7 +254,11 @@ void intel_dsb_reg_write(struct intel_dsb *dsb,
 {
 	struct intel_crtc *crtc = dsb->crtc;
 	struct drm_i915_private *dev_priv = to_i915(crtc->base.dev);
-	u32 *buf = dsb->cmd_buf;
 
 	if (drm_WARN_ON(&dev_priv->drm, dsb->free_pos >= DSB_BUF_SIZE)) {
 		drm_dbg_kms(&dev_priv->drm, "DSB buffer overflow\n");
@@ -209,10 +266,11 @@ void intel_dsb_reg_write(struct intel_dsb *dsb,
 	}
 
 	dsb->ins_start_offset = dsb->free_pos;
-	buf[dsb->free_pos++] = val;
-	buf[dsb->free_pos++] = (DSB_OPCODE_MMIO_WRITE  << DSB_OPCODE_SHIFT) |
-			       (DSB_BYTE_EN << DSB_BYTE_EN_SHIFT) |
-			       i915_mmio_reg_offset(reg);
+	dsb_write(dsb, dsb->free_pos++, val);
+	dsb_write(dsb, dsb->free_pos++,
+		  (DSB_OPCODE_MMIO_WRITE  << DSB_OPCODE_SHIFT) |
+		  (DSB_BYTE_EN << DSB_BYTE_EN_SHIFT) |
+		  i915_mmio_reg_offset(reg));
 }
 
 /**
@@ -240,12 +298,11 @@ void intel_dsb_commit(struct intel_dsb *dsb)
 		goto reset;
 	}
 	intel_de_write(dev_priv, DSB_HEAD(pipe, dsb->id),
-		       i915_ggtt_offset(dsb->vma));
+		       dsb_ggtt_offset(dsb));
 
-	tail = ALIGN(dsb->free_pos * 4, CACHELINE_BYTES);
+	tail = ALIGN(dsb->free_pos * 4, 64); /* CACHELINE_BYTES */
 	if (tail > dsb->free_pos * 4)
-		memset(&dsb->cmd_buf[dsb->free_pos], 0,
-		       (tail - dsb->free_pos * 4));
+		dsb_memset(dsb, dsb->free_pos, 0, (tail - dsb->free_pos * 4));
 
 	if (is_dsb_busy(dev_priv, pipe, dsb->id)) {
 		drm_err(&dev_priv->drm,
@@ -254,9 +311,9 @@ void intel_dsb_commit(struct intel_dsb *dsb)
 	}
 	drm_dbg_kms(&dev_priv->drm,
 		    "DSB execution started - head 0x%x, tail 0x%x\n",
-		    i915_ggtt_offset(dsb->vma), tail);
+		    dsb_ggtt_offset(dsb), tail);
 	intel_de_write(dev_priv, DSB_TAIL(pipe, dsb->id),
-		       i915_ggtt_offset(dsb->vma) + tail);
+		       dsb_ggtt_offset(dsb) + tail);
 	if (wait_for(!is_dsb_busy(dev_priv, pipe, dsb->id), 1)) {
 		drm_err(&dev_priv->drm,
 			"Timed out waiting for DSB workload completion.\n");
@@ -284,9 +341,9 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
 	struct drm_i915_private *i915 = to_i915(crtc->base.dev);
 	struct intel_dsb *dsb;
 	struct drm_i915_gem_object *obj;
-	struct i915_vma *vma;
-	u32 *buf;
+	__maybe_unused struct i915_vma *vma;
 	intel_wakeref_t wakeref;
+	__maybe_unused u32 *buf;
 
 	if (!HAS_DSB(i915))
 		return NULL;
@@ -297,6 +354,7 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
 
 	wakeref = intel_runtime_pm_get(&i915->runtime_pm);
 
+#ifdef I915
 	obj = i915_gem_object_create_internal(i915, DSB_BUF_SIZE);
 	if (IS_ERR(obj))
 		goto out_put_rpm;
@@ -319,6 +377,18 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
 	dsb->vma = vma;
 	dsb->crtc = crtc;
 	dsb->cmd_buf = buf;
+#else
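+	/*
+	 * A single pinned, CPU-mapped buffer: placed in VRAM on discrete
+	 * platforms, in system memory otherwise, and always bound in the GGTT.
+	 */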
+	obj = xe_bo_create_pin_map(i915, to_gt(i915), NULL, DSB_BUF_SIZE,
+				   ttm_bo_type_kernel,
+				   XE_BO_CREATE_VRAM_IF_DGFX(to_gt(i915)) |
+				   XE_BO_CREATE_GGTT_BIT);
+	if (IS_ERR(obj)) {
+		kfree(dsb);
+		goto out_put_rpm;
+	}
+	dsb->obj = obj;
+#endif
+	dsb->id = DSB1;
 	dsb->free_pos = 0;
 	dsb->ins_start_offset = 0;
 
@@ -343,6 +413,10 @@ struct intel_dsb *intel_dsb_prepare(struct intel_crtc *crtc)
  */
 void intel_dsb_cleanup(struct intel_dsb *dsb)
 {
+#ifdef I915
 	i915_vma_unpin_and_release(&dsb->vma, I915_VMA_RELEASE_MAP);
+#else
+	xe_bo_unpin_map_no_vm(dsb->obj);
+#endif
 	kfree(dsb);
 }
diff --git a/drivers/gpu/drm/i915/display/intel_dsi_vbt.c b/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
index 2cbc1292ab38..b45552d96c0c 100644
--- a/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
+++ b/drivers/gpu/drm/i915/display/intel_dsi_vbt.c
@@ -46,9 +46,11 @@
 #include "intel_dsi.h"
 #include "intel_dsi_vbt.h"
 #include "intel_gmbus_regs.h"
+#ifdef I915
 #include "vlv_dsi.h"
 #include "vlv_dsi_regs.h"
 #include "vlv_sideband.h"
+#endif
 
 #define MIPI_TRANSFER_MODE_SHIFT	0
 #define MIPI_VIRTUAL_CHANNEL_SHIFT	1
@@ -76,6 +78,7 @@ struct gpio_map {
 	bool init;
 };
 
+#ifdef I915
 static struct gpio_map vlv_gpio_table[] = {
 	{ VLV_GPIO_NC_0_HV_DDI0_HPD },
 	{ VLV_GPIO_NC_1_HV_DDI0_DDC_SDA },
@@ -90,6 +93,7 @@ static struct gpio_map vlv_gpio_table[] = {
 	{ VLV_GPIO_NC_10_PANEL1_BKLTEN },
 	{ VLV_GPIO_NC_11_PANEL1_BKLTCTL },
 };
+#endif
 
 struct i2c_adapter_lookup {
 	u16 slave_addr;
@@ -219,10 +223,10 @@ static const u8 *mipi_exec_send_packet(struct intel_dsi *intel_dsi,
 		mipi_dsi_dcs_write_buffer(dsi_device, data, len);
 		break;
 	}
-
+#ifdef I915
 	if (DISPLAY_VER(dev_priv) < 11)
 		vlv_dsi_wait_for_fifo_empty(intel_dsi, port);
-
+#endif
 out:
 	data += len;
 
@@ -242,6 +246,7 @@ static const u8 *mipi_exec_delay(struct intel_dsi *intel_dsi, const u8 *data)
 	return data;
 }
 
+#ifdef I915
 static void vlv_exec_gpio(struct intel_connector *connector,
 			  u8 gpio_source, u8 gpio_index, bool value)
 {
@@ -370,6 +375,7 @@ static void bxt_exec_gpio(struct intel_connector *connector,
 
 	gpiod_set_value(gpio_desc, value);
 }
+#endif
 
 static void icl_exec_gpio(struct intel_connector *connector,
 			  u8 gpio_source, u8 gpio_index, bool value)
@@ -491,12 +497,14 @@ static const u8 *mipi_exec_gpio(struct intel_dsi *intel_dsi, const u8 *data)
 		icl_native_gpio_set_value(dev_priv, gpio_number, value);
 	else if (DISPLAY_VER(dev_priv) >= 11)
 		icl_exec_gpio(connector, gpio_source, gpio_index, value);
+#ifdef I915
 	else if (IS_VALLEYVIEW(dev_priv))
 		vlv_exec_gpio(connector, gpio_source, gpio_number, value);
 	else if (IS_CHERRYVIEW(dev_priv))
 		chv_exec_gpio(connector, gpio_source, gpio_number, value);
 	else
 		bxt_exec_gpio(connector, gpio_source, gpio_index, value);
+#endif
 
 	return data;
 }
@@ -821,8 +829,10 @@ void intel_dsi_log_params(struct intel_dsi *intel_dsi)
 		    intel_dsi->clk_lp_to_hs_count);
 	drm_dbg_kms(&i915->drm, "HS to LP Clock Count 0x%x\n",
 		    intel_dsi->clk_hs_to_lp_count);
+#ifdef I915
 	drm_dbg_kms(&i915->drm, "BTA %s\n",
 		    str_enabled_disabled(!(intel_dsi->video_frmt_cfg_bits & DISABLE_VIDEO_BTA)));
+#endif
 }
 
 bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
@@ -841,9 +851,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
@@ -841,9 +851,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
 	intel_dsi->eotp_pkt = mipi_config->eot_pkt_disabled ? 0 : 1;
 	intel_dsi->clock_stop = mipi_config->enable_clk_stop ? 1 : 0;
 	intel_dsi->lane_count = mipi_config->lane_cnt + 1;
-	intel_dsi->pixel_format =
-			pixel_format_from_register_bits(
-				mipi_config->videomode_color_format << 7);
+	intel_dsi->pixel_format = mipi_config->videomode_color_format << 7;
 
 	intel_dsi->dual_link = mipi_config->dual_link;
 	intel_dsi->pixel_overlap = mipi_config->pixel_overlap;
@@ -857,7 +865,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
 	intel_dsi->init_count = mipi_config->master_init_timer;
 	intel_dsi->bw_timer = mipi_config->dbi_bw_timer;
 	intel_dsi->video_frmt_cfg_bits =
-		mipi_config->bta_enabled ? DISABLE_VIDEO_BTA : 0;
+		mipi_config->bta_enabled ? BIT(3) : 0; /* BIT(3) == DISABLE_VIDEO_BTA */
 	intel_dsi->bgr_enabled = mipi_config->rgb_flip;
 
 	/* Starting point, adjusted depending on dual link and burst mode */
@@ -940,6 +948,7 @@ bool intel_dsi_vbt_init(struct intel_dsi *intel_dsi, u16 panel_id)
  * If the GOP did not initialize the panel (HDMI inserted) we may need to also
  * change the pinmux for the SoC's PWM0 pin from GPIO to PWM.
  */
+#ifdef I915
 static struct gpiod_lookup_table pmic_panel_gpio_table = {
 	/* Intel GFX is consumer */
 	.dev_id = "0000:00:02.0",
@@ -963,9 +972,11 @@ static const struct pinctrl_map soc_pwm_pinctrl_map[] = {
 	PIN_MAP_MUX_GROUP("0000:00:02.0", "soc_pwm0", "INT33FC:00",
 			  "pwm0_grp", "pwm"),
 };
+#endif
 
 void intel_dsi_vbt_gpio_init(struct intel_dsi *intel_dsi, bool panel_is_on)
 {
+#ifdef I915
 	struct drm_device *dev = intel_dsi->base.base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct intel_connector *connector = intel_dsi->attached_connector;
@@ -1018,10 +1029,12 @@ void intel_dsi_vbt_gpio_init(struct intel_dsi *intel_dsi, bool panel_is_on)
 			intel_dsi->gpio_backlight = NULL;
 		}
 	}
+#endif
 }
 
 void intel_dsi_vbt_gpio_cleanup(struct intel_dsi *intel_dsi)
 {
+#ifdef I915
 	struct drm_device *dev = intel_dsi->base.base.dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct intel_connector *connector = intel_dsi->attached_connector;
@@ -1045,4 +1058,5 @@ void intel_dsi_vbt_gpio_cleanup(struct intel_dsi *intel_dsi)
 		pinctrl_unregister_mappings(soc_pwm_pinctrl_map);
 		gpiod_remove_lookup_table(&soc_panel_gpio_table);
 	}
+#endif
 }
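
The hunks above trade runtime platform checks for compile-time ones: the
VLV/CHV/BXT GPIO helpers exist only in the i915 build, and since Xe starts
at Tiger Lake the legacy else-arms in mipi_exec_gpio() can never be reached
there, so the xe build drops them outright. A compilable sketch of the
pattern, with invented names (exec_gpio_old/exec_gpio_new are illustrative,
not from the patch):

#include <stdbool.h>
#include <stdio.h>

#ifdef I915
/* legacy (VLV/CHV-era) path: exists only in the i915 build */
static void exec_gpio_old(int idx, bool value)
{
	printf("legacy gpio %d <- %d\n", idx, value);
}
#endif

static void exec_gpio_new(int idx, bool value)
{
	printf("gpio %d <- %d\n", idx, value);
}

static void exec_gpio(int ver, int idx, bool value)
{
	if (ver >= 11)
		exec_gpio_new(idx, value);
#ifdef I915
	else
		exec_gpio_old(idx, value);
#endif
	/* in the xe build no supported platform has ver < 11,
	 * so the legacy arm is simply compiled out */
}

int main(void)
{
	exec_gpio(12, 0, true);
	return 0;
}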
diff --git a/drivers/gpu/drm/i915/display/intel_fb.c b/drivers/gpu/drm/i915/display/intel_fb.c
index 56cdacf33db2..e0a8d9e9df9a 100644
--- a/drivers/gpu/drm/i915/display/intel_fb.c
+++ b/drivers/gpu/drm/i915/display/intel_fb.c
@@ -4,6 +4,7 @@
  */
 
 #include <drm/drm_blend.h>
+#include <drm/drm_damage_helper.h>
 #include <drm/drm_framebuffer.h>
 #include <drm/drm_modeset_helper.h>
 
@@ -14,6 +15,16 @@
 #include "intel_fb.h"
 #include "intel_frontbuffer.h"
 
+#ifdef I915
+/*
+ * i915 requires obj->__do_not_access.base,
+ * xe uses obj->ttm.base
+ */
+#define ttm __do_not_access
+#else
+#include <drm/ttm/ttm_bo.h>
+#endif
+
 #define check_array_bounds(i915, a, i) drm_WARN_ON(&(i915)->drm, (i) >= ARRAY_SIZE(a))
 
 /*
@@ -697,6 +708,7 @@ intel_fb_align_height(const struct drm_framebuffer *fb,
 	return ALIGN(height, tile_height);
 }
 
+#ifdef I915
 static unsigned int intel_fb_modifier_to_tiling(u64 fb_modifier)
 {
 	u8 tiling_caps = lookup_modifier(fb_modifier)->plane_caps &
@@ -716,6 +728,7 @@ static unsigned int intel_fb_modifier_to_tiling(u64 fb_modifier)
 		return I915_TILING_NONE;
 	}
 }
+#endif
 
 static bool intel_modifier_uses_dpt(struct drm_i915_private *i915, u64 modifier)
 {
@@ -1234,7 +1247,6 @@ static bool intel_plane_needs_remap(const struct intel_plane_state *plane_state)
 static int convert_plane_offset_to_xy(const struct intel_framebuffer *fb, int color_plane,
 				      int plane_width, int *x, int *y)
 {
-	struct drm_i915_gem_object *obj = intel_fb_obj(&fb->base);
 	int ret;
 
 	ret = intel_fb_offset_to_xy(x, y, &fb->base, color_plane);
@@ -1258,13 +1270,15 @@ static int convert_plane_offset_to_xy(const struct intel_framebuffer *fb, int co
 	 * fb layout agrees with the fence layout. We already check that the
 	 * fb stride matches the fence stride elsewhere.
 	 */
-	if (color_plane == 0 && i915_gem_object_is_tiled(obj) &&
+#ifdef I915
+	if (color_plane == 0 && i915_gem_object_is_tiled(intel_fb_obj(&fb->base)) &&
 	    (*x + plane_width) * fb->base.format->cpp[color_plane] > fb->base.pitches[color_plane]) {
 		drm_dbg_kms(fb->base.dev,
 			    "bad fb plane %d offset: 0x%x\n",
 			    color_plane, fb->base.offsets[color_plane]);
 		return -EINVAL;
 	}
+#endif
 
 	return 0;
 }
@@ -1611,10 +1625,10 @@ int intel_fill_fb_info(struct drm_i915_private *i915, struct intel_framebuffer *
 		max_size = max(max_size, offset + size);
 	}
 
-	if (mul_u32_u32(max_size, tile_size) > obj->base.size) {
+	if (mul_u32_u32(max_size, tile_size) > obj->ttm.base.size) {
 		drm_dbg_kms(&i915->drm,
 			    "fb too big for bo (need %llu bytes, have %zu bytes)\n",
-			    mul_u32_u32(max_size, tile_size), obj->base.size);
+			    mul_u32_u32(max_size, tile_size), obj->ttm.base.size);
 		return -EINVAL;
 	}
 
@@ -1830,8 +1844,10 @@ static void intel_user_framebuffer_destroy(struct drm_framebuffer *fb)
 
 	drm_framebuffer_cleanup(fb);
 
+#ifdef I915
 	if (intel_fb_uses_dpt(fb))
 		intel_dpt_destroy(intel_fb->dpt_vm);
+#endif
 
 	drm_gem_object_put(fb->obj[0]);
 	kfree(intel_fb);
@@ -1842,47 +1858,53 @@ static int intel_user_framebuffer_create_handle(struct drm_framebuffer *fb,
 						unsigned int *handle)
 {
 	struct drm_i915_gem_object *obj = intel_fb_obj(fb);
-	struct drm_i915_private *i915 = to_i915(obj->base.dev);
 
+#ifdef I915
 	if (i915_gem_object_is_userptr(obj)) {
-		drm_dbg(&i915->drm,
+		drm_dbg(fb->dev,
 			"attempting to use a userptr for a framebuffer, denied\n");
 		return -EINVAL;
 	}
+#endif
 
-	return drm_gem_handle_create(file, &obj->base, handle);
+	return drm_gem_handle_create(file, &obj->ttm.base, handle);
 }
 
+#ifdef I915
 static int intel_user_framebuffer_dirty(struct drm_framebuffer *fb,
 					struct drm_file *file,
 					unsigned int flags, unsigned int color,
 					struct drm_clip_rect *clips,
 					unsigned int num_clips)
 {
-	struct drm_i915_gem_object *obj = intel_fb_obj(fb);
-
-	i915_gem_object_flush_if_display(obj);
+	i915_gem_object_flush_if_display(intel_fb_obj(fb));
 	intel_frontbuffer_flush(to_intel_framebuffer(fb), ORIGIN_DIRTYFB);
 
 	return 0;
 }
+#endif
 
 static const struct drm_framebuffer_funcs intel_fb_funcs = {
 	.destroy = intel_user_framebuffer_destroy,
 	.create_handle = intel_user_framebuffer_create_handle,
+#ifdef I915
 	.dirty = intel_user_framebuffer_dirty,
+#else
+	.dirty = drm_atomic_helper_dirtyfb,
+#endif
 };
 
 int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			   struct drm_i915_gem_object *obj,
 			   struct drm_mode_fb_cmd2 *mode_cmd)
 {
-	struct drm_i915_private *dev_priv = to_i915(obj->base.dev);
+	struct drm_i915_private *dev_priv = to_i915(obj->ttm.base.dev);
 	struct drm_framebuffer *fb = &intel_fb->base;
 	u32 max_stride;
-	unsigned int tiling, stride;
 	int ret = -EINVAL;
 	int i;
+#ifdef I915
+	unsigned tiling, stride;
 
 	i915_gem_object_lock(obj, NULL);
 	tiling = i915_gem_object_get_tiling(obj);
@@ -1909,6 +1931,29 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			goto err;
 		}
 	}
+#else
+	ret = ttm_bo_reserve(&obj->ttm, true, false, NULL);
+	if (ret)
+		goto err;
+	ret = -EINVAL;
+
+	if (!(obj->flags & XE_BO_SCANOUT_BIT)) {
+		/*
+		 * XE_BO_SCANOUT_BIT should ideally be set at creation, or is
+		 * automatically set when creating FB. We cannot change caching
+		 * mode when the object is VM_BINDed, so we can only set
+		 * coherency with display when unbound.
+		 */
+		if (XE_IOCTL_ERR(dev_priv, !list_empty(&obj->vmas))) {
+			ttm_bo_unreserve(&obj->ttm);
+			goto err;
+		}
+		obj->flags |= XE_BO_SCANOUT_BIT;
+	}
+	ttm_bo_unreserve(&obj->ttm);
+#endif
+
+	atomic_set(&intel_fb->bits, 0);
 
 	if (!drm_any_plane_has_format(&dev_priv->drm,
 				      mode_cmd->pixel_format,
@@ -1919,6 +1964,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 		goto err;
 	}
 
+#ifdef I915
 	/*
 	 * gen2/3 display engine uses the fence if present,
 	 * so the tiling mode must match the fb modifier exactly.
@@ -1929,6 +1975,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			    "tiling_mode must match fb modifier exactly on gen2/3\n");
 		goto err;
 	}
+#endif
 
 	max_stride = intel_fb_max_stride(dev_priv, mode_cmd->pixel_format,
 					 mode_cmd->modifier[0]);
@@ -1941,6 +1988,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 		goto err;
 	}
 
+#ifdef I915
 	/*
 	 * If there's a fence, enforce that
 	 * the fb pitch and fence stride match.
@@ -1951,6 +1999,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			    mode_cmd->pitches[0], stride);
 		goto err;
 	}
+#endif
 
 	/* FIXME need to adjust LINOFF/TILEOFF accordingly. */
 	if (mode_cmd->offsets[0] != 0) {
@@ -1991,13 +2040,14 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 			}
 		}
 
-		fb->obj[i] = &obj->base;
+		fb->obj[i] = &obj->ttm.base;
 	}
 
 	ret = intel_fill_fb_info(dev_priv, intel_fb);
 	if (ret)
 		goto err;
 
+#ifdef I915
 	if (intel_fb_uses_dpt(fb)) {
 		struct i915_address_space *vm;
 
@@ -2009,6 +2059,7 @@ int intel_framebuffer_init(struct intel_framebuffer *intel_fb,
 
 		intel_fb->dpt_vm = vm;
 	}
+#endif
 
 	ret = drm_framebuffer_init(&dev_priv->drm, fb, &intel_fb_funcs);
 	if (ret) {
@@ -2031,22 +2082,35 @@ intel_user_framebuffer_create(struct drm_device *dev,
 	struct drm_framebuffer *fb;
 	struct drm_i915_gem_object *obj;
 	struct drm_mode_fb_cmd2 mode_cmd = *user_mode_cmd;
-	struct drm_i915_private *i915;
+	struct drm_i915_private *i915 = to_i915(dev);
 
+#ifdef I915
 	obj = i915_gem_object_lookup(filp, mode_cmd.handles[0]);
 	if (!obj)
 		return ERR_PTR(-ENOENT);
 
 	/* object is backed with LMEM for discrete */
-	i915 = to_i915(obj->base.dev);
 	if (HAS_LMEM(i915) && !i915_gem_object_can_migrate(obj, INTEL_REGION_LMEM_0)) {
 		/* object is "remote", not in local memory */
 		i915_gem_object_put(obj);
 		return ERR_PTR(-EREMOTE);
 	}
+#else
+	struct drm_gem_object *gem = drm_gem_object_lookup(filp, mode_cmd.handles[0]);
+	if (!gem)
+		return ERR_PTR(-ENOENT);
+
+	obj = gem_to_xe_bo(gem);
+	/* Require vram exclusive objects, but allow dma-buf imports */
+	if (IS_DGFX(i915) && obj->flags & XE_BO_CREATE_SYSTEM_BIT &&
+	    obj->ttm.type != ttm_bo_type_sg) {
+		drm_gem_object_put(gem);
+		return ERR_PTR(-EREMOTE);
+	}
+#endif
 
 	fb = intel_framebuffer_create(obj, &mode_cmd);
-	i915_gem_object_put(obj);
+	drm_gem_object_put(&obj->ttm.base);
 
 	return fb;
 }
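
The "#define ttm __do_not_access" trick above deserves a closer look: it
lets shared display code spell the embedded GEM object as obj->ttm.base in
both builds, because xe's bo really has a ttm member while in i915 the
token is rewritten to the deliberately discouraging field name. Note the
define rewrites every later use of the identifier "ttm" in that file,
which presumably only works because these files don't otherwise use the
name. A minimal compilable sketch with simplified structure layouts
(gem_base/bo/obj are invented stand-ins):

#include <stdio.h>

struct gem_base { unsigned long size; };
/* stands in for ttm_buffer_object: embeds the GEM base object */
struct bo { struct gem_base base; };

#ifdef I915
/* i915 names the embedded bo __do_not_access to discourage direct
 * use; rewrite the shared spelling at the token level */
struct obj { struct bo __do_not_access; };
#define ttm __do_not_access
#else
struct obj { struct bo ttm; };
#endif

int main(void)
{
	struct obj o = { .ttm = { .base = { .size = 4096 } } };

	/* one spelling compiles against both layouts */
	printf("size=%lu\n", o.ttm.base.size);
	return 0;
}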
diff --git a/drivers/gpu/drm/i915/display/intel_fbc.c b/drivers/gpu/drm/i915/display/intel_fbc.c
index 5e69d3c11d21..77c848b5b7ae 100644
--- a/drivers/gpu/drm/i915/display/intel_fbc.c
+++ b/drivers/gpu/drm/i915/display/intel_fbc.c
@@ -45,7 +45,9 @@
 
 #include "i915_drv.h"
 #include "i915_utils.h"
+#ifdef I915
 #include "i915_vgpu.h"
+#endif
 #include "intel_cdclk.h"
 #include "intel_de.h"
 #include "intel_display_trace.h"
@@ -53,6 +55,32 @@
 #include "intel_fbc.h"
 #include "intel_frontbuffer.h"
 
+#ifdef I915
+
+#define i915_gem_stolen_initialized(i915) (drm_mm_initialized(&(i915)->mm.stolen))
+
+#else
+
+/* No stolen memory support in xe yet */
+static int i915_gem_stolen_insert_node_in_range(struct xe_device *xe, void *ptr, u32 size, u32 align, u32 start, u32 end)
+{
+	return -ENODEV;
+}
+
+static int i915_gem_stolen_insert_node(struct xe_device *xe, void *ptr, u32 size, u32 align)
+{
+	XE_WARN_ON(1);
+	return -ENODEV;
+}
+
+static void i915_gem_stolen_remove_node(struct xe_device *xe, void *ptr)
+{
+}
+
+#define i915_gem_stolen_initialized(xe) ((xe) && 0)
+
+#endif
+
 #define for_each_fbc_id(__dev_priv, __fbc_id) \
 	for ((__fbc_id) = INTEL_FBC_A; (__fbc_id) < I915_MAX_FBCS; (__fbc_id)++) \
 		for_each_if(RUNTIME_INFO(__dev_priv)->fbc_mask & BIT(__fbc_id))
@@ -329,6 +357,7 @@ static void i8xx_fbc_nuke(struct intel_fbc *fbc)
 
 static void i8xx_fbc_program_cfb(struct intel_fbc *fbc)
 {
+#ifdef I915
 	struct drm_i915_private *i915 = fbc->i915;
 
 	GEM_BUG_ON(range_overflows_end_t(u64, i915->dsm.start,
@@ -340,6 +369,7 @@ static void i8xx_fbc_program_cfb(struct intel_fbc *fbc)
 		       i915->dsm.start + fbc->compressed_fb.start);
 	intel_de_write(i915, FBC_LL_BASE,
 		       i915->dsm.start + fbc->compressed_llb.start);
+#endif
 }
 
 static const struct intel_fbc_funcs i8xx_fbc_funcs = {
@@ -604,8 +634,10 @@ static void ivb_fbc_activate(struct intel_fbc *fbc)
 	else if (DISPLAY_VER(i915) == 9)
 		skl_fbc_program_cfb_stride(fbc);
 
+#ifdef I915
 	if (to_gt(i915)->ggtt->num_fences)
 		snb_fbc_program_fence(fbc);
+#endif
 
 	intel_de_write(i915, ILK_DPFC_CONTROL(fbc->id),
 		       DPFC_CTL_EN | ivb_dpfc_ctl(fbc));
@@ -710,10 +742,14 @@ static u64 intel_fbc_stolen_end(struct drm_i915_private *i915)
 	 * reserved range size, so it always assumes the maximum (8mb) is used.
 	 * If we enable FBC using a CFB on that memory range we'll get FIFO
 	 * underruns, even if that range is not reserved by the BIOS. */
+#ifdef I915
 	if (IS_BROADWELL(i915) ||
 	    (DISPLAY_VER(i915) == 9 && !IS_BROXTON(i915)))
 		end = resource_size(&i915->dsm) - 8 * 1024 * 1024;
 	else
+#else
+	/* TODO */
+#endif
 		end = U64_MAX;
 
 	return min(end, intel_fbc_cfb_base_max(i915));
@@ -799,7 +835,7 @@ static int intel_fbc_alloc_cfb(struct intel_fbc *fbc,
 	if (drm_mm_node_allocated(&fbc->compressed_llb))
 		i915_gem_stolen_remove_node(i915, &fbc->compressed_llb);
 err:
-	if (drm_mm_initialized(&i915->mm.stolen))
+	if (i915_gem_stolen_initialized(i915))
 		drm_info_once(&i915->drm, "not enough stolen space for compressed buffer (need %d more bytes), disabling. Hint: you may be able to increase stolen memory size in the BIOS to avoid this.\n", size);
 	return -ENOSPC;
 }
@@ -970,7 +1006,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 				   struct intel_crtc *crtc,
 				   struct intel_plane *plane)
 {
-	struct drm_i915_private *i915 = to_i915(state->base.dev);
+	__maybe_unused struct drm_i915_private *i915 = to_i915(state->base.dev);
 	const struct intel_crtc_state *crtc_state =
 		intel_atomic_get_new_crtc_state(state, crtc);
 	const struct intel_plane_state *plane_state =
@@ -985,7 +1021,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 
 	/* FBC1 compression interval: arbitrary choice of 1 second */
 	fbc_state->interval = drm_mode_vrefresh(&crtc_state->hw.adjusted_mode);
-
+#ifdef I915
 	fbc_state->fence_y_offset = intel_plane_fence_y_offset(plane_state);
 
 	drm_WARN_ON(&i915->drm, plane_state->flags & PLANE_HAS_FENCE &&
@@ -995,6 +1031,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 	    plane_state->ggtt_vma->fence)
 		fbc_state->fence_id = plane_state->ggtt_vma->fence->id;
 	else
+#endif
 		fbc_state->fence_id = -1;
 
 	fbc_state->cfb_stride = intel_fbc_cfb_stride(plane_state);
@@ -1004,6 +1041,7 @@ static void intel_fbc_update_state(struct intel_atomic_state *state,
 
 static bool intel_fbc_is_fence_ok(const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	struct drm_i915_private *i915 = to_i915(plane_state->uapi.plane->dev);
 
 	/*
@@ -1021,6 +1059,9 @@ static bool intel_fbc_is_fence_ok(const struct intel_plane_state *plane_state)
 	return DISPLAY_VER(i915) >= 9 ||
 		(plane_state->flags & PLANE_HAS_FENCE &&
 		 plane_state->ggtt_vma->fence);
+#else
+	return true;
+#endif
 }
 
 static bool intel_fbc_is_cfb_ok(const struct intel_plane_state *plane_state)
@@ -1706,7 +1747,7 @@ void intel_fbc_init(struct drm_i915_private *i915)
 {
 	enum intel_fbc_id fbc_id;
 
-	if (!drm_mm_initialized(&i915->mm.stolen))
+	if (!i915_gem_stolen_initialized(i915))
 		RUNTIME_INFO(i915)->fbc_mask = 0;
 
 	if (need_fbc_vtd_wa(i915))
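
With no stolen memory support in xe yet, the stubs above make every
allocation attempt fail cleanly with -ENODEV, while
i915_gem_stolen_initialized() evaluates to constant false; the
"((xe) && 0)" form still uses its argument, avoiding unused-variable
warnings, so the callers' stolen-only branches compile out as dead code
with no further #ifdefs. A small runnable sketch of the stub-and-predicate
pattern (names simplified):

#include <errno.h>
#include <stdio.h>

/* xe flavour of the stubs: allocation always fails cleanly */
static int stolen_insert_node(void *node, unsigned int size)
{
	(void)node;
	(void)size;
	return -ENODEV;	/* no stolen memory support yet */
}

/* constant-false predicate, so callers' stolen-only branches
 * become dead code */
#define stolen_initialized() (0)

static int alloc_cfb(unsigned int size)
{
	int ret = stolen_insert_node(NULL, size);

	if (ret && stolen_initialized())
		fprintf(stderr, "not enough stolen space for CFB\n");
	return ret;
}

int main(void)
{
	printf("alloc_cfb: %d\n", alloc_cfb(4096));
	return 0;
}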
diff --git a/drivers/gpu/drm/i915/display/intel_fbdev.c b/drivers/gpu/drm/i915/display/intel_fbdev.c
index 8ccdf1a964ff..176e0e44a268 100644
--- a/drivers/gpu/drm/i915/display/intel_fbdev.c
+++ b/drivers/gpu/drm/i915/display/intel_fbdev.c
@@ -41,7 +41,11 @@
 #include <drm/drm_fb_helper.h>
 #include <drm/drm_fourcc.h>
 
+#ifdef I915
 #include "gem/i915_gem_lmem.h"
+#else
+#include "xe_gt.h"
+#endif
 
 #include "i915_drv.h"
 #include "intel_display_types.h"
@@ -50,6 +54,14 @@
 #include "intel_fbdev.h"
 #include "intel_frontbuffer.h"
 
+#ifdef I915
+/*
+ * i915 requires obj->__do_not_access.base,
+ * xe uses obj->ttm.base
+ */
+#define ttm __do_not_access
+#endif
+
 struct intel_fbdev {
 	struct drm_fb_helper helper;
 	struct intel_framebuffer *fb;
@@ -147,14 +159,19 @@ static int intelfb_alloc(struct drm_fb_helper *helper,
 	mode_cmd.width = sizes->surface_width;
 	mode_cmd.height = sizes->surface_height;
 
+#ifdef I915
 	mode_cmd.pitches[0] = ALIGN(mode_cmd.width *
 				    DIV_ROUND_UP(sizes->surface_bpp, 8), 64);
+#else
+	mode_cmd.pitches[0] = ALIGN(mode_cmd.width *
+				    DIV_ROUND_UP(sizes->surface_bpp, 8), GEN8_PAGE_SIZE);
+#endif
 	mode_cmd.pixel_format = drm_mode_legacy_fb_format(sizes->surface_bpp,
 							  sizes->surface_depth);
 
 	size = mode_cmd.pitches[0] * mode_cmd.height;
 	size = PAGE_ALIGN(size);
-
+#ifdef I915
 	obj = ERR_PTR(-ENODEV);
 	if (HAS_LMEM(dev_priv)) {
 		obj = i915_gem_object_create_lmem(dev_priv, size,
@@ -170,6 +187,13 @@ static int intelfb_alloc(struct drm_fb_helper *helper,
 		if (IS_ERR(obj))
 			obj = i915_gem_object_create_shmem(dev_priv, size);
 	}
+#else
+	/* XXX: Care about stolen? */
+	obj = xe_bo_create_pin_map(dev_priv, to_gt(dev_priv), NULL, size,
+				   ttm_bo_type_kernel,
+				   XE_BO_CREATE_VRAM_IF_DGFX(to_gt(dev_priv)) |
+				   XE_BO_CREATE_PINNED_BIT | XE_BO_SCANOUT_BIT);
+#endif
 
 	if (IS_ERR(obj)) {
 		drm_err(&dev_priv->drm, "failed to allocate framebuffer (%pe)\n", obj);
@@ -177,10 +201,16 @@ static int intelfb_alloc(struct drm_fb_helper *helper,
 	}
 
 	fb = intel_framebuffer_create(obj, &mode_cmd);
-	i915_gem_object_put(obj);
-	if (IS_ERR(fb))
+	if (IS_ERR(fb)) {
+#ifdef I915
+		i915_gem_object_put(obj);
+#else
+		xe_bo_unpin_map_no_vm(obj);
+#endif
 		return PTR_ERR(fb);
+	}
 
+	drm_gem_object_put(&obj->ttm.base);
 	ifbdev->fb = to_intel_framebuffer(fb);
 	return 0;
 }
@@ -194,7 +224,6 @@ static int intelfb_create(struct drm_fb_helper *helper,
 	struct drm_device *dev = helper->dev;
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct pci_dev *pdev = to_pci_dev(dev_priv->drm.dev);
-	struct i915_ggtt *ggtt = to_gt(dev_priv)->ggtt;
 	const struct i915_gtt_view view = {
 		.type = I915_GTT_VIEW_NORMAL,
 	};
@@ -264,6 +293,7 @@ static int intelfb_create(struct drm_fb_helper *helper,
 
 	/* setup aperture base/size for vesafb takeover */
 	obj = intel_fb_obj(&intel_fb->base);
+#ifdef I915
 	if (i915_gem_object_is_lmem(obj)) {
 		struct intel_memory_region *mem = obj->mm.region;
 
@@ -276,6 +306,8 @@ static int intelfb_create(struct drm_fb_helper *helper,
 					i915_gem_object_get_dma_address(obj, 0));
 		info->fix.smem_len = obj->base.size;
 	} else {
+		struct i915_ggtt *ggtt = to_gt(dev_priv)->ggtt;
+
 		info->apertures->ranges[0].base = ggtt->gmadr.start;
 		info->apertures->ranges[0].size = ggtt->mappable_end;
 
@@ -284,8 +316,36 @@ static int intelfb_create(struct drm_fb_helper *helper,
 			(unsigned long)(ggtt->gmadr.start + i915_ggtt_offset(vma));
 		info->fix.smem_len = vma->size;
 	}
-
 	vaddr = i915_vma_pin_iomap(vma);
+
+#else
+	/* XXX: Could be pure fiction.. */
+	if (obj->flags & XE_BO_CREATE_VRAM0_BIT) {
+		struct xe_gt *gt = to_gt(dev_priv);
+		bool lmem;
+
+		info->apertures->ranges[0].base = gt->mem.vram.io_start;
+		info->apertures->ranges[0].size = gt->mem.vram.size;
+
+		info->fix.smem_start =
+			(unsigned long)(gt->mem.vram.io_start + xe_bo_addr(obj, 0, 4096, &lmem));
+		info->fix.smem_len = obj->ttm.base.size;
+
+	} else {
+		struct pci_dev *pdev = to_pci_dev(dev_priv->drm.dev);
+
+		info->apertures->ranges[0].base = pci_resource_start(pdev, 2);
+		info->apertures->ranges[0].size =
+			pci_resource_end(pdev, 2) - pci_resource_start(pdev, 2);
+
+		info->fix.smem_start = info->apertures->ranges[0].base + xe_bo_ggtt_addr(obj);
+		info->fix.smem_len = obj->ttm.base.size;
+	}
+
+	/* TODO: ttm_bo_kmap? */
+	vaddr = obj->vmap.vaddr;
+#endif
+
 	if (IS_ERR(vaddr)) {
 		drm_err(&dev_priv->drm,
 			"Failed to remap framebuffer into virtual memory (%pe)\n", vaddr);
@@ -293,7 +353,7 @@ static int intelfb_create(struct drm_fb_helper *helper,
 		goto out_unpin;
 	}
 	info->screen_base = vaddr;
-	info->screen_size = vma->size;
+	info->screen_size = obj->ttm.base.size;
 
 	drm_fb_helper_fill_info(info, &ifbdev->helper, sizes);
 
@@ -301,14 +361,23 @@ static int intelfb_create(struct drm_fb_helper *helper,
 	 * If the object is stolen however, it will be full of whatever
 	 * garbage was left in there.
 	 */
+#ifdef I915
 	if (!i915_gem_object_is_shmem(vma->obj) && !prealloc)
+#else
+	/* XXX: Check stolen bit? */
+	if (!(obj->flags & XE_BO_CREATE_SYSTEM_BIT) && !prealloc)
+#endif
 		memset_io(info->screen_base, 0, info->screen_size);
 
 	/* Use default scratch pixmap (info->pixmap.flags = FB_PIXMAP_SYSTEM) */
 
 	drm_dbg_kms(&dev_priv->drm, "allocated %dx%d fb: 0x%08x\n",
 		    ifbdev->fb->base.width, ifbdev->fb->base.height,
+#ifdef I915
 		    i915_ggtt_offset(vma));
+#else
+		    (u32)vma->node.start);
+#endif
 	ifbdev->vma = vma;
 	ifbdev->vma_flags = flags;
 
@@ -339,8 +408,17 @@ static void intel_fbdev_destroy(struct intel_fbdev *ifbdev)
 	if (ifbdev->vma)
 		intel_unpin_fb_vma(ifbdev->vma, ifbdev->vma_flags);
 
-	if (ifbdev->fb)
+	if (ifbdev->fb) {
+#ifndef I915
+		struct xe_bo *bo = intel_fb_obj(&ifbdev->fb->base);
+
+		/* Unpin our kernel fb first */
+		xe_bo_lock_no_vm(bo, NULL);
+		xe_bo_unpin(bo);
+		xe_bo_unlock_no_vm(bo);
+#endif
 		drm_framebuffer_remove(&ifbdev->fb->base);
+	}
 
 	kfree(ifbdev);
 }
@@ -387,12 +465,12 @@ static bool intel_fbdev_init_bios(struct drm_device *dev,
 			continue;
 		}
 
-		if (obj->base.size > max_size) {
+		if (obj->ttm.base.size > max_size) {
 			drm_dbg_kms(&i915->drm,
 				    "found possible fb from [PLANE:%d:%s]\n",
 				    plane->base.base.id, plane->base.name);
 			fb = to_intel_framebuffer(plane_state->uapi.fb);
-			max_size = obj->base.size;
+			max_size = obj->ttm.base.size;
 		}
 	}
 
@@ -658,8 +736,13 @@ void intel_fbdev_set_suspend(struct drm_device *dev, int state, bool synchronous
 	 * been restored from swap. If the object is stolen however, it will be
 	 * full of whatever garbage was left in there.
 	 */
+#ifdef I915
 	if (state == FBINFO_STATE_RUNNING &&
 	    !i915_gem_object_is_shmem(intel_fb_obj(&ifbdev->fb->base)))
+#else
+	if (state == FBINFO_STATE_RUNNING &&
+	    !(intel_fb_obj(&ifbdev->fb->base)->flags & XE_BO_CREATE_SYSTEM_BIT))
+#endif
 		memset_io(info->screen_base, 0, info->screen_size);
 
 	drm_fb_helper_set_suspend(&ifbdev->helper, state);
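
The reworked allocation and cleanup paths in intelfb_alloc() and
intel_fbdev_destroy() above are easy to get wrong: on success the
framebuffer holds its own reference, so only the local reference is
dropped while the pin is kept for the lifetime of the fbdev buffer; on
failure the xe build must unpin as well as unreference. A toy model of
that ownership flow (the refcounting below is a deliberately simplified
stand-in, not the kernel's):

#include <stdio.h>
#include <stdlib.h>

struct bo { int refs, pins; };

static struct bo *bo_create_pinned(void)
{
	struct bo *bo = calloc(1, sizeof(*bo));

	if (bo) {
		bo->refs = 1;	/* our local reference */
		bo->pins = 1;	/* fbdev memory must stay resident */
	}
	return bo;
}

static void bo_put(struct bo *bo)
{
	if (--bo->refs == 0)
		free(bo);
}

struct fb { struct bo *bo; };

static struct fb *fb_create(struct bo *bo)
{
	struct fb *fb = malloc(sizeof(*fb));

	if (fb) {
		bo->refs++;	/* the fb takes its own reference */
		fb->bo = bo;
	}
	return fb;
}

static struct fb *alloc_fbdev(void)
{
	struct bo *bo = bo_create_pinned();
	struct fb *fb;

	if (!bo)
		return NULL;

	fb = fb_create(bo);
	if (!fb) {
		bo->pins--;	/* failure: drop the pin... */
		bo_put(bo);	/* ...and the local reference */
		return NULL;
	}

	bo_put(bo);		/* success: keep the pin, drop only our ref */
	return fb;
}

int main(void)
{
	struct fb *fb = alloc_fbdev();

	printf("fbdev buffer %s\n", fb ? "created" : "failed");
	return 0;
}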
diff --git a/drivers/gpu/drm/i915/display/intel_gmbus.c b/drivers/gpu/drm/i915/display/intel_gmbus.c
index 0bc4f6b48e80..2d099f4c52cd 100644
--- a/drivers/gpu/drm/i915/display/intel_gmbus.c
+++ b/drivers/gpu/drm/i915/display/intel_gmbus.c
@@ -39,7 +39,7 @@
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_gmbus.h"
-#include "intel_gmbus_regs.h"
+#include "../i915/display/intel_gmbus_regs.h"
 
 struct intel_gmbus {
 	struct i2c_adapter adapter;
diff --git a/drivers/gpu/drm/i915/display/intel_lpe_audio.h b/drivers/gpu/drm/i915/display/intel_lpe_audio.h
index f848c5038714..1e236df9273b 100644
--- a/drivers/gpu/drm/i915/display/intel_lpe_audio.h
+++ b/drivers/gpu/drm/i915/display/intel_lpe_audio.h
@@ -12,11 +12,19 @@ enum pipe;
 enum port;
 struct drm_i915_private;
 
+#ifdef I915
 int  intel_lpe_audio_init(struct drm_i915_private *dev_priv);
 void intel_lpe_audio_teardown(struct drm_i915_private *dev_priv);
 void intel_lpe_audio_irq_handler(struct drm_i915_private *dev_priv);
 void intel_lpe_audio_notify(struct drm_i915_private *dev_priv,
 			    enum pipe pipe, enum port port,
 			    const void *eld, int ls_clock, bool dp_output);
+#else
+#define intel_lpe_audio_init(xe) (-ENODEV)
+#define intel_lpe_audio_teardown(xe) BUG_ON(1)
+#define intel_lpe_audio_irq_handler(xe) do { } while (0)
+#define intel_lpe_audio_notify(xe, a, b, c, d, e) do { } while (0)
+
+#endif
 
 #endif /* __INTEL_LPE_AUDIO_H__ */
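
Stubbing at the header level, as in intel_lpe_audio.h above, keeps all
call sites unchanged: for xe, init reports -ENODEV and the teardown,
notify and irq hooks collapse to nothing. The do { } while (0) form keeps
a stub statement-like, so it still demands a trailing semicolon and nests
safely in an unbraced if/else. A compilable sketch of the pattern with an
invented feature_* API:

#include <errno.h>
#include <stdio.h>

#ifdef HAVE_FEATURE
int feature_init(void *dev);
void feature_notify(void *dev, int event);
#else
/* statement-like stubs: safe in if/else without braces */
#define feature_init(dev)	(-ENODEV)
#define feature_notify(dev, e)	do { } while (0)
#endif

int main(void)
{
	if (feature_init(NULL) < 0)
		printf("feature unavailable, continuing\n");
	feature_notify(NULL, 1);
	return 0;
}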
diff --git a/drivers/gpu/drm/i915/display/intel_modeset_setup.c b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
index 96395bfbd41d..6f705acaf225 100644
--- a/drivers/gpu/drm/i915/display/intel_modeset_setup.c
+++ b/drivers/gpu/drm/i915/display/intel_modeset_setup.c
@@ -721,18 +721,21 @@ void intel_modeset_setup_hw_state(struct drm_i915_private *i915,
 
 	intel_dpll_sanitize_state(i915);
 
-	if (IS_G4X(i915)) {
+	if (DISPLAY_VER(i915) >= 9) {
+		skl_wm_get_hw_state(i915);
+		skl_wm_sanitize(i915);
+	}
+#ifdef I915
+	else if (IS_G4X(i915)) {
 		g4x_wm_get_hw_state(i915);
 		g4x_wm_sanitize(i915);
 	} else if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915)) {
 		vlv_wm_get_hw_state(i915);
 		vlv_wm_sanitize(i915);
-	} else if (DISPLAY_VER(i915) >= 9) {
-		skl_wm_get_hw_state(i915);
-		skl_wm_sanitize(i915);
 	} else if (HAS_PCH_SPLIT(i915)) {
 		ilk_wm_get_hw_state(i915);
 	}
+#endif
 
 	for_each_intel_crtc(&i915->drm, crtc) {
 		struct intel_crtc_state *crtc_state =
diff --git a/drivers/gpu/drm/i915/display/intel_opregion.c b/drivers/gpu/drm/i915/display/intel_opregion.c
index e0184745632c..057a68237efe 100644
--- a/drivers/gpu/drm/i915/display/intel_opregion.c
+++ b/drivers/gpu/drm/i915/display/intel_opregion.c
@@ -37,7 +37,7 @@
 #include "intel_backlight.h"
 #include "intel_display_types.h"
 #include "intel_opregion.h"
-#include "intel_pci_config.h"
+#include "../i915/intel_pci_config.h"
 
 #define OPREGION_HEADER_OFFSET 0
 #define OPREGION_ACPI_OFFSET   0x100
diff --git a/drivers/gpu/drm/i915/display/intel_pch_display.h b/drivers/gpu/drm/i915/display/intel_pch_display.h
index 41a63413cb3d..e8b50e9a4969 100644
--- a/drivers/gpu/drm/i915/display/intel_pch_display.h
+++ b/drivers/gpu/drm/i915/display/intel_pch_display.h
@@ -15,6 +15,7 @@ struct intel_crtc;
 struct intel_crtc_state;
 struct intel_link_m_n;
 
+#ifdef I915
 bool intel_has_pch_trancoder(struct drm_i915_private *i915,
 			     enum pipe pch_transcoder);
 enum pipe intel_crtc_pch_transcoder(struct intel_crtc *crtc);
@@ -41,5 +42,20 @@ void intel_pch_transcoder_get_m2_n2(struct intel_crtc *crtc,
 				    struct intel_link_m_n *m_n);
 
 void intel_pch_sanitize(struct drm_i915_private *i915);
+#else
+#define intel_has_pch_trancoder(xe, pipe) (xe && pipe && 0)
+#define intel_crtc_pch_transcoder(crtc) ((crtc)->pipe)
+#define ilk_pch_pre_enable(state, crtc) do { } while (0)
+#define ilk_pch_enable(state, crtc) do { } while (0)
+#define ilk_pch_disable(state, crtc) do { } while (0)
+#define ilk_pch_post_disable(state, crtc) do { } while (0)
+#define ilk_pch_get_config(crtc) do { } while (0)
+#define lpt_pch_enable(state, crtc) do { } while (0)
+#define lpt_pch_disable(state, crtc) do { } while (0)
+#define lpt_pch_get_config(crtc) do { } while (0)
+#define intel_pch_transcoder_get_m1_n1(crtc, m_n) memset((m_n), 0, sizeof(*m_n))
+#define intel_pch_transcoder_get_m2_n2(crtc, m_n) memset((m_n), 0, sizeof(*m_n))
+#define intel_pch_sanitize(xe) do { } while (0)
+#endif
 
 #endif
diff --git a/drivers/gpu/drm/i915/display/intel_pch_refclk.h b/drivers/gpu/drm/i915/display/intel_pch_refclk.h
index 9bcf56629f24..aa4f6e0b1127 100644
--- a/drivers/gpu/drm/i915/display/intel_pch_refclk.h
+++ b/drivers/gpu/drm/i915/display/intel_pch_refclk.h
@@ -11,6 +11,7 @@
 struct drm_i915_private;
 struct intel_crtc_state;
 
+#ifdef I915
 void lpt_program_iclkip(const struct intel_crtc_state *crtc_state);
 void lpt_disable_iclkip(struct drm_i915_private *dev_priv);
 int lpt_get_iclkip(struct drm_i915_private *dev_priv);
@@ -18,5 +19,12 @@ int lpt_iclkip(const struct intel_crtc_state *crtc_state);
 
 void intel_init_pch_refclk(struct drm_i915_private *dev_priv);
 void lpt_disable_clkout_dp(struct drm_i915_private *dev_priv);
+#else
+#define lpt_program_iclkip(cstate) do { } while (0)
+#define lpt_disable_iclkip(xe) do { } while (0)
+#define lpt_get_iclkip(xe) (WARN_ON(-ENODEV))
+#define intel_init_pch_refclk(xe) do { } while (0)
+#define lpt_disable_clkout_dp(xe) do { } while (0)
+#endif
 
 #endif
diff --git a/drivers/gpu/drm/i915/display/intel_pipe_crc.c b/drivers/gpu/drm/i915/display/intel_pipe_crc.c
index e9774670e3f6..a4b7a8ec3720 100644
--- a/drivers/gpu/drm/i915/display/intel_pipe_crc.c
+++ b/drivers/gpu/drm/i915/display/intel_pipe_crc.c
@@ -34,6 +34,7 @@
 #include "intel_de.h"
 #include "intel_display_types.h"
 #include "intel_pipe_crc.h"
+#include "i915_irq.h"
 
 static const char * const pipe_crc_sources[] = {
 	[INTEL_PIPE_CRC_SOURCE_NONE] = "none",
diff --git a/drivers/gpu/drm/i915/display/intel_sprite.c b/drivers/gpu/drm/i915/display/intel_sprite.c
index e6b4d24b9cd0..000561eadcc1 100644
--- a/drivers/gpu/drm/i915/display/intel_sprite.c
+++ b/drivers/gpu/drm/i915/display/intel_sprite.c
@@ -43,8 +43,10 @@
 
 #include "i915_drv.h"
 #include "i915_reg.h"
+#ifdef I915
 #include "i915_vgpu.h"
 #include "i9xx_plane.h"
+#endif
 #include "intel_atomic_plane.h"
 #include "intel_crtc.h"
 #include "intel_de.h"
@@ -112,6 +114,7 @@ int intel_plane_check_src_coordinates(struct intel_plane_state *plane_state)
 	return 0;
 }
 
+#ifdef I915
 static void i9xx_plane_linear_gamma(u16 gamma[8])
 {
 	/* The points are not evenly spaced. */
@@ -325,6 +328,7 @@ static u32 vlv_sprite_ctl_crtc(const struct intel_crtc_state *crtc_state)
 static u32 vlv_sprite_ctl(const struct intel_crtc_state *crtc_state,
 			  const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	const struct drm_framebuffer *fb = plane_state->hw.fb;
 	unsigned int rotation = plane_state->hw.rotation;
 	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
@@ -396,6 +400,9 @@ static u32 vlv_sprite_ctl(const struct intel_crtc_state *crtc_state,
 		sprctl |= SP_SOURCE_KEY;
 
 	return sprctl;
+#else
+	return 0;
+#endif
 }
 
 static void vlv_sprite_update_gamma(const struct intel_plane_state *plane_state)
@@ -447,6 +454,7 @@ vlv_sprite_update_arm(struct intel_plane *plane,
 		      const struct intel_crtc_state *crtc_state,
 		      const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	struct drm_i915_private *dev_priv = to_i915(plane->base.dev);
 	enum pipe pipe = plane->pipe;
 	enum plane_id plane_id = plane->id;
@@ -486,6 +494,7 @@ vlv_sprite_update_arm(struct intel_plane *plane,
 	intel_de_write_fw(dev_priv, SPCNTR(pipe, plane_id), sprctl);
 	intel_de_write_fw(dev_priv, SPSURF(pipe, plane_id),
 			  intel_plane_ggtt_offset(plane_state) + sprsurf_offset);
+#endif
 
 	vlv_sprite_update_clrc(plane_state);
 	vlv_sprite_update_gamma(plane_state);
@@ -711,6 +720,7 @@ static bool ivb_need_sprite_gamma(const struct intel_plane_state *plane_state)
 static u32 ivb_sprite_ctl(const struct intel_crtc_state *crtc_state,
 			  const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	struct drm_i915_private *dev_priv =
 		to_i915(plane_state->uapi.plane->dev);
 	const struct drm_framebuffer *fb = plane_state->hw.fb;
@@ -780,6 +790,9 @@ static u32 ivb_sprite_ctl(const struct intel_crtc_state *crtc_state,
 		sprctl |= SPRITE_SOURCE_KEY;
 
 	return sprctl;
+#else
+	return 0;
+#endif
 }
 
 static void ivb_sprite_linear_gamma(const struct intel_plane_state *plane_state,
@@ -1723,10 +1736,13 @@ static const struct drm_plane_funcs vlv_sprite_funcs = {
 	.format_mod_supported = vlv_sprite_format_mod_supported,
 };
 
+#endif
+
 struct intel_plane *
 intel_sprite_plane_create(struct drm_i915_private *dev_priv,
 			  enum pipe pipe, int sprite)
 {
+#ifdef I915
 	struct intel_plane *plane;
 	const struct drm_plane_funcs *plane_funcs;
 	unsigned int supported_rotations;
@@ -1846,4 +1862,9 @@ intel_sprite_plane_create(struct drm_i915_private *dev_priv,
 	intel_plane_free(plane);
 
 	return ERR_PTR(ret);
+#else
+	BUG_ON(1);
+	return ERR_PTR(-ENODEV);
+#endif
 }
+
diff --git a/drivers/gpu/drm/i915/display/intel_vbt_defs.h b/drivers/gpu/drm/i915/display/intel_vbt_defs.h
index a9f44abfc9fc..001de8fe4e64 100644
--- a/drivers/gpu/drm/i915/display/intel_vbt_defs.h
+++ b/drivers/gpu/drm/i915/display/intel_vbt_defs.h
@@ -30,7 +30,7 @@
  *
  * Please do NOT include anywhere else.
  */
-#ifndef _INTEL_BIOS_PRIVATE
+#if !defined(_INTEL_BIOS_PRIVATE) && !defined(HDRTEST)
 #error "intel_vbt_defs.h is private to intel_bios.c"
 #endif
 
diff --git a/drivers/gpu/drm/i915/display/intel_vga.c b/drivers/gpu/drm/i915/display/intel_vga.c
index a69bfcac9a94..b15dcc84ae8c 100644
--- a/drivers/gpu/drm/i915/display/intel_vga.c
+++ b/drivers/gpu/drm/i915/display/intel_vga.c
@@ -101,6 +101,7 @@ void intel_vga_reset_io_mem(struct drm_i915_private *i915)
 static int
 intel_vga_set_state(struct drm_i915_private *i915, bool enable_decode)
 {
+#ifdef I915
 	unsigned int reg = DISPLAY_VER(i915) >= 6 ? SNB_GMCH_CTRL : INTEL_GMCH_CTRL;
 	u16 gmch_ctrl;
 
@@ -123,6 +124,10 @@ intel_vga_set_state(struct drm_i915_private *i915, bool enable_decode)
 	}
 
 	return 0;
+#else
+	/* Only works on some machines because bios forgets to lock the reg. */
+	return -EIO;
+#endif
 }
 
 static unsigned int
diff --git a/drivers/gpu/drm/i915/display/skl_scaler.c b/drivers/gpu/drm/i915/display/skl_scaler.c
index d7390067b7d4..8c918734416d 100644
--- a/drivers/gpu/drm/i915/display/skl_scaler.c
+++ b/drivers/gpu/drm/i915/display/skl_scaler.c
@@ -244,6 +244,7 @@ int skl_update_scaler_plane(struct intel_crtc_state *crtc_state,
 	if (ret || plane_state->scaler_id < 0)
 		return ret;
 
+#ifdef I915
 	/* check colorkey */
 	if (plane_state->ckey.flags) {
 		drm_dbg_kms(&dev_priv->drm,
@@ -252,6 +253,7 @@ int skl_update_scaler_plane(struct intel_crtc_state *crtc_state,
 			    intel_plane->base.name);
 		return -EINVAL;
 	}
+#endif
 
 	/* Check src format */
 	switch (fb->format->format) {
diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c b/drivers/gpu/drm/i915/display/skl_universal_plane.c
index 2f5524f380b0..b08aa1a06784 100644
--- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
+++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
@@ -22,7 +22,11 @@
 #include "skl_scaler.h"
 #include "skl_universal_plane.h"
 #include "skl_watermark.h"
+#ifdef I915
 #include "pxp/intel_pxp.h"
+#else
+// TODO: pxp?
+#endif
 
 static const u32 skl_plane_formats[] = {
 	DRM_FORMAT_C8,
@@ -895,7 +899,9 @@ static u32 skl_plane_ctl(const struct intel_crtc_state *crtc_state,
 		to_i915(plane_state->uapi.plane->dev);
 	const struct drm_framebuffer *fb = plane_state->hw.fb;
 	unsigned int rotation = plane_state->hw.rotation;
+#ifdef I915
 	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
+#endif
 	u32 plane_ctl;
 
 	plane_ctl = PLANE_CTL_ENABLE;
@@ -919,10 +925,12 @@ static u32 skl_plane_ctl(const struct intel_crtc_state *crtc_state,
 		plane_ctl |= icl_plane_ctl_flip(rotation &
 						DRM_MODE_REFLECT_MASK);
 
+#ifdef I915
 	if (key->flags & I915_SET_COLORKEY_DESTINATION)
 		plane_ctl |= PLANE_CTL_KEY_ENABLE_DESTINATION;
 	else if (key->flags & I915_SET_COLORKEY_SOURCE)
 		plane_ctl |= PLANE_CTL_KEY_ENABLE_SOURCE;
+#endif
 
 	/* Wa_22012358565:adl-p */
 	if (DISPLAY_VER(dev_priv) == 13)
@@ -999,9 +1007,13 @@ static u32 skl_surf_address(const struct intel_plane_state *plane_state,
 		 * The DPT object contains only one vma, so the VMA's offset
 		 * within the DPT is always 0.
 		 */
-		drm_WARN_ON(&i915->drm, plane_state->dpt_vma->node.start);
 		drm_WARN_ON(&i915->drm, offset & 0x1fffff);
+#ifdef I915
+		drm_WARN_ON(&i915->drm, plane_state->dpt_vma->node.start);
 		return offset >> 9;
+#else
+		return 0;
+#endif
 	} else {
 		drm_WARN_ON(&i915->drm, offset & 0xfff);
 		return offset;
@@ -1044,26 +1056,35 @@ static u32 skl_plane_aux_dist(const struct intel_plane_state *plane_state,
 
 static u32 skl_plane_keyval(const struct intel_plane_state *plane_state)
 {
+#ifdef I915
 	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 
 	return key->min_value;
+#else
+	return 0;
+#endif
 }
 
 static u32 skl_plane_keymax(const struct intel_plane_state *plane_state)
 {
-	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 	u8 alpha = plane_state->hw.alpha >> 8;
+#ifdef I915
+	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 
 	return (key->max_value & 0xffffff) | PLANE_KEYMAX_ALPHA(alpha);
+#else
+	return PLANE_KEYMAX_ALPHA(alpha);
+#endif
 }
 
 static u32 skl_plane_keymsk(const struct intel_plane_state *plane_state)
 {
-	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 	u8 alpha = plane_state->hw.alpha >> 8;
-	u32 keymsk;
-
+	u32 keymsk = 0;
+#ifdef I915
+	const struct drm_intel_sprite_colorkey *key = &plane_state->ckey;
 	keymsk = key->channel_mask & 0x7ffffff;
+#endif
 	if (alpha < 0xff)
 		keymsk |= PLANE_KEYMSK_ALPHA_ENABLE;
 
@@ -1319,7 +1340,7 @@ skl_plane_async_flip(struct intel_plane *plane,
 			  skl_plane_surf(plane_state, 0));
 }
 
-static bool intel_format_is_p01x(u32 format)
+static inline bool intel_format_is_p01x(u32 format)
 {
 	switch (format) {
 	case DRM_FORMAT_P010:
@@ -1402,6 +1423,7 @@ static int skl_plane_check_fb(const struct intel_crtc_state *crtc_state,
 		return -EINVAL;
 	}
 
+#ifdef I915
 	/* Wa_1606054188:tgl,adl-s */
 	if ((IS_ALDERLAKE_S(dev_priv) || IS_TIGERLAKE(dev_priv)) &&
 	    plane_state->ckey.flags & I915_SET_COLORKEY_SOURCE &&
@@ -1410,6 +1432,7 @@ static int skl_plane_check_fb(const struct intel_crtc_state *crtc_state,
 			    "Source color keying not supported with P01x formats\n");
 		return -EINVAL;
 	}
+#endif
 
 	return 0;
 }
@@ -1847,9 +1870,14 @@ static bool skl_fb_scalable(const struct drm_framebuffer *fb)
 
 static bool bo_has_valid_encryption(struct drm_i915_gem_object *obj)
 {
+#ifdef I915
 	struct drm_i915_private *i915 = to_i915(obj->base.dev);
 
 	return intel_pxp_key_check(i915->pxp, obj, false) == 0;
+#else
+#define i915_gem_object_is_protected(x) ((x) && 0)
+	return false;
+#endif
 }
 
 static bool pxp_is_borked(struct drm_i915_gem_object *obj)
@@ -1872,7 +1900,12 @@ static int skl_plane_check(struct intel_crtc_state *crtc_state,
 		return ret;
 
 	/* use scaler when colorkey is not required */
-	if (!plane_state->ckey.flags && skl_fb_scalable(fb)) {
+#ifdef I915
+	if (!plane_state->ckey.flags && skl_fb_scalable(fb))
+#else
+	if (skl_fb_scalable(fb))
+#endif
+	{
 		min_scale = 1;
 		max_scale = skl_plane_max_scale(dev_priv, fb);
 	}
@@ -2435,11 +2468,15 @@ skl_get_initial_plane_config(struct intel_crtc *crtc,
 		fb->modifier = DRM_FORMAT_MOD_LINEAR;
 		break;
 	case PLANE_CTL_TILED_X:
+#ifdef I915
 		plane_config->tiling = I915_TILING_X;
+#endif
 		fb->modifier = I915_FORMAT_MOD_X_TILED;
 		break;
 	case PLANE_CTL_TILED_Y:
+#ifdef I915
 		plane_config->tiling = I915_TILING_Y;
+#endif
 		if (val & PLANE_CTL_RENDER_DECOMPRESSION_ENABLE)
 			if (DISPLAY_VER(dev_priv) >= 12)
 				fb->modifier = I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS;
diff --git a/drivers/gpu/drm/i915/display/skl_watermark.c b/drivers/gpu/drm/i915/display/skl_watermark.c
index e254fb21b47f..381d4f75e7c8 100644
--- a/drivers/gpu/drm/i915/display/skl_watermark.c
+++ b/drivers/gpu/drm/i915/display/skl_watermark.c
@@ -16,7 +16,7 @@
 #include "skl_watermark.h"
 
 #include "i915_drv.h"
-#include "i915_fixed.h"
+#include "../i915/i915_fixed.h"
 #include "i915_reg.h"
 #include "intel_pm.h"
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index f8eb807b56f9..3b9e20dd6039 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -7,8 +7,9 @@
 #define __INTEL_GT_REGS__
 
 #include "i915_reg_defs.h"
+#ifdef I915
 #include "display/intel_display_reg_defs.h"	/* VLV_DISPLAY_BASE */
-
+#endif
 #define MCR_REG(offset)	((const i915_mcr_reg_t){ .reg = (offset) })
 
 /*
diff --git a/drivers/gpu/drm/i915/i915_reg_defs.h b/drivers/gpu/drm/i915/i915_reg_defs.h
index be43580a6979..1e3966609844 100644
--- a/drivers/gpu/drm/i915/i915_reg_defs.h
+++ b/drivers/gpu/drm/i915/i915_reg_defs.h
@@ -132,9 +132,13 @@ typedef struct {
 
 #define _MMIO(r) ((const i915_reg_t){ .reg = (r) })
 
+#ifdef I915
 typedef struct {
 	u32 reg;
 } i915_mcr_reg_t;
+#else
+#define i915_mcr_reg_t i915_reg_t
+#endif
 
 #define INVALID_MMIO_REG _MMIO(0)
 
@@ -143,8 +147,12 @@ typedef struct {
  * simply operations on the register's offset and don't care about the MCR vs
  * non-MCR nature of the register.
  */
+#ifdef I915
 #define i915_mmio_reg_offset(r) \
 	_Generic((r), i915_reg_t: (r).reg, i915_mcr_reg_t: (r).reg)
+#else
+#define i915_mmio_reg_offset(r) ((r).reg)
+#endif
 #define i915_mmio_reg_equal(a, b) (i915_mmio_reg_offset(a) == i915_mmio_reg_offset(b))
 #define i915_mmio_reg_valid(r) (!i915_mmio_reg_equal(r, INVALID_MMIO_REG))
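
i915 keeps multicast (MCR) registers in a distinct type so that passing
one to a non-MCR accessor fails to compile, and uses C11 _Generic so that
i915_mmio_reg_offset() accepts either type; xe collapses the two typedefs
into one, making the plain field access sufficient. A standalone sketch of
the _Generic dispatch, assuming simplified typedefs:

#include <stdio.h>

typedef struct { unsigned int reg; } reg_t;
typedef struct { unsigned int reg; } mcr_reg_t;

/* accept either register type; anything else fails to compile */
#define reg_offset(r) \
	_Generic((r), reg_t: (r).reg, mcr_reg_t: (r).reg)

int main(void)
{
	reg_t a = { 0x1000 };
	mcr_reg_t b = { 0x2000 };

	printf("%#x %#x\n", reg_offset(a), reg_offset(b));
	return 0;
}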
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC PATCH 19/20] sound/hda: Allow XE as i915 replacement for sound
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Xe is a new driver for Intel GPUs that supports both integrated
and discrete platforms starting with Tiger Lake. Let's ensure
sound can accept xe instead of i915 whenever it is in use.

Cc: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 sound/hda/hdac_i915.c      | 17 +++---------
 sound/pci/hda/hda_intel.c  | 56 ++++++++++++++++++++------------------
 sound/soc/intel/avs/core.c | 13 ++++++---
 sound/soc/sof/intel/hda.c  |  7 +++--
 4 files changed, 48 insertions(+), 45 deletions(-)

diff --git a/sound/hda/hdac_i915.c b/sound/hda/hdac_i915.c
index 161a9711cd63..39c548b34fcd 100644
--- a/sound/hda/hdac_i915.c
+++ b/sound/hda/hdac_i915.c
@@ -108,7 +108,8 @@ static int i915_component_master_match(struct device *dev, int subcomponent,
 	hdac_pci = to_pci_dev(bus->dev);
 	i915_pci = to_pci_dev(dev);
 
-	if (!strcmp(dev->driver->name, "i915") &&
+	if ((!strcmp(dev->driver->name, "i915") ||
+	     !strcmp(dev->driver->name, "xe")) &&
 	    subcomponent == I915_COMPONENT_AUDIO &&
 	    connectivity_check(i915_pci, hdac_pci))
 		return 1;
@@ -159,20 +160,10 @@ int snd_hdac_i915_init(struct hdac_bus *bus)
 	if (err < 0)
 		return err;
 	acomp = bus->audio_component;
-	if (!acomp)
-		return -ENODEV;
-	if (!acomp->ops) {
-		if (!IS_ENABLED(CONFIG_MODULES) ||
-		    !request_module("i915")) {
-			/* 60s timeout */
-			wait_for_completion_killable_timeout(&acomp->master_bind_complete,
-							     msecs_to_jiffies(60 * 1000));
-		}
-	}
-	if (!acomp->ops) {
+	if (!acomp || !acomp->ops) {
 		dev_info(bus->dev, "couldn't bind with audio component\n");
 		snd_hdac_acomp_exit(bus);
-		return -ENODEV;
+		return -EPROBE_DEFER;
 	}
 	return 0;
 }
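
Returning -EPROBE_DEFER instead of -ENODEV turns the failure from
permanent into retriable: the callers now propagate it out of their PCI
probe, and the driver core re-runs that probe once more drivers (here the
i915 or xe component master) have bound, which is what lets the old
60-second request_module/wait dance be deleted. A crude userspace model of
that retry contract (EPROBE_DEFER's numeric value is the kernel's internal
one, defined here only for the simulation):

#include <stdbool.h>
#include <stdio.h>

#define EPROBE_DEFER 517	/* kernel-internal errno */

static bool gpu_bound;

static int audio_probe(void)
{
	/* stand-in for snd_hdac_i915_init(): only succeeds once the
	 * i915/xe component master has bound */
	return gpu_bound ? 0 : -EPROBE_DEFER;
}

int main(void)
{
	/* crude model of the driver core's deferred-probe retry */
	if (audio_probe() == -EPROBE_DEFER)
		printf("probe deferred, will be retried\n");

	gpu_bound = true;	/* GPU driver finishes binding */

	if (audio_probe() == 0)
		printf("probe succeeded on retry\n");
	return 0;
}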
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 87002670c0c9..481887903f75 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -209,6 +209,7 @@ MODULE_DESCRIPTION("Intel HDA driver");
 #endif
 #endif
 
+static DECLARE_BITMAP(probed_devs, SNDRV_CARDS);
 
 /*
  */
@@ -1829,6 +1830,35 @@ static int azx_create(struct snd_card *card, struct pci_dev *pci,
 	/* continue probing in work context as may trigger request module */
 	INIT_DELAYED_WORK(&hda->probe_work, azx_probe_work);
 
+	/* bind with i915 if needed */
+	if (chip->driver_caps & AZX_DCAPS_I915_COMPONENT) {
+		err = snd_hdac_i915_init(azx_bus(chip));
+		if (err < 0) {
+			/* if the controller is bound only with HDMI/DP
+			 * (for HSW and BDW), we need to abort the probe;
+			 * for other chips, still continue probing as other
+			 * codecs can be on the same link.
+			 */
+			if (CONTROLLER_IN_GPU(pci)) {
+				if (err != -EPROBE_DEFER)
+					dev_err(card->dev,
+						"HSW/BDW HD-audio HDMI/DP requires binding with gfx driver\n");
+
+				clear_bit(chip->dev_index, probed_devs);
+				pci_set_drvdata(pci, NULL);
+				snd_device_free(card, chip);
+				return err;
+			} else {
+				/* don't bother any longer */
+				chip->driver_caps &= ~AZX_DCAPS_I915_COMPONENT;
+			}
+		}
+
+		/* HSW/BDW controllers need this power */
+		if (CONTROLLER_IN_GPU(pci))
+			hda->need_i915_power = true;
+	}
+
 	*rchip = chip;
 
 	return 0;
@@ -2059,8 +2089,6 @@ static const struct hda_controller_ops pci_hda_ops = {
 	.position_check = azx_position_check,
 };
 
-static DECLARE_BITMAP(probed_devs, SNDRV_CARDS);
-
 static int azx_probe(struct pci_dev *pci,
 		     const struct pci_device_id *pci_id)
 {
@@ -2239,30 +2267,6 @@ static int azx_probe_continue(struct azx *chip)
 	to_hda_bus(bus)->bus_probing = 1;
 	hda->probe_continued = 1;
 
-	/* bind with i915 if needed */
-	if (chip->driver_caps & AZX_DCAPS_I915_COMPONENT) {
-		err = snd_hdac_i915_init(bus);
-		if (err < 0) {
-			/* if the controller is bound only with HDMI/DP
-			 * (for HSW and BDW), we need to abort the probe;
-			 * for other chips, still continue probing as other
-			 * codecs can be on the same link.
-			 */
-			if (CONTROLLER_IN_GPU(pci)) {
-				dev_err(chip->card->dev,
-					"HSW/BDW HD-audio HDMI/DP requires binding with gfx driver\n");
-				goto out_free;
-			} else {
-				/* don't bother any longer */
-				chip->driver_caps &= ~AZX_DCAPS_I915_COMPONENT;
-			}
-		}
-
-		/* HSW/BDW controllers need this power */
-		if (CONTROLLER_IN_GPU(pci))
-			hda->need_i915_power = true;
-	}
-
 	/* Request display power well for the HDA controller or codec. For
 	 * Haswell/Broadwell, both the display HDA controller and codec need
 	 * this power. For other platforms, like Baytrail/Braswell, only the
diff --git a/sound/soc/intel/avs/core.c b/sound/soc/intel/avs/core.c
index bb0719c58ca4..353aa40eb6dc 100644
--- a/sound/soc/intel/avs/core.c
+++ b/sound/soc/intel/avs/core.c
@@ -187,10 +187,6 @@ static void avs_hda_probe_work(struct work_struct *work)
 
 	pm_runtime_set_active(bus->dev); /* clear runtime_error flag */
 
-	ret = snd_hdac_i915_init(bus);
-	if (ret < 0)
-		dev_info(bus->dev, "i915 init unsuccessful: %d\n", ret);
-
 	snd_hdac_display_power(bus, HDA_CODEC_IDX_CONTROLLER, true);
 	avs_hdac_bus_init_chip(bus, true);
 	avs_hdac_bus_probe_codecs(bus);
@@ -460,10 +456,19 @@ static int avs_pci_probe(struct pci_dev *pci, const struct pci_device_id *id)
 	pci_set_drvdata(pci, bus);
 	device_disable_async_suspend(dev);
 
+	ret = snd_hdac_i915_init(bus);
+	if (ret == -EPROBE_DEFER)
+		goto err_unmaster;
+	else if (ret < 0)
+		dev_info(bus->dev, "i915 init unsuccessful: %d\n", ret);
+
 	schedule_work(&adev->probe_work);
 
 	return 0;
 
+err_unmaster:
+	pci_clear_master(pci);
+	pci_set_drvdata(pci, NULL);
 err_acquire_irq:
 	snd_hdac_bus_free_stream_pages(bus);
 	snd_hdac_ext_stream_free_all(bus);
diff --git a/sound/soc/sof/intel/hda.c b/sound/soc/sof/intel/hda.c
index 1188ec51816b..671291838b8e 100644
--- a/sound/soc/sof/intel/hda.c
+++ b/sound/soc/sof/intel/hda.c
@@ -719,8 +719,11 @@ static int hda_init(struct snd_sof_dev *sdev)
 
 	/* init i915 and HDMI codecs */
 	ret = hda_codec_i915_init(sdev);
-	if (ret < 0)
-		dev_warn(sdev->dev, "init of i915 and HDMI codec failed\n");
+	if (ret < 0) {
+		if (ret != -EPROBE_DEFER)
+			dev_warn(sdev->dev, "init of i915 and HDMI codec failed: %i\n", ret);
+		return ret;
+	}
 
 	/* get controller capabilities */
 	ret = hda_dsp_ctrl_get_caps(sdev);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC PATCH 20/20] mei/hdcp: Also enable for XE
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2022-12-22 22:21   ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-22 22:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Xe is a new driver for Intel GPUs that supports both integrated
and discrete platforms starting with Tiger Lake. Let's ensure
mei/hdcp can accept xe instead of i915 whenever it is in use.

Cc: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/misc/mei/hdcp/Kconfig    | 2 +-
 drivers/misc/mei/hdcp/mei_hdcp.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/mei/hdcp/Kconfig b/drivers/misc/mei/hdcp/Kconfig
index 54e1c9526909..2ac9148988d4 100644
--- a/drivers/misc/mei/hdcp/Kconfig
+++ b/drivers/misc/mei/hdcp/Kconfig
@@ -4,7 +4,7 @@
 config INTEL_MEI_HDCP
 	tristate "Intel HDCP2.2 services of ME Interface"
 	select INTEL_MEI_ME
-	depends on DRM_I915
+	depends on DRM_I915 || DRM_XE
 	help
 	  MEI Support for HDCP2.2 Services on Intel platforms.
 
diff --git a/drivers/misc/mei/hdcp/mei_hdcp.c b/drivers/misc/mei/hdcp/mei_hdcp.c
index e889a8bd7ac8..699dfc4a0a57 100644
--- a/drivers/misc/mei/hdcp/mei_hdcp.c
+++ b/drivers/misc/mei/hdcp/mei_hdcp.c
@@ -784,7 +784,8 @@ static int mei_hdcp_component_match(struct device *dev, int subcomponent,
 {
 	struct device *base = data;
 
-	if (!dev->driver || strcmp(dev->driver->name, "i915") ||
+	if (!dev->driver ||
+	    (strcmp(dev->driver->name, "i915") && strcmp(dev->driver->name, "xe")) ||
 	    subcomponent != I915_COMPONENT_HDCP)
 		return 0;
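
Matching on the driver name string is how the component framework pairs
the MEI HDCP device with whichever GPU driver bound the hardware, so
supporting xe needs no new infrastructure, just a second accepted name. A
runnable sketch of the name check in isolation (the surrounding component
API is elided):

#include <stdio.h>
#include <string.h>

/* accept either GPU driver by name, as mei_hdcp_component_match()
 * now does */
static int gfx_driver_match(const char *name)
{
	return !strcmp(name, "i915") || !strcmp(name, "xe");
}

int main(void)
{
	printf("i915: %d, xe: %d, other: %d\n",
	       gfx_driver_match("i915"),
	       gfx_driver_match("xe"),
	       gfx_driver_match("nouveau"));
	return 0;
}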
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for Initial Xe driver submission
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
                   ` (20 preceding siblings ...)
  (?)
@ 2022-12-22 22:41 ` Patchwork
  -1 siblings, 0 replies; 161+ messages in thread
From: Patchwork @ 2022-12-22 22:41 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx

== Series Details ==

Series: Initial Xe driver submission
URL   : https://patchwork.freedesktop.org/series/112189/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/112189/revisions/1/mbox/ not applied
Applying: drm/suballoc: Introduce a generic suballocation manager
Applying: drm/amd: Convert amdgpu to use suballocation helper.
Applying: drm/radeon: Use the drm suballocation manager implementation.
Applying: drm/sched: Convert drm scheduler to use a work queue rather than kthread
Applying: drm/sched: Add generic scheduler message interface
Applying: drm/sched: Start run wq before TDR in drm_sched_start
Applying: drm/sched: Submit job before starting TDR
Applying: drm/sched: Add helper to set TDR timeout
Applying: drm: Add a gpu page-table walker helper
Applying: drm/ttm: Don't print error message if eviction was interrupted
Applying: drm/i915: Remove gem and overlay frontbuffer tracking
Applying: drm/i915/display: Neuter frontbuffer tracking harder
Applying: drm/i915/display: Add more macros to remove all direct calls to uncore
Applying: drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
Applying: drm/i915: Rename find_section to find_bdb_section
Applying: drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
Applying: drm/i915/display: Fix a use-after-free when intel_edp_init_connector fails
Applying: drm/i915/display: Remaining changes to make xe compile
Applying: sound/hda: Allow XE as i915 replacement for sound
Using index info to reconstruct a base tree...
M	sound/hda/hdac_i915.c
M	sound/pci/hda/hda_intel.c
Falling back to patching base and 3-way merge...
Auto-merging sound/pci/hda/hda_intel.c
CONFLICT (content): Merge conflict in sound/pci/hda/hda_intel.c
Auto-merging sound/hda/hdac_i915.c
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0019 sound/hda: Allow XE as i915 replacement for sound
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 11/20] drm/i915: Remove gem and overlay frontbuffer tracking
  2022-12-22 22:21   ` [Intel-gfx] " Matthew Brost
  (?)
@ 2022-12-23 11:13   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2022-12-23 11:13 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel


On 22/12/2022 22:21, Matthew Brost wrote:
> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> 
> Frontbuffer update handling should be done explicitly by using dirtyfb
> calls only.

A bit terse - questions around whether this breaks something, and if it
does, what breaks and why that is okay, were left hanging in the air in
the previous thread (a6cdde0b-47a1-967d-f2c4-9299618cb1fb@linux.intel.com).
Can that be discussed please?

Plus, this does not appear to be either DRM core or Xe work, so maybe it
doesn't need to be in this series to start with?!
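
For reference, the explicit path the commit message mandates is the
DIRTYFB ioctl. A minimal userspace sketch using libdrm - fd and fb_id
here are assumed to be an open DRM device and the id of a framebuffer
currently scanned out as a frontbuffer:

	#include <stdint.h>
	#include <xf86drmMode.h>

	/* Flush CPU rendering to a frontbuffer explicitly: mark the
	 * whole framebuffer dirty instead of relying on kernel-side
	 * frontbuffer write tracking. */
	static int flush_frontbuffer(int fd, uint32_t fb_id)
	{
		/* NULL clips with num_clips == 0 requests a
		 * full-surface flush via DRM_IOCTL_MODE_DIRTYFB */
		return drmModeDirtyFB(fd, fb_id, NULL, 0);
	}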

Regards,

Tvrtko

> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/display/i9xx_plane.c     |  1 +
>   drivers/gpu/drm/i915/display/intel_drrs.c     |  1 +
>   drivers/gpu/drm/i915/display/intel_fb.c       |  1 +
>   drivers/gpu/drm/i915/display/intel_overlay.c  | 14 -----------
>   .../drm/i915/display/intel_plane_initial.c    |  1 +
>   drivers/gpu/drm/i915/display/intel_psr.c      |  1 +
>   .../drm/i915/display/skl_universal_plane.c    |  1 +
>   drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |  4 ---
>   drivers/gpu/drm/i915/gem/i915_gem_domain.c    |  7 ------
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  2 --
>   drivers/gpu/drm/i915/gem/i915_gem_object.c    | 25 -------------------
>   drivers/gpu/drm/i915/gem/i915_gem_object.h    | 22 ----------------
>   drivers/gpu/drm/i915/gem/i915_gem_phys.c      |  4 ---
>   drivers/gpu/drm/i915/i915_driver.c            |  1 +
>   drivers/gpu/drm/i915/i915_gem.c               |  8 ------
>   drivers/gpu/drm/i915/i915_gem_gtt.c           |  1 -
>   drivers/gpu/drm/i915/i915_vma.c               | 12 ---------
>   17 files changed, 7 insertions(+), 99 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/display/i9xx_plane.c b/drivers/gpu/drm/i915/display/i9xx_plane.c
> index ecaeb7dc196b..633e462d96a0 100644
> --- a/drivers/gpu/drm/i915/display/i9xx_plane.c
> +++ b/drivers/gpu/drm/i915/display/i9xx_plane.c
> @@ -17,6 +17,7 @@
>   #include "intel_display_types.h"
>   #include "intel_fb.h"
>   #include "intel_fbc.h"
> +#include "intel_frontbuffer.h"
>   #include "intel_sprite.h"
>   
>   /* Primary plane formats for gen <= 3 */
> diff --git a/drivers/gpu/drm/i915/display/intel_drrs.c b/drivers/gpu/drm/i915/display/intel_drrs.c
> index 5b9e44443814..3503d112387d 100644
> --- a/drivers/gpu/drm/i915/display/intel_drrs.c
> +++ b/drivers/gpu/drm/i915/display/intel_drrs.c
> @@ -9,6 +9,7 @@
>   #include "intel_de.h"
>   #include "intel_display_types.h"
>   #include "intel_drrs.h"
> +#include "intel_frontbuffer.h"
>   #include "intel_panel.h"
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/display/intel_fb.c b/drivers/gpu/drm/i915/display/intel_fb.c
> index 63137ae5ab21..7cf31c87884c 100644
> --- a/drivers/gpu/drm/i915/display/intel_fb.c
> +++ b/drivers/gpu/drm/i915/display/intel_fb.c
> @@ -12,6 +12,7 @@
>   #include "intel_display_types.h"
>   #include "intel_dpt.h"
>   #include "intel_fb.h"
> +#include "intel_frontbuffer.h"
>   
>   #define check_array_bounds(i915, a, i) drm_WARN_ON(&(i915)->drm, (i) >= ARRAY_SIZE(a))
>   
> diff --git a/drivers/gpu/drm/i915/display/intel_overlay.c b/drivers/gpu/drm/i915/display/intel_overlay.c
> index c12bdca8da9b..5b86563ce577 100644
> --- a/drivers/gpu/drm/i915/display/intel_overlay.c
> +++ b/drivers/gpu/drm/i915/display/intel_overlay.c
> @@ -186,7 +186,6 @@ struct intel_overlay {
>   	struct intel_crtc *crtc;
>   	struct i915_vma *vma;
>   	struct i915_vma *old_vma;
> -	struct intel_frontbuffer *frontbuffer;
>   	bool active;
>   	bool pfit_active;
>   	u32 pfit_vscale_ratio; /* shifted-point number, (1<<12) == 1.0 */
> @@ -287,20 +286,9 @@ static void intel_overlay_flip_prepare(struct intel_overlay *overlay,
>   				       struct i915_vma *vma)
>   {
>   	enum pipe pipe = overlay->crtc->pipe;
> -	struct intel_frontbuffer *frontbuffer = NULL;
>   
>   	drm_WARN_ON(&overlay->i915->drm, overlay->old_vma);
>   
> -	if (vma)
> -		frontbuffer = intel_frontbuffer_get(vma->obj);
> -
> -	intel_frontbuffer_track(overlay->frontbuffer, frontbuffer,
> -				INTEL_FRONTBUFFER_OVERLAY(pipe));
> -
> -	if (overlay->frontbuffer)
> -		intel_frontbuffer_put(overlay->frontbuffer);
> -	overlay->frontbuffer = frontbuffer;
> -
>   	intel_frontbuffer_flip_prepare(overlay->i915,
>   				       INTEL_FRONTBUFFER_OVERLAY(pipe));
>   
> @@ -810,8 +798,6 @@ static int intel_overlay_do_put_image(struct intel_overlay *overlay,
>   		goto out_pin_section;
>   	}
>   
> -	i915_gem_object_flush_frontbuffer(new_bo, ORIGIN_DIRTYFB);
> -
>   	if (!overlay->active) {
>   		const struct intel_crtc_state *crtc_state =
>   			overlay->crtc->config;
> diff --git a/drivers/gpu/drm/i915/display/intel_plane_initial.c b/drivers/gpu/drm/i915/display/intel_plane_initial.c
> index 76be796df255..cad9c8884af3 100644
> --- a/drivers/gpu/drm/i915/display/intel_plane_initial.c
> +++ b/drivers/gpu/drm/i915/display/intel_plane_initial.c
> @@ -9,6 +9,7 @@
>   #include "intel_display.h"
>   #include "intel_display_types.h"
>   #include "intel_fb.h"
> +#include "intel_frontbuffer.h"
>   #include "intel_plane_initial.h"
>   
>   static bool
> diff --git a/drivers/gpu/drm/i915/display/intel_psr.c b/drivers/gpu/drm/i915/display/intel_psr.c
> index 9820e5fdd087..bc998b526d88 100644
> --- a/drivers/gpu/drm/i915/display/intel_psr.c
> +++ b/drivers/gpu/drm/i915/display/intel_psr.c
> @@ -33,6 +33,7 @@
>   #include "intel_de.h"
>   #include "intel_display_types.h"
>   #include "intel_dp_aux.h"
> +#include "intel_frontbuffer.h"
>   #include "intel_hdmi.h"
>   #include "intel_psr.h"
>   #include "intel_snps_phy.h"
> diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c b/drivers/gpu/drm/i915/display/skl_universal_plane.c
> index 4b79c2d2d617..2f5524f380b0 100644
> --- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
> +++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
> @@ -16,6 +16,7 @@
>   #include "intel_display_types.h"
>   #include "intel_fb.h"
>   #include "intel_fbc.h"
> +#include "intel_frontbuffer.h"
>   #include "intel_psr.h"
>   #include "intel_sprite.h"
>   #include "skl_scaler.h"
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
> index b3b398fe689c..df2db78b10ca 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
> @@ -6,8 +6,6 @@
>   
>   #include <drm/drm_cache.h>
>   
> -#include "display/intel_frontbuffer.h"
> -
>   #include "i915_drv.h"
>   #include "i915_gem_clflush.h"
>   #include "i915_sw_fence_work.h"
> @@ -22,8 +20,6 @@ static void __do_clflush(struct drm_i915_gem_object *obj)
>   {
>   	GEM_BUG_ON(!i915_gem_object_has_pages(obj));
>   	drm_clflush_sg(obj->mm.pages);
> -
> -	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
>   }
>   
>   static void clflush_work(struct dma_fence_work *base)
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_domain.c b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
> index 9969e687ad85..cd5505da4884 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_domain.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_domain.c
> @@ -4,7 +4,6 @@
>    * Copyright © 2014-2016 Intel Corporation
>    */
>   
> -#include "display/intel_frontbuffer.h"
>   #include "gt/intel_gt.h"
>   
>   #include "i915_drv.h"
> @@ -65,8 +64,6 @@ flush_write_domain(struct drm_i915_gem_object *obj, unsigned int flush_domains)
>   				intel_gt_flush_ggtt_writes(vma->vm->gt);
>   		}
>   		spin_unlock(&obj->vma.lock);
> -
> -		i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
>   		break;
>   
>   	case I915_GEM_DOMAIN_WC:
> @@ -629,9 +626,6 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
>   out_unlock:
>   	i915_gem_object_unlock(obj);
>   
> -	if (!err && write_domain)
> -		i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
> -
>   out:
>   	i915_gem_object_put(obj);
>   	return err;
> @@ -742,7 +736,6 @@ int i915_gem_object_prepare_write(struct drm_i915_gem_object *obj,
>   	}
>   
>   out:
> -	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
>   	obj->mm.dirty = true;
>   	/* return with the pages pinned */
>   	return 0;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> index f98600ca7557..08f84d4f4f92 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> @@ -11,8 +11,6 @@
>   
>   #include <drm/drm_syncobj.h>
>   
> -#include "display/intel_frontbuffer.h"
> -
>   #include "gem/i915_gem_ioctls.h"
>   #include "gt/intel_context.h"
>   #include "gt/intel_gpu_commands.h"
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> index 1a0886b8aaa1..d2fef38cd12e 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
> @@ -27,7 +27,6 @@
>   
>   #include <drm/drm_cache.h>
>   
> -#include "display/intel_frontbuffer.h"
>   #include "pxp/intel_pxp.h"
>   
>   #include "i915_drv.h"
> @@ -400,30 +399,6 @@ static void i915_gem_free_object(struct drm_gem_object *gem_obj)
>   		queue_work(i915->wq, &i915->mm.free_work);
>   }
>   
> -void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
> -					 enum fb_op_origin origin)
> -{
> -	struct intel_frontbuffer *front;
> -
> -	front = __intel_frontbuffer_get(obj);
> -	if (front) {
> -		intel_frontbuffer_flush(front, origin);
> -		intel_frontbuffer_put(front);
> -	}
> -}
> -
> -void __i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
> -					      enum fb_op_origin origin)
> -{
> -	struct intel_frontbuffer *front;
> -
> -	front = __intel_frontbuffer_get(obj);
> -	if (front) {
> -		intel_frontbuffer_invalidate(front, origin);
> -		intel_frontbuffer_put(front);
> -	}
> -}
> -
>   static void
>   i915_gem_object_read_from_page_kmap(struct drm_i915_gem_object *obj, u64 offset, void *dst, int size)
>   {
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> index 3db53769864c..90dba761889c 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
> @@ -11,7 +11,6 @@
>   #include <drm/drm_file.h>
>   #include <drm/drm_device.h>
>   
> -#include "display/intel_frontbuffer.h"
>   #include "intel_memory_region.h"
>   #include "i915_gem_object_types.h"
>   #include "i915_gem_gtt.h"
> @@ -573,27 +572,6 @@ int i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
>   				  unsigned int flags,
>   				  const struct i915_sched_attr *attr);
>   
> -void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
> -					 enum fb_op_origin origin);
> -void __i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
> -					      enum fb_op_origin origin);
> -
> -static inline void
> -i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
> -				  enum fb_op_origin origin)
> -{
> -	if (unlikely(rcu_access_pointer(obj->frontbuffer)))
> -		__i915_gem_object_flush_frontbuffer(obj, origin);
> -}
> -
> -static inline void
> -i915_gem_object_invalidate_frontbuffer(struct drm_i915_gem_object *obj,
> -				       enum fb_op_origin origin)
> -{
> -	if (unlikely(rcu_access_pointer(obj->frontbuffer)))
> -		__i915_gem_object_invalidate_frontbuffer(obj, origin);
> -}
> -
>   int i915_gem_object_read_from_page(struct drm_i915_gem_object *obj, u64 offset, void *dst, int size);
>   
>   bool i915_gem_object_is_shmem(const struct drm_i915_gem_object *obj);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_phys.c b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
> index 68453572275b..4cf57676e180 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_phys.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_phys.c
> @@ -156,15 +156,11 @@ int i915_gem_object_pwrite_phys(struct drm_i915_gem_object *obj,
>   	 * We manually control the domain here and pretend that it
>   	 * remains coherent i.e. in the GTT domain, like shmem_pwrite.
>   	 */
> -	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
> -
>   	if (copy_from_user(vaddr, user_data, args->size))
>   		return -EFAULT;
>   
>   	drm_clflush_virt_range(vaddr, args->size);
>   	intel_gt_chipset_flush(to_gt(i915));
> -
> -	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
>   	return 0;
>   }
>   
> diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
> index c1e427ba57ae..f4201f9c5f84 100644
> --- a/drivers/gpu/drm/i915/i915_driver.c
> +++ b/drivers/gpu/drm/i915/i915_driver.c
> @@ -346,6 +346,7 @@ static int i915_driver_early_probe(struct drm_i915_private *dev_priv)
>   
>   	spin_lock_init(&dev_priv->irq_lock);
>   	spin_lock_init(&dev_priv->gpu_error.lock);
> +	spin_lock_init(&dev_priv->display.fb_tracking.lock);
>   	mutex_init(&dev_priv->display.backlight.lock);
>   
>   	mutex_init(&dev_priv->sb_lock);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 969581e7106f..594891291735 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -40,7 +40,6 @@
>   #include <drm/drm_vma_manager.h>
>   
>   #include "display/intel_display.h"
> -#include "display/intel_frontbuffer.h"
>   
>   #include "gem/i915_gem_clflush.h"
>   #include "gem/i915_gem_context.h"
> @@ -569,8 +568,6 @@ i915_gem_gtt_pwrite_fast(struct drm_i915_gem_object *obj,
>   		goto out_rpm;
>   	}
>   
> -	i915_gem_object_invalidate_frontbuffer(obj, ORIGIN_CPU);
> -
>   	user_data = u64_to_user_ptr(args->data_ptr);
>   	offset = args->offset;
>   	remain = args->size;
> @@ -613,7 +610,6 @@ i915_gem_gtt_pwrite_fast(struct drm_i915_gem_object *obj,
>   	}
>   
>   	intel_gt_flush_ggtt_writes(ggtt->vm.gt);
> -	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
>   
>   	i915_gem_gtt_cleanup(obj, &node, vma);
>   out_rpm:
> @@ -700,8 +696,6 @@ i915_gem_shmem_pwrite(struct drm_i915_gem_object *obj,
>   		offset = 0;
>   	}
>   
> -	i915_gem_object_flush_frontbuffer(obj, ORIGIN_CPU);
> -
>   	i915_gem_object_unpin_pages(obj);
>   	return ret;
>   
> @@ -1272,8 +1266,6 @@ void i915_gem_init_early(struct drm_i915_private *dev_priv)
>   {
>   	i915_gem_init__mm(dev_priv);
>   	i915_gem_init__contexts(dev_priv);
> -
> -	spin_lock_init(&dev_priv->display.fb_tracking.lock);
>   }
>   
>   void i915_gem_cleanup_early(struct drm_i915_private *dev_priv)
> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> index 7bd1861ddbdf..a9662cc6ed1e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> @@ -15,7 +15,6 @@
>   #include <asm/set_memory.h>
>   #include <asm/smp.h>
>   
> -#include "display/intel_frontbuffer.h"
>   #include "gt/intel_gt.h"
>   #include "gt/intel_gt_requests.h"
>   
> diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> index 7d044888ac33..e3b73175805b 100644
> --- a/drivers/gpu/drm/i915/i915_vma.c
> +++ b/drivers/gpu/drm/i915/i915_vma.c
> @@ -26,7 +26,6 @@
>   #include <linux/dma-fence-array.h>
>   #include <drm/drm_gem.h>
>   
> -#include "display/intel_frontbuffer.h"
>   #include "gem/i915_gem_lmem.h"
>   #include "gem/i915_gem_tiling.h"
>   #include "gt/intel_engine.h"
> @@ -1901,17 +1900,6 @@ int _i915_vma_move_to_active(struct i915_vma *vma,
>   			return err;
>   	}
>   
> -	if (flags & EXEC_OBJECT_WRITE) {
> -		struct intel_frontbuffer *front;
> -
> -		front = __intel_frontbuffer_get(obj);
> -		if (unlikely(front)) {
> -			if (intel_frontbuffer_invalidate(front, ORIGIN_CS))
> -				i915_active_add_request(&front->write, rq);
> -			intel_frontbuffer_put(front);
> -		}
> -	}
> -
>   	if (fence) {
>   		struct dma_fence *curr;
>   		enum dma_resv_usage usage;

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2022-12-22 22:21   ` Matthew Brost
  (?)
@ 2022-12-23 17:42   ` Rob Clark
  2022-12-28 22:21     ` Matthew Brost
  -1 siblings, 1 reply; 161+ messages in thread
From: Rob Clark @ 2022-12-23 17:42 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Thu, Dec 22, 2022 at 2:29 PM Matthew Brost <matthew.brost@intel.com> wrote:
>
> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> mapping between a drm_gpu_scheduler and a drm_sched_entity. At first this
> seems a bit odd, but let us explain the reasoning below.
>
> 1. In XE the submission order from multiple drm_sched_entity instances is
> not guaranteed to match completion order, even when targeting the same
> hardware engine. This is because in XE we have a firmware scheduler, the
> GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> TDR falls apart, as the TDR expects submission order == completion order.
> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> problem.
>
> 2. In XE submissions are done via programming a ring buffer (circular
> buffer). A drm_gpu_scheduler provides a limit on the number of in-flight
> jobs; if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
> control on the ring for free.
>
> A problem with this design is that currently a drm_gpu_scheduler uses a
> kthread for submission / job cleanup. This doesn't scale if a large
> number of drm_gpu_schedulers are used. To work around the scaling issue,
> use a worker rather than a kthread for submission / job cleanup.

You might want to enable CONFIG_DRM_MSM in your kconfig; I think you
missed a part

BR,
-R

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  12 +-
>  drivers/gpu/drm/scheduler/sched_main.c      | 124 ++++++++++++--------
>  include/drm/gpu_scheduler.h                 |  13 +-
>  4 files changed, 93 insertions(+), 70 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index f60753f97ac5..9c2a10aeb0b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>         for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>                 struct amdgpu_ring *ring = adev->rings[i];
>
> -               if (!ring || !ring->sched.thread)
> +               if (!ring || !ring->sched.ready)
>                         continue;
> -               kthread_park(ring->sched.thread);
> +               drm_sched_run_wq_stop(&ring->sched);
>         }
>
>         seq_printf(m, "run ib test:\n");
> @@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>         for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>                 struct amdgpu_ring *ring = adev->rings[i];
>
> -               if (!ring || !ring->sched.thread)
> +               if (!ring || !ring->sched.ready)
>                         continue;
> -               kthread_unpark(ring->sched.thread);
> +               drm_sched_run_wq_start(&ring->sched);
>         }
>
>         up_write(&adev->reset_domain->sem);
> @@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
>
>         ring = adev->rings[val];
>
> -       if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
> +       if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
>                 return -EINVAL;
>
>         /* the last preemption failed */
> @@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
>                 goto pro_end;
>
>         /* stop the scheduler */
> -       kthread_park(ring->sched.thread);
> +       drm_sched_run_wq_stop(&ring->sched);
>
>         /* preempt the IB */
>         r = amdgpu_ring_preempt_ib(ring);
> @@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
>
>  failure:
>         /* restart the scheduler */
> -       kthread_unpark(ring->sched.thread);
> +       drm_sched_run_wq_start(&ring->sched);
>
>         up_read(&adev->reset_domain->sem);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 076ae400d099..9552929ccf87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4577,7 +4577,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                 struct amdgpu_ring *ring = adev->rings[i];
>
> -               if (!ring || !ring->sched.thread)
> +               if (!ring || !ring->sched.ready)
>                         continue;
>
>                 spin_lock(&ring->sched.job_list_lock);
> @@ -4708,7 +4708,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                 struct amdgpu_ring *ring = adev->rings[i];
>
> -               if (!ring || !ring->sched.thread)
> +               if (!ring || !ring->sched.ready)
>                         continue;
>
>                 /*clear job fence from fence drv to avoid force_completion
> @@ -5247,7 +5247,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                         struct amdgpu_ring *ring = tmp_adev->rings[i];
>
> -                       if (!ring || !ring->sched.thread)
> +                       if (!ring || !ring->sched.ready)
>                                 continue;
>
>                         drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> @@ -5321,7 +5321,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                         struct amdgpu_ring *ring = tmp_adev->rings[i];
>
> -                       if (!ring || !ring->sched.thread)
> +                       if (!ring || !ring->sched.ready)
>                                 continue;
>
>                         drm_sched_start(&ring->sched, true);
> @@ -5648,7 +5648,7 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                         struct amdgpu_ring *ring = adev->rings[i];
>
> -                       if (!ring || !ring->sched.thread)
> +                       if (!ring || !ring->sched.ready)
>                                 continue;
>
>                         drm_sched_stop(&ring->sched, NULL);
> @@ -5776,7 +5776,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                 struct amdgpu_ring *ring = adev->rings[i];
>
> -               if (!ring || !ring->sched.thread)
> +               if (!ring || !ring->sched.ready)
>                         continue;
>
>                 drm_sched_start(&ring->sched, true);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 27d52ffbb808..8c64045d0692 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -44,7 +44,6 @@
>   * The jobs in a entity are always scheduled in the order that they were pushed.
>   */
>
> -#include <linux/kthread.h>
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>
> @@ -251,6 +250,53 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>         return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
>  }
>
> +/**
> + * drm_sched_run_wq_stop - stop scheduler run worker
> + *
> + * @sched: scheduler instance to stop run worker
> + */
> +void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched)
> +{
> +       sched->pause_run_wq = true;
> +       smp_wmb();
> +
> +       cancel_work_sync(&sched->work_run);
> +}
> +EXPORT_SYMBOL(drm_sched_run_wq_stop);
> +
> +/**
> + * drm_sched_run_wq_start - start scheduler run worker
> + *
> + * @sched: scheduler instance to start run worker
> + */
> +void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched)
> +{
> +       sched->pause_run_wq = false;
> +       smp_wmb();
> +
> +       queue_work(sched->run_wq, &sched->work_run);
> +}
> +EXPORT_SYMBOL(drm_sched_run_wq_start);
> +
> +/**
> + * drm_sched_run_wq_queue - queue scheduler run worker
> + *
> + * @sched: scheduler instance to queue run worker
> + */
> +static void drm_sched_run_wq_queue(struct drm_gpu_scheduler *sched)
> +{
> +       smp_rmb();
> +
> +       /*
> +        * Try not to schedule work if pause_run_wq set but not the end of world
> +        * if we do as either it will be cancelled by the above
> +        * cancel_work_sync, or drm_sched_main turns into a NOP while
> +        * pause_run_wq is set.
> +        */
> +       if (!sched->pause_run_wq)
> +               queue_work(sched->run_wq, &sched->work_run);
> +}
> +
>  /**
>   * drm_sched_job_done - complete a job
>   * @s_job: pointer to the job which is done
> @@ -270,7 +316,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
>         dma_fence_get(&s_fence->finished);
>         drm_sched_fence_finished(s_fence);
>         dma_fence_put(&s_fence->finished);
> -       wake_up_interruptible(&sched->wake_up_worker);
> +       drm_sched_run_wq_queue(sched);
>  }
>
>  /**
> @@ -433,7 +479,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  {
>         struct drm_sched_job *s_job, *tmp;
>
> -       kthread_park(sched->thread);
> +       drm_sched_run_wq_stop(sched);
>
>         /*
>          * Reinsert back the bad job here - now it's safe as
> @@ -546,7 +592,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>                 spin_unlock(&sched->job_list_lock);
>         }
>
> -       kthread_unpark(sched->thread);
> +       drm_sched_run_wq_start(sched);
>  }
>  EXPORT_SYMBOL(drm_sched_start);
>
> @@ -831,7 +877,7 @@ static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
>  {
>         if (drm_sched_ready(sched))
> -               wake_up_interruptible(&sched->wake_up_worker);
> +               drm_sched_run_wq_queue(sched);
>  }
>
>  /**
> @@ -941,60 +987,42 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>  }
>  EXPORT_SYMBOL(drm_sched_pick_best);
>
> -/**
> - * drm_sched_blocked - check if the scheduler is blocked
> - *
> - * @sched: scheduler instance
> - *
> - * Returns true if blocked, otherwise false.
> - */
> -static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
> -{
> -       if (kthread_should_park()) {
> -               kthread_parkme();
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
>  /**
>   * drm_sched_main - main scheduler thread
>   *
>   * @param: scheduler instance
> - *
> - * Returns 0.
>   */
> -static int drm_sched_main(void *param)
> +static void drm_sched_main(struct work_struct *w)
>  {
> -       struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> +       struct drm_gpu_scheduler *sched =
> +               container_of(w, struct drm_gpu_scheduler, work_run);
>         int r;
>
> -       sched_set_fifo_low(current);
> -
> -       while (!kthread_should_stop()) {
> -               struct drm_sched_entity *entity = NULL;
> +       while (!READ_ONCE(sched->pause_run_wq)) {
> +               struct drm_sched_entity *entity;
>                 struct drm_sched_fence *s_fence;
>                 struct drm_sched_job *sched_job;
>                 struct dma_fence *fence;
> -               struct drm_sched_job *cleanup_job = NULL;
> +               struct drm_sched_job *cleanup_job;
>
> -               wait_event_interruptible(sched->wake_up_worker,
> -                                        (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> -                                        (!drm_sched_blocked(sched) &&
> -                                         (entity = drm_sched_select_entity(sched))) ||
> -                                        kthread_should_stop());
> +               cleanup_job = drm_sched_get_cleanup_job(sched);
> +               entity = drm_sched_select_entity(sched);
>
>                 if (cleanup_job)
>                         sched->ops->free_job(cleanup_job);
>
> -               if (!entity)
> +               if (!entity) {
> +                       if (!cleanup_job)
> +                               break;
>                         continue;
> +               }
>
>                 sched_job = drm_sched_entity_pop_job(entity);
>
>                 if (!sched_job) {
>                         complete_all(&entity->entity_idle);
> +                       if (!cleanup_job)
> +                               break;
>                         continue;
>                 }
>
> @@ -1022,14 +1050,14 @@ static int drm_sched_main(void *param)
>                                           r);
>                 } else {
>                         if (IS_ERR(fence))
> -                               dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> +                               dma_fence_set_error(&s_fence->finished,
> +                                                   PTR_ERR(fence));
>
>                         drm_sched_job_done(sched_job);
>                 }
>
>                 wake_up(&sched->job_scheduled);
>         }
> -       return 0;
>  }
>
>  /**
> @@ -1054,35 +1082,28 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>                    long timeout, struct workqueue_struct *timeout_wq,
>                    atomic_t *score, const char *name, struct device *dev)
>  {
> -       int i, ret;
> +       int i;
>         sched->ops = ops;
>         sched->hw_submission_limit = hw_submission;
>         sched->name = name;
>         sched->timeout = timeout;
>         sched->timeout_wq = timeout_wq ? : system_wq;
> +       sched->run_wq = system_wq;      /* FIXME: Let user pass this in */
>         sched->hang_limit = hang_limit;
>         sched->score = score ? score : &sched->_score;
>         sched->dev = dev;
>         for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
>                 drm_sched_rq_init(sched, &sched->sched_rq[i]);
>
> -       init_waitqueue_head(&sched->wake_up_worker);
>         init_waitqueue_head(&sched->job_scheduled);
>         INIT_LIST_HEAD(&sched->pending_list);
>         spin_lock_init(&sched->job_list_lock);
>         atomic_set(&sched->hw_rq_count, 0);
>         INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> +       INIT_WORK(&sched->work_run, drm_sched_main);
>         atomic_set(&sched->_score, 0);
>         atomic64_set(&sched->job_id_count, 0);
> -
> -       /* Each scheduler will run on a seperate kernel thread */
> -       sched->thread = kthread_run(drm_sched_main, sched, sched->name);
> -       if (IS_ERR(sched->thread)) {
> -               ret = PTR_ERR(sched->thread);
> -               sched->thread = NULL;
> -               DRM_DEV_ERROR(sched->dev, "Failed to create scheduler for %s.\n", name);
> -               return ret;
> -       }
> +       sched->pause_run_wq = false;
>
>         sched->ready = true;
>         return 0;
> @@ -1101,8 +1122,7 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>         struct drm_sched_entity *s_entity;
>         int i;
>
> -       if (sched->thread)
> -               kthread_stop(sched->thread);
> +       drm_sched_run_wq_stop(sched);
>
>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>                 struct drm_sched_rq *rq = &sched->sched_rq[i];
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ca857ec9e7eb..ff50f3c289cd 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -456,17 +456,16 @@ struct drm_sched_backend_ops {
>   * @timeout: the time after which a job is removed from the scheduler.
>   * @name: name of the ring for which this scheduler is being used.
>   * @sched_rq: priority wise array of run queues.
> - * @wake_up_worker: the wait queue on which the scheduler sleeps until a job
> - *                  is ready to be scheduled.
>   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
>   *                 waits on this wait queue until all the scheduled jobs are
>   *                 finished.
>   * @hw_rq_count: the number of jobs currently in the hardware queue.
>   * @job_id_count: used to assign unique id to the each job.
> + * @run_wq: workqueue used to queue @work_run
>   * @timeout_wq: workqueue used to queue @work_tdr
> + * @work_run: schedules jobs and cleans up entities
>   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>   *            timeout interval is over.
> - * @thread: the kthread on which the scheduler which run.
>   * @pending_list: the list of jobs which are currently in the job queue.
>   * @job_list_lock: lock to protect the pending_list.
>   * @hang_limit: once the hangs by a job crosses this limit then it is marked
> @@ -475,6 +474,7 @@ struct drm_sched_backend_ops {
>   * @_score: score used when the driver doesn't provide one
>   * @ready: marks if the underlying HW is ready to work
>   * @free_guilty: A hit to time out handler to free the guilty job.
> + * @pause_run_wq: pause queuing of @work_run on @run_wq
>   * @dev: system &struct device
>   *
>   * One scheduler is implemented for each hardware ring.
> @@ -485,13 +485,13 @@ struct drm_gpu_scheduler {
>         long                            timeout;
>         const char                      *name;
>         struct drm_sched_rq             sched_rq[DRM_SCHED_PRIORITY_COUNT];
> -       wait_queue_head_t               wake_up_worker;
>         wait_queue_head_t               job_scheduled;
>         atomic_t                        hw_rq_count;
>         atomic64_t                      job_id_count;
> +       struct workqueue_struct         *run_wq;
>         struct workqueue_struct         *timeout_wq;
> +       struct work_struct              work_run;
>         struct delayed_work             work_tdr;
> -       struct task_struct              *thread;
>         struct list_head                pending_list;
>         spinlock_t                      job_list_lock;
>         int                             hang_limit;
> @@ -499,6 +499,7 @@ struct drm_gpu_scheduler {
>         atomic_t                        _score;
>         bool                            ready;
>         bool                            free_guilty;
> +       bool                            pause_run_wq;
>         struct device                   *dev;
>  };
>
> @@ -529,6 +530,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> +void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
> +void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> --
> 2.37.3
>

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2022-12-23 17:42   ` [Intel-gfx] " Rob Clark
@ 2022-12-28 22:21     ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2022-12-28 22:21 UTC (permalink / raw)
  To: Rob Clark; +Cc: intel-gfx, dri-devel

On Fri, Dec 23, 2022 at 09:42:58AM -0800, Rob Clark wrote:
> On Thu, Dec 22, 2022 at 2:29 PM Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > mapping between a drm_gpu_scheduler and a drm_sched_entity. At first this
> > seems a bit odd, but let us explain the reasoning below.
> >
> > 1. In XE the submission order from multiple drm_sched_entity instances is
> > not guaranteed to match completion order, even when targeting the same
> > hardware engine. This is because in XE we have a firmware scheduler, the
> > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> > TDR falls apart, as the TDR expects submission order == completion order.
> > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > problem.
> >
> > 2. In XE submissions are done via programming a ring buffer (circular
> > buffer). A drm_gpu_scheduler provides a limit on the number of in-flight
> > jobs; if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
> > control on the ring for free.
> >
> > A problem with this design is that currently a drm_gpu_scheduler uses a
> > kthread for submission / job cleanup. This doesn't scale if a large
> > number of drm_gpu_schedulers are used. To work around the scaling issue,
> > use a worker rather than a kthread for submission / job cleanup.
> 
> You might want to enable CONFIG_DRM_MSM in your kconfig; I think you
> missed a part
> 
> BR,
> -R
> 

Thanks for the feedback, Rob - yes, indeed we missed updating the MSM
driver. It is fixed up in our Xe repo and will be corrected in the next
rev on the list.
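
To make point 2 of the commit message concrete - a sketch only, with the
ring/job sizes purely illustrative and the xe-side names hypothetical -
the 1:1 setup with ring-based flow control looks roughly like:

	#define XE_RING_SIZE	SZ_16K	/* illustrative */
	#define XE_MAX_JOB_SIZE	SZ_1K	/* illustrative */

	struct drm_gpu_scheduler *sched_list[] = { &q->sched };

	/* One scheduler per entity; hw_submission is sized so that ring
	 * space, not the scheduler, is the real flow-control limit. */
	err = drm_sched_init(&q->sched, &xe_sched_ops,
			     XE_RING_SIZE / XE_MAX_JOB_SIZE, /* hw_submission */
			     64 /* hang_limit */, timeout,
			     NULL /* timeout_wq */, NULL /* score */,
			     q->name, xe->drm.dev);
	if (!err)
		err = drm_sched_entity_init(&q->entity,
					    DRM_SCHED_PRIORITY_NORMAL,
					    sched_list, 1,
					    NULL /* guilty */);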

Matt

> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  12 +-
> >  drivers/gpu/drm/scheduler/sched_main.c      | 124 ++++++++++++--------
> >  include/drm/gpu_scheduler.h                 |  13 +-
> >  4 files changed, 93 insertions(+), 70 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index f60753f97ac5..9c2a10aeb0b3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
> >         for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> >                 struct amdgpu_ring *ring = adev->rings[i];
> >
> > -               if (!ring || !ring->sched.thread)
> > +               if (!ring || !ring->sched.ready)
> >                         continue;
> > -               kthread_park(ring->sched.thread);
> > +               drm_sched_run_wq_stop(&ring->sched);
> >         }
> >
> >         seq_printf(m, "run ib test:\n");
> > @@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
> >         for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> >                 struct amdgpu_ring *ring = adev->rings[i];
> >
> > -               if (!ring || !ring->sched.thread)
> > +               if (!ring || !ring->sched.ready)
> >                         continue;
> > -               kthread_unpark(ring->sched.thread);
> > +               drm_sched_run_wq_start(&ring->sched);
> >         }
> >
> >         up_write(&adev->reset_domain->sem);
> > @@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
> >
> >         ring = adev->rings[val];
> >
> > -       if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
> > +       if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
> >                 return -EINVAL;
> >
> >         /* the last preemption failed */
> > @@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
> >                 goto pro_end;
> >
> >         /* stop the scheduler */
> > -       kthread_park(ring->sched.thread);
> > +       drm_sched_run_wq_stop(&ring->sched);
> >
> >         /* preempt the IB */
> >         r = amdgpu_ring_preempt_ib(ring);
> > @@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
> >
> >  failure:
> >         /* restart the scheduler */
> > -       kthread_unpark(ring->sched.thread);
> > +       drm_sched_run_wq_start(&ring->sched);
> >
> >         up_read(&adev->reset_domain->sem);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 076ae400d099..9552929ccf87 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4577,7 +4577,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
> >         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >                 struct amdgpu_ring *ring = adev->rings[i];
> >
> > -               if (!ring || !ring->sched.thread)
> > +               if (!ring || !ring->sched.ready)
> >                         continue;
> >
> >                 spin_lock(&ring->sched.job_list_lock);
> > @@ -4708,7 +4708,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> >         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >                 struct amdgpu_ring *ring = adev->rings[i];
> >
> > -               if (!ring || !ring->sched.thread)
> > +               if (!ring || !ring->sched.ready)
> >                         continue;
> >
> >                 /*clear job fence from fence drv to avoid force_completion
> > @@ -5247,7 +5247,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >                         struct amdgpu_ring *ring = tmp_adev->rings[i];
> >
> > -                       if (!ring || !ring->sched.thread)
> > +                       if (!ring || !ring->sched.ready)
> >                                 continue;
> >
> >                         drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> > @@ -5321,7 +5321,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >                         struct amdgpu_ring *ring = tmp_adev->rings[i];
> >
> > -                       if (!ring || !ring->sched.thread)
> > +                       if (!ring || !ring->sched.ready)
> >                                 continue;
> >
> >                         drm_sched_start(&ring->sched, true);
> > @@ -5648,7 +5648,7 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
> >                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >                         struct amdgpu_ring *ring = adev->rings[i];
> >
> > -                       if (!ring || !ring->sched.thread)
> > +                       if (!ring || !ring->sched.ready)
> >                                 continue;
> >
> >                         drm_sched_stop(&ring->sched, NULL);
> > @@ -5776,7 +5776,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
> >         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >                 struct amdgpu_ring *ring = adev->rings[i];
> >
> > -               if (!ring || !ring->sched.thread)
> > +               if (!ring || !ring->sched.ready)
> >                         continue;
> >
> >                 drm_sched_start(&ring->sched, true);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 27d52ffbb808..8c64045d0692 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -44,7 +44,6 @@
> >   * The jobs in a entity are always scheduled in the order that they were pushed.
> >   */
> >
> > -#include <linux/kthread.h>
> >  #include <linux/wait.h>
> >  #include <linux/sched.h>
> >  #include <linux/completion.h>
> > @@ -251,6 +250,53 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >         return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
> >  }
> >
> > +/**
> > + * drm_sched_run_wq_stop - stop scheduler run worker
> > + *
> > + * @sched: scheduler instance to stop run worker
> > + */
> > +void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched)
> > +{
> > +       sched->pause_run_wq = true;
> > +       smp_wmb();
> > +
> > +       cancel_work_sync(&sched->work_run);
> > +}
> > +EXPORT_SYMBOL(drm_sched_run_wq_stop);
> > +
> > +/**
> > + * drm_sched_run_wq_start - start scheduler run worker
> > + *
> > + * @sched: scheduler instance to start run worker
> > + */
> > +void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched)
> > +{
> > +       sched->pause_run_wq = false;
> > +       smp_wmb();
> > +
> > +       queue_work(sched->run_wq, &sched->work_run);
> > +}
> > +EXPORT_SYMBOL(drm_sched_run_wq_start);
> > +
> > +/**
> > + * drm_sched_run_wq_queue - queue scheduler run worker
> > + *
> > + * @sched: scheduler instance to queue run worker
> > + */
> > +static void drm_sched_run_wq_queue(struct drm_gpu_scheduler *sched)
> > +{
> > +       smp_rmb();
> > +
> > +       /*
> > +        * Try not to schedule work if pause_run_wq set but not the end of world
> > +        * if we do as either it will be cancelled by the above
> > +        * cancel_work_sync, or drm_sched_main turns into a NOP while
> > +        * pause_run_wq is set.
> > +        */
> > +       if (!sched->pause_run_wq)
> > +               queue_work(sched->run_wq, &sched->work_run);
> > +}
> > +
> >  /**
> >   * drm_sched_job_done - complete a job
> >   * @s_job: pointer to the job which is done
> > @@ -270,7 +316,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
> >         dma_fence_get(&s_fence->finished);
> >         drm_sched_fence_finished(s_fence);
> >         dma_fence_put(&s_fence->finished);
> > -       wake_up_interruptible(&sched->wake_up_worker);
> > +       drm_sched_run_wq_queue(sched);
> >  }
> >
> >  /**
> > @@ -433,7 +479,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >  {
> >         struct drm_sched_job *s_job, *tmp;
> >
> > -       kthread_park(sched->thread);
> > +       drm_sched_run_wq_stop(sched);
> >
> >         /*
> >          * Reinsert back the bad job here - now it's safe as
> > @@ -546,7 +592,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> >                 spin_unlock(&sched->job_list_lock);
> >         }
> >
> > -       kthread_unpark(sched->thread);
> > +       drm_sched_run_wq_start(sched);
> >  }
> >  EXPORT_SYMBOL(drm_sched_start);
> >
> > @@ -831,7 +877,7 @@ static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
> >  {
> >         if (drm_sched_ready(sched))
> > -               wake_up_interruptible(&sched->wake_up_worker);
> > +               drm_sched_run_wq_queue(sched);
> >  }
> >
> >  /**
> > @@ -941,60 +987,42 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> >  }
> >  EXPORT_SYMBOL(drm_sched_pick_best);
> >
> > -/**
> > - * drm_sched_blocked - check if the scheduler is blocked
> > - *
> > - * @sched: scheduler instance
> > - *
> > - * Returns true if blocked, otherwise false.
> > - */
> > -static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
> > -{
> > -       if (kthread_should_park()) {
> > -               kthread_parkme();
> > -               return true;
> > -       }
> > -
> > -       return false;
> > -}
> > -
> >  /**
> >   * drm_sched_main - main scheduler thread
> >   *
> >   * @param: scheduler instance
> > - *
> > - * Returns 0.
> >   */
> > -static int drm_sched_main(void *param)
> > +static void drm_sched_main(struct work_struct *w)
> >  {
> > -       struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> > +       struct drm_gpu_scheduler *sched =
> > +               container_of(w, struct drm_gpu_scheduler, work_run);
> >         int r;
> >
> > -       sched_set_fifo_low(current);
> > -
> > -       while (!kthread_should_stop()) {
> > -               struct drm_sched_entity *entity = NULL;
> > +       while (!READ_ONCE(sched->pause_run_wq)) {
> > +               struct drm_sched_entity *entity;
> >                 struct drm_sched_fence *s_fence;
> >                 struct drm_sched_job *sched_job;
> >                 struct dma_fence *fence;
> > -               struct drm_sched_job *cleanup_job = NULL;
> > +               struct drm_sched_job *cleanup_job;
> >
> > -               wait_event_interruptible(sched->wake_up_worker,
> > -                                        (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> > -                                        (!drm_sched_blocked(sched) &&
> > -                                         (entity = drm_sched_select_entity(sched))) ||
> > -                                        kthread_should_stop());
> > +               cleanup_job = drm_sched_get_cleanup_job(sched);
> > +               entity = drm_sched_select_entity(sched);
> >
> >                 if (cleanup_job)
> >                         sched->ops->free_job(cleanup_job);
> >
> > -               if (!entity)
> > +               if (!entity) {
> > +                       if (!cleanup_job)
> > +                               break;
> >                         continue;
> > +               }
> >
> >                 sched_job = drm_sched_entity_pop_job(entity);
> >
> >                 if (!sched_job) {
> >                         complete_all(&entity->entity_idle);
> > +                       if (!cleanup_job)
> > +                               break;
> >                         continue;
> >                 }
> >
> > @@ -1022,14 +1050,14 @@ static int drm_sched_main(void *param)
> >                                           r);
> >                 } else {
> >                         if (IS_ERR(fence))
> > -                               dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> > +                               dma_fence_set_error(&s_fence->finished,
> > +                                                   PTR_ERR(fence));
> >
> >                         drm_sched_job_done(sched_job);
> >                 }
> >
> >                 wake_up(&sched->job_scheduled);
> >         }
> > -       return 0;
> >  }
> >
> >  /**
> > @@ -1054,35 +1082,28 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >                    long timeout, struct workqueue_struct *timeout_wq,
> >                    atomic_t *score, const char *name, struct device *dev)
> >  {
> > -       int i, ret;
> > +       int i;
> >         sched->ops = ops;
> >         sched->hw_submission_limit = hw_submission;
> >         sched->name = name;
> >         sched->timeout = timeout;
> >         sched->timeout_wq = timeout_wq ? : system_wq;
> > +       sched->run_wq = system_wq;      /* FIXME: Let user pass this in */
> >         sched->hang_limit = hang_limit;
> >         sched->score = score ? score : &sched->_score;
> >         sched->dev = dev;
> >         for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
> >                 drm_sched_rq_init(sched, &sched->sched_rq[i]);
> >
> > -       init_waitqueue_head(&sched->wake_up_worker);
> >         init_waitqueue_head(&sched->job_scheduled);
> >         INIT_LIST_HEAD(&sched->pending_list);
> >         spin_lock_init(&sched->job_list_lock);
> >         atomic_set(&sched->hw_rq_count, 0);
> >         INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > +       INIT_WORK(&sched->work_run, drm_sched_main);
> >         atomic_set(&sched->_score, 0);
> >         atomic64_set(&sched->job_id_count, 0);
> > -
> > -       /* Each scheduler will run on a seperate kernel thread */
> > -       sched->thread = kthread_run(drm_sched_main, sched, sched->name);
> > -       if (IS_ERR(sched->thread)) {
> > -               ret = PTR_ERR(sched->thread);
> > -               sched->thread = NULL;
> > -               DRM_DEV_ERROR(sched->dev, "Failed to create scheduler for %s.\n", name);
> > -               return ret;
> > -       }
> > +       sched->pause_run_wq = false;
> >
> >         sched->ready = true;
> >         return 0;
> > @@ -1101,8 +1122,7 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >         struct drm_sched_entity *s_entity;
> >         int i;
> >
> > -       if (sched->thread)
> > -               kthread_stop(sched->thread);
> > +       drm_sched_run_wq_stop(sched);
> >
> >         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >                 struct drm_sched_rq *rq = &sched->sched_rq[i];
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index ca857ec9e7eb..ff50f3c289cd 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -456,17 +456,16 @@ struct drm_sched_backend_ops {
> >   * @timeout: the time after which a job is removed from the scheduler.
> >   * @name: name of the ring for which this scheduler is being used.
> >   * @sched_rq: priority wise array of run queues.
> > - * @wake_up_worker: the wait queue on which the scheduler sleeps until a job
> > - *                  is ready to be scheduled.
> >   * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
> >   *                 waits on this wait queue until all the scheduled jobs are
> >   *                 finished.
> >   * @hw_rq_count: the number of jobs currently in the hardware queue.
> >   * @job_id_count: used to assign unique id to the each job.
> > + * @run_wq: workqueue used to queue @work_run
> >   * @timeout_wq: workqueue used to queue @work_tdr
> > + * @work_run: schedules jobs and cleans up entities
> >   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> >   *            timeout interval is over.
> > - * @thread: the kthread on which the scheduler which run.
> >   * @pending_list: the list of jobs which are currently in the job queue.
> >   * @job_list_lock: lock to protect the pending_list.
> >   * @hang_limit: once the hangs by a job crosses this limit then it is marked
> > @@ -475,6 +474,7 @@ struct drm_sched_backend_ops {
> >   * @_score: score used when the driver doesn't provide one
> >   * @ready: marks if the underlying HW is ready to work
> >   * @free_guilty: A hit to time out handler to free the guilty job.
> > + * @pause_run_wq: pause queuing of @work_run on @run_wq
> >   * @dev: system &struct device
> >   *
> >   * One scheduler is implemented for each hardware ring.
> > @@ -485,13 +485,13 @@ struct drm_gpu_scheduler {
> >         long                            timeout;
> >         const char                      *name;
> >         struct drm_sched_rq             sched_rq[DRM_SCHED_PRIORITY_COUNT];
> > -       wait_queue_head_t               wake_up_worker;
> >         wait_queue_head_t               job_scheduled;
> >         atomic_t                        hw_rq_count;
> >         atomic64_t                      job_id_count;
> > +       struct workqueue_struct         *run_wq;
> >         struct workqueue_struct         *timeout_wq;
> > +       struct work_struct              work_run;
> >         struct delayed_work             work_tdr;
> > -       struct task_struct              *thread;
> >         struct list_head                pending_list;
> >         spinlock_t                      job_list_lock;
> >         int                             hang_limit;
> > @@ -499,6 +499,7 @@ struct drm_gpu_scheduler {
> >         atomic_t                        _score;
> >         bool                            ready;
> >         bool                            free_guilty;
> > +       bool                            pause_run_wq;
> >         struct device                   *dev;
> >  };
> >
> > @@ -529,6 +530,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > +void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched);
> > +void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> >  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> > --
> > 2.37.3
> >
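
To make the conversion above concrete, here is a rough sketch of what the two new helpers declared in the header could look like, inferred only from the fields the diff adds (run_wq, work_run, pause_run_wq). The actual bodies live in the sched_main.c hunks of the patch and may differ:

void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched)
{
        /* Stop drm_sched_main() from being requeued... */
        sched->pause_run_wq = true;

        /* ...and wait for any iteration already running to finish. */
        cancel_work_sync(&sched->work_run);
}

void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched)
{
        sched->pause_run_wq = false;

        /* Kick one iteration in case jobs arrived while we were paused. */
        queue_work(sched->run_wq, &sched->work_run);
}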

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2022-12-22 22:21   ` Matthew Brost
@ 2022-12-30 10:20     ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2022-12-30 10:20 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

Hello Matthew,

On Thu, 22 Dec 2022 14:21:11 -0800
Matthew Brost <matthew.brost@intel.com> wrote:

> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd but let us explain the reasoning below.
> 
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to match the completion order, even if targeting the same
> hardware engine. This is because in XE we have a firmware scheduler, the
> GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> TDR falls apart as the TDR expects submission order == completion order.
> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> problem.

Oh, that's interesting. I've been trying to solve the same sort of
issues to support Arm's new Mali GPU which is relying on a FW-assisted
scheduling scheme (you give the FW N streams to execute, and it does
the scheduling between those N command streams, the kernel driver
does timeslice scheduling to update the command streams passed to the
FW). I must admit I gave up on using drm_sched at some point, mostly
because the integration with drm_sched was painful, but also because I
felt trying to bend drm_sched to make it interact with a
timeslice-oriented scheduling model wasn't really future proof. Giving
drm_sched_entity exclusive access to a drm_gpu_scheduler might
help for a few things (didn't think it through yet), but I feel it
falls short on other aspects we have to deal with on Arm GPUs. Here
are a few things I noted while working on the drm_sched-based PoC:

- The complexity of suspending/resuming streams and recovering from
  failures remains significant (because everything is still very asynchronous
  under the hood). Sure, you don't have to do this fancy
  timeslice-based scheduling, but that's still a lot of code, and
  AFAICT, it didn't integrate well with drm_sched TDR (my previous
  attempt at reconciling them has been unsuccessful, but maybe your
  patches would help there)
- You lose one of the nice things brought by timeslice-based
  scheduling: a tiny bit of fairness. That is, if one stream is queuing
  a compute job that's monopolizing the GPU core, you know the kernel
  part of the scheduler will eventually evict it and let other streams
  with same or higher priority run, even before the job timeout
  kicks in.
- Stream slots exposed by the Arm FW are not exactly HW queues that run
  things concurrently. The FW can decide to let only the stream with the
  highest priority get access to the various HW resources (GPU cores,
  tiler, ...), and let other streams starve. That means you might get
  spurious timeouts on some jobs/sched-entities while they didn't even
  get a chance to run.

So overall, and given I'm no longer the only one having to deal with a
FW scheduler that's designed with timeslice scheduling in mind, I'm
wondering if it's not time to design a common timeslice-based scheduler
instead of trying to bend drivers to use the model enforced by
drm_sched. But that's just my 2 cents, of course.
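
To picture the 1:1 arrangement concretely, the driver-side setup boils down to something like the sketch below. All my_* names are made up, and the drm_sched_init()/drm_sched_entity_init() signatures shown are the pre-series upstream ones (which this patch set may extend), so treat it as illustrative only:

static const struct drm_sched_backend_ops my_sched_ops = {
        /* .run_job, .timedout_job, .free_job, ... */
};

struct my_queue {
        struct drm_gpu_scheduler sched;
        struct drm_sched_entity entity;
};

static int my_queue_init(struct my_queue *q, struct device *dev)
{
        struct drm_gpu_scheduler *sched_list[] = { &q->sched };
        int ret;

        /* One scheduler instance per user queue: no inter-entity
         * arbitration happens here, the FW does the real scheduling.
         */
        ret = drm_sched_init(&q->sched, &my_sched_ops,
                             64 /* hw_submission */, 0 /* hang_limit */,
                             msecs_to_jiffies(5000) /* timeout */,
                             NULL /* timeout_wq */, NULL /* score */,
                             "my-queue", dev);
        if (ret)
                return ret;

        /* The one and only entity ever bound to this scheduler. */
        return drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
                                     sched_list, 1, NULL /* guilty */);
}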

Regards,

Boris

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2022-12-30 10:20     ` [Intel-gfx] " Boris Brezillon
@ 2022-12-30 11:55       ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2022-12-30 11:55 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Fri, 30 Dec 2022 11:20:42 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hello Matthew,
> 
> On Thu, 22 Dec 2022 14:21:11 -0800
> Matthew Brost <matthew.brost@intel.com> wrote:
> 
> > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > seems a bit odd but let us explain the reasoning below.
> > 
> > 1. In XE the submission order from multiple drm_sched_entity is not
> > guaranteed to match the completion order, even if targeting the same
> > hardware engine. This is because in XE we have a firmware scheduler, the
> > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> > TDR falls apart as the TDR expects submission order == completion order.
> > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > problem.
> 
> Oh, that's interesting. I've been trying to solve the same sort of
> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> scheduling scheme (you give the FW N streams to execute, and it does
> the scheduling between those N command streams, the kernel driver
> does timeslice scheduling to update the command streams passed to the
> FW). I must admit I gave up on using drm_sched at some point, mostly
> because the integration with drm_sched was painful, but also because I
> felt trying to bend drm_sched to make it interact with a
> timeslice-oriented scheduling model wasn't really future proof. Giving
> drm_sched_entity exclusive access to a drm_gpu_scheduler might
> help for a few things (didn't think it through yet), but I feel it
> falls short on other aspects we have to deal with on Arm GPUs.

Ok, so I just had a quick look at the Xe driver and how it
instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
have a better understanding of how you get away with using drm_sched
while still controlling how scheduling is really done. Here
drm_gpu_scheduler is just a dummy abstraction that lets you use the
drm_sched job queuing/dep/tracking mechanism. The whole run-queue
selection is dumb because there's only one entity ever bound to the
scheduler (the one that's part of the xe_guc_engine object which also
contains the drm_gpu_scheduler instance). I guess the main issue we'd
have on Arm is the fact that the stream doesn't necessarily get
scheduled when ->run_job() is called; it can be placed in the runnable
queue and be picked later by the kernel-side scheduler when a FW slot
gets released. That can probably be sorted out by manually disabling the
job timer and re-enabling it when the stream gets picked by the
scheduler. But my main concern remains, we're basically abusing
drm_sched here.

For the Arm driver, that means turning the following sequence

1. wait for job deps
2. queue job to ringbuf and push the stream to the runnable
   queue (if it wasn't queued already). Wake up the timeslice scheduler
   to re-evaluate (if the stream is not on a FW slot already)
3. stream gets picked by the timeslice scheduler and sent to the FW for
   execution

into

1. queue job to entity which takes care of waiting for job deps for
   us
2. schedule a drm_sched_main iteration
3. the only available entity is picked, and the first job from this
   entity is dequeued. ->run_job() is called: the job is queued to the
   ringbuf and the stream is pushed to the runnable queue (if it wasn't
   queued already). Wake up the timeslice scheduler to re-evaluate (if
   the stream is not on a FW slot already)
4. stream gets picked by the timeslice scheduler and sent to the FW for
   execution

That's one extra step we don't really need. To sum up, yes, all the
job/entity tracking might be interesting to share/re-use, but I wonder
if we couldn't have that without pulling out the scheduling part of
drm_sched, or maybe I'm missing something, and there's something in
drm_gpu_scheduler you really need.
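
In code, the thin ->run_job() of that second flow could be little more than the sketch below. Everything prefixed my_ is a hypothetical name; the only hard drm_sched contract is returning a fence that signals when the job completes:

struct my_job {
        struct drm_sched_job base;
        struct my_stream *stream;       /* hypothetical FW stream */
        struct dma_fence *hw_fence;     /* signaled on FW completion */
};

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);
        struct my_stream *stream = job->stream;

        /* Nothing executes here: we only queue the job's instructions
         * and mark the stream runnable. Execution starts whenever the
         * FW-slot scheduler later picks the stream.
         */
        my_stream_write_ringbuf(stream, job);
        my_stream_make_runnable(stream);
        my_timeslice_sched_wakeup(stream);

        return dma_fence_get(job->hw_fence);
}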

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2022-12-30 11:55       ` [Intel-gfx] " Boris Brezillon
@ 2023-01-02  7:30         ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-02  7:30 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

On Fri, 30 Dec 2022 12:55:08 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Fri, 30 Dec 2022 11:20:42 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > Hello Matthew,
> > 
> > On Thu, 22 Dec 2022 14:21:11 -0800
> > Matthew Brost <matthew.brost@intel.com> wrote:
> >   
> > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > seems a bit odd but let us explain the reasoning below.
> > > 
> > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > guaranteed to match the completion order, even if targeting the same
> > > hardware engine. This is because in XE we have a firmware scheduler, the
> > > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > > using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> > > TDR falls apart as the TDR expects submission order == completion order.
> > > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > > problem.
> > 
> > Oh, that's interesting. I've been trying to solve the same sort of
> > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > scheduling scheme (you give the FW N streams to execute, and it does
> > the scheduling between those N command streams, the kernel driver
> > does timeslice scheduling to update the command streams passed to the
> > FW). I must admit I gave up on using drm_sched at some point, mostly
> > because the integration with drm_sched was painful, but also because I
> > felt trying to bend drm_sched to make it interact with a
> > timeslice-oriented scheduling model wasn't really future proof. Giving
> > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > help for a few things (didn't think it through yet), but I feel it
> > falls short on other aspects we have to deal with on Arm GPUs.
> 
> Ok, so I just had a quick look at the Xe driver and how it
> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> have a better understanding of how you get away with using drm_sched
> while still controlling how scheduling is really done. Here
> drm_gpu_scheduler is just a dummy abstraction that lets you use the
> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> selection is dumb because there's only one entity ever bound to the
> scheduler (the one that's part of the xe_guc_engine object which also
> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> have on Arm is the fact that the stream doesn't necessarily get
> scheduled when ->run_job() is called; it can be placed in the runnable
> queue and be picked later by the kernel-side scheduler when a FW slot
> gets released. That can probably be sorted out by manually disabling the
> job timer and re-enabling it when the stream gets picked by the
> scheduler. But my main concern remains, we're basically abusing
> drm_sched here.
> 
> For the Arm driver, that means turning the following sequence
> 
> 1. wait for job deps
> 2. queue job to ringbuf and push the stream to the runnable
>    queue (if it wasn't queued already). Wake up the timeslice scheduler
>    to re-evaluate (if the stream is not on a FW slot already)
> 3. stream gets picked by the timeslice scheduler and sent to the FW for
>    execution
> 
> into
> 
> 1. queue job to entity which takes care of waiting for job deps for
>    us
> 2. schedule a drm_sched_main iteration
> 3. the only available entity is picked, and the first job from this
>    entity is dequeued. ->run_job() is called: the job is queued to the
>    ringbuf and the stream is pushed to the runnable queue (if it wasn't
>    queued already). Wake up the timeslice scheduler to re-evaluate (if
>    the stream is not on a FW slot already)
> 4. stream gets picked by the timeslice scheduler and sent to the FW for
>    execution
> 
> That's one extra step we don't really need. To sum up, yes, all the
> job/entity tracking might be interesting to share/re-use, but I wonder
> if we couldn't have that without pulling out the scheduling part of
> drm_sched, or maybe I'm missing something, and there's something in
> drm_gpu_scheduler you really need.

On second thought, that's probably an acceptable overhead (not even
sure the extra step I was mentioning exists in practice, because dep
fence signaled state is checked as part of the drm_sched_main
iteration, so that's basically replacing the worker I schedule to
check job deps), and I like the idea of being able to re-use drm_sched
dep-tracking without resorting to invasive changes to the existing
logic, so I'll probably give it a try.
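
For reference, the dep-tracking re-use boils down to the standard drm_sched submission dance, with the entity doing the waiting. my_job/my_submit are hypothetical names; the drm_sched_job_* calls are the upstream ones:

struct my_job {
        struct drm_sched_job base;
        /* driver payload elided */
};

static int my_submit(struct my_job *job, struct drm_sched_entity *entity,
                     struct dma_fence *in_fence)
{
        int ret;

        ret = drm_sched_job_init(&job->base, entity, NULL /* owner */);
        if (ret)
                return ret;

        /* The entity holds the job back until this fence has signaled;
         * note drm_sched_job_add_dependency() consumes a fence reference.
         */
        if (in_fence) {
                ret = drm_sched_job_add_dependency(&job->base,
                                                   dma_fence_get(in_fence));
                if (ret)
                        return ret;
        }

        drm_sched_job_arm(&job->base);
        drm_sched_entity_push_job(&job->base);
        return 0;
}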

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2023-01-02  8:14   ` Thomas Zimmermann
  -1 siblings, 0 replies; 161+ messages in thread
From: Thomas Zimmermann @ 2023-01-02  8:14 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel


Hi

On 22.12.22 at 23:21, Matthew Brost wrote:
> Hello,
> 
> This is a submission for Xe, a new driver for Intel GPUs that supports both
> integrated and discrete platforms starting with Tiger Lake (first platform with
> Intel Xe Architecture). The intention of this new driver is to have a fresh base
> to work from that is unencumbered by older platforms, whilst also taking the
> opportunity to rearchitect our driver to increase sharing across the drm
> subsystem, both leveraging and allowing us to contribute more towards other
> shared components like TTM and drm/scheduler. The memory model is based on VM
> bind which is similar to the i915 implementation. Likewise the execbuf
> implementation for Xe is very similar to execbuf3 in the i915 [1].

After Xe has stabilized, will i915 lose the ability to drive this 
hardware (and possibly others)?  I'm specifically thinking of the i915 
code that requires TTM. Keeping that dependency within Xe only might 
benefit DRM as a whole.

> 
> The code is at a stage where it is already functional and has experimental
> support for multiple platforms starting from Tiger Lake, with initial support
> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> implementation will be released publicly early next year. We also have a suite
> of IGTs for XE that will appear on the IGT list shortly.
> 
> It has been built with the assumption of supporting multiple architectures from
> the get-go, right now with tests running both on X86 and ARM hosts. And we
> intend to continue working on it and improving on it as part of the kernel
> community upstream.
> 
> The new Xe driver leverages a lot from i915 and work on i915 continues as we
> ready Xe for production throughout 2023.
> 
> As for display, the intent is to share the display code with the i915 driver so
> that there is maximum reuse there. Currently this is being done by compiling the
> display code twice, but alternatives to that are under consideration and we want
> to have more discussion on what the best final solution will look like over the
> next few months. Right now, work is ongoing in refactoring the display codebase
> to remove as much as possible any unnecessary dependencies on i915 specific data
> structures there..

Could both drivers reside in a common parent directory and share 
something like a DRM Intel helper module with the common code? This 
would fit well with the common design of DRM helpers.

Best regards
Thomas

> 
> We currently have 2 submission backends, execlists and GuC. The execlist is
> meant mostly for testing and is not fully functional while GuC backend is fully
> functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> required and should be placed in /lib/firmware/xe.
> 
> The GuC firmware can be found in the below location:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
> 
> The easiest way to setup firmware is:
> cp -r /lib/firmware/i915 /lib/firmware/xe
> 
> The code has been organized such that we have all patches that touch areas
> outside of drm/xe first for review, and then the actual new driver in a separate
> commit. The code which is outside of drm/xe is included in this RFC while
> drm/xe is not due to the size of the commit. The drm/xe is code is available in
> a public repo listed below.
> 
> Xe driver commit:
> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
> 
> Xe kernel repo:
> https://cgit.freedesktop.org/drm/drm-xe/
> 
> There's a lot of work still to happen on Xe but we're very excited about it and
> wanted to share it early and welcome feedback and discussion.
> 
> Cheers,
> Matthew Brost
> 
> [1] https://patchwork.freedesktop.org/series/105879/
> [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20418
> 
> Maarten Lankhorst (12):
>    drm/amd: Convert amdgpu to use suballocation helper.
>    drm/radeon: Use the drm suballocation manager implementation.
>    drm/i915: Remove gem and overlay frontbuffer tracking
>    drm/i915/display: Neuter frontbuffer tracking harder
>    drm/i915/display: Add more macros to remove all direct calls to uncore
>    drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
>    drm/i915: Rename find_section to find_bdb_section
>    drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
>    drm/i915/display: Fix a use-after-free when intel_edp_init_connector
>      fails
>    drm/i915/display: Remaining changes to make xe compile
>    sound/hda: Allow XE as i915 replacement for sound
>    mei/hdcp: Also enable for XE
> 
> Matthew Brost (5):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
> 
> Thomas Hellström (3):
>    drm/suballoc: Introduce a generic suballocation manager
>    drm: Add a gpu page-table walker helper
>    drm/ttm: Don't print error message if eviction was interrupted
> 
>   drivers/gpu/drm/Kconfig                       |   5 +
>   drivers/gpu/drm/Makefile                      |   4 +
>   drivers/gpu/drm/amd/amdgpu/Kconfig            |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  26 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  12 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  23 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c        | 320 +-----------------
>   drivers/gpu/drm/drm_pt_walk.c                 | 159 +++++++++
>   drivers/gpu/drm/drm_suballoc.c                | 301 ++++++++++++++++
>   drivers/gpu/drm/i915/Makefile                 |   2 +-
>   drivers/gpu/drm/i915/display/hsw_ips.c        |   7 +-
>   drivers/gpu/drm/i915/display/i9xx_plane.c     |   1 +
>   drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
>   .../gpu/drm/i915/display/intel_atomic_plane.c |  25 +-
>   .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
>   drivers/gpu/drm/i915/display/intel_bios.c     |  71 ++--
>   drivers/gpu/drm/i915/display/intel_bw.c       |  36 +-
>   drivers/gpu/drm/i915/display/intel_cdclk.c    |  68 ++--
>   drivers/gpu/drm/i915/display/intel_color.c    |   1 +
>   drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
>   drivers/gpu/drm/i915/display/intel_cursor.c   |  14 +-
>   drivers/gpu/drm/i915/display/intel_de.h       |  38 +++
>   drivers/gpu/drm/i915/display/intel_display.c  | 155 +++++++--
>   drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
>   .../gpu/drm/i915/display/intel_display_core.h |   5 +-
>   .../drm/i915/display/intel_display_debugfs.c  |   8 +
>   .../drm/i915/display/intel_display_power.c    |  40 ++-
>   .../drm/i915/display/intel_display_power.h    |   6 +
>   .../i915/display/intel_display_power_map.c    |   7 +
>   .../i915/display/intel_display_power_well.c   |  24 +-
>   .../drm/i915/display/intel_display_reg_defs.h |   4 +
>   .../drm/i915/display/intel_display_trace.h    |   6 +
>   .../drm/i915/display/intel_display_types.h    |  32 +-
>   drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
>   drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
>   drivers/gpu/drm/i915/display/intel_dp_aux.c   |   6 +
>   drivers/gpu/drm/i915/display/intel_dpio_phy.c |   9 +-
>   drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 +
>   drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
>   drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
>   drivers/gpu/drm/i915/display/intel_drrs.c     |   1 +
>   drivers/gpu/drm/i915/display/intel_dsb.c      | 124 +++++--
>   drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 +-
>   drivers/gpu/drm/i915/display/intel_fb.c       | 108 ++++--
>   drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
>   drivers/gpu/drm/i915/display/intel_fbc.c      |  49 ++-
>   drivers/gpu/drm/i915/display/intel_fbdev.c    | 108 +++++-
>   .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 +-----
>   .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +---
>   drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
>   drivers/gpu/drm/i915/display/intel_hdcp.c     |   9 +-
>   drivers/gpu/drm/i915/display/intel_hdmi.c     |   1 -
>   .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
>   .../drm/i915/display/intel_modeset_setup.c    |  11 +-
>   drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
>   drivers/gpu/drm/i915/display/intel_overlay.c  |  14 -
>   .../gpu/drm/i915/display/intel_pch_display.h  |  16 +
>   .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
>   drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
>   .../drm/i915/display/intel_plane_initial.c    |   3 +-
>   drivers/gpu/drm/i915/display/intel_psr.c      |   1 +
>   drivers/gpu/drm/i915/display/intel_sprite.c   |  21 ++
>   drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
>   drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
>   drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
>   .../drm/i915/display/skl_universal_plane.c    |  52 ++-
>   drivers/gpu/drm/i915/display/skl_watermark.c  |  25 +-
>   drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   4 -
>   drivers/gpu/drm/i915/gem/i915_gem_domain.c    |   7 -
>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 -
>   drivers/gpu/drm/i915/gem/i915_gem_object.c    |  25 --
>   drivers/gpu/drm/i915/gem/i915_gem_object.h    |  22 --
>   drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   4 -
>   drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
>   drivers/gpu/drm/i915/i915_driver.c            |   1 +
>   drivers/gpu/drm/i915/i915_gem.c               |   8 -
>   drivers/gpu/drm/i915/i915_gem_gtt.c           |   1 -
>   drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
>   drivers/gpu/drm/i915/i915_vma.c               |  12 -
>   drivers/gpu/drm/radeon/radeon.h               |  55 +--
>   drivers/gpu/drm/radeon/radeon_ib.c            |  12 +-
>   drivers/gpu/drm/radeon/radeon_object.h        |  25 +-
>   drivers/gpu/drm/radeon/radeon_sa.c            | 314 ++---------------
>   drivers/gpu/drm/radeon/radeon_semaphore.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c        | 182 +++++++---
>   drivers/gpu/drm/ttm/ttm_bo.c                  |   3 +-
>   drivers/misc/mei/hdcp/Kconfig                 |   2 +-
>   drivers/misc/mei/hdcp/mei_hdcp.c              |   3 +-
>   include/drm/drm_pt_walk.h                     | 161 +++++++++
>   include/drm/drm_suballoc.h                    | 112 ++++++
>   include/drm/gpu_scheduler.h                   |  41 ++-
>   sound/hda/hdac_i915.c                         |  17 +-
>   sound/pci/hda/hda_intel.c                     |  56 +--
>   sound/soc/intel/avs/core.c                    |  13 +-
>   sound/soc/sof/intel/hda.c                     |   7 +-
>   98 files changed, 2076 insertions(+), 1325 deletions(-)
>   create mode 100644 drivers/gpu/drm/drm_pt_walk.c
>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>   create mode 100644 include/drm/drm_pt_walk.h
>   create mode 100644 include/drm/drm_suballoc.h
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-02  8:14   ` [Intel-gfx] " Thomas Zimmermann
@ 2023-01-02 11:42     ` Jani Nikula
  -1 siblings, 0 replies; 161+ messages in thread
From: Jani Nikula @ 2023-01-02 11:42 UTC (permalink / raw)
  To: Thomas Zimmermann, Matthew Brost, intel-gfx, dri-devel

On Mon, 02 Jan 2023, Thomas Zimmermann <tzimmermann@suse.de> wrote:
> Hi
>
> On 22.12.22 at 23:21, Matthew Brost wrote:
>> Hello,
>> 
>> This is a submission for Xe, a new driver for Intel GPUs that supports both
>> integrated and discrete platforms starting with Tiger Lake (first platform with
>> Intel Xe Architecture). The intention of this new driver is to have a fresh base
>> to work from that is unencumbered by older platforms, whilst also taking the
>> opportunity to rearchitect our driver to increase sharing across the drm
>> subsystem, both leveraging and allowing us to contribute more towards other
>> shared components like TTM and drm/scheduler. The memory model is based on VM
>> bind which is similar to the i915 implementation. Likewise the execbuf
>> implementation for Xe is very similar to execbuf3 in the i915 [1].
>
> After Xe has stabilized, will i915 lose the ability to drive this 
> hardware (and possibly others)?  I'm specifically thinking of the i915 
> code that requires TTM. Keeping that dependency within Xe only might 
> benefit DRM as a whole.

There's going to be a number of platforms supported by both drivers, and
from purely an i915 standpoint, dropping any currently supported platforms
or that dependency from i915 would be a regression.

>> 
>> The code is at a stage where it is already functional and has experimental
>> support for multiple platforms starting from Tiger Lake, with initial support
>> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
>> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
>> implementation will be released publicly early next year. We also have a suite
>> of IGTs for XE that will appear on the IGT list shortly.
>> 
>> It has been built with the assumption of supporting multiple architectures from
>> the get-go, right now with tests running both on X86 and ARM hosts. And we
>> intend to continue working on it and improving on it as part of the kernel
>> community upstream.
>> 
>> The new Xe driver leverages a lot from i915 and work on i915 continues as we
>> ready Xe for production throughout 2023.
>> 
>> As for display, the intent is to share the display code with the i915 driver so
>> that there is maximum reuse there. Currently this is being done by compiling the
>> display code twice, but alternatives to that are under consideration and we want
>> to have more discussion on what the best final solution will look like over the
>> next few months. Right now, work is ongoing in refactoring the display codebase
>> to remove as much as possible any unnecessary dependencies on i915 specific data
>> structures there..
>
> Could both drivers reside in a common parent directory and share 
> something like a DRM Intel helper module with the common code? This 
> would fit well with the common design of DRM helpers.

I think it's too early to tell.

For one thing, setting that up would be a lot of up-front infrastructure
work. I'm not sure how to even pull that off when Xe is still
out-of-tree and i915 development plunges on upstream as ever.

For another, realistically, the overlap between supported platforms is
going to end at some point, and eventually new platforms are only going
to be supported with Xe. That's going to open up new possibilities for
refactoring also the display code. I think it would be premature to lock
in to a common directory structure or a common helper module at this
point.

I'm not saying no to the idea, and we've contemplated it before, but I
think there are still too many moving parts to decide to go that way.


BR,
Jani.


>
> Best regards
> Thomas
>
>> 
>> We currently have 2 submission backends, execlists and GuC. The execlist is
>> meant mostly for testing and is not fully functional while GuC backend is fully
>> functional. As with the i915 and GuC submission, in Xe the GuC firmware is
>> required and should be placed in /lib/firmware/xe.
>> 
>> The GuC firmware can be found in the below location:
>> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
>> 
>> The easiest way to setup firmware is:
>> cp -r /lib/firmware/i915 /lib/firmware/xe
>> 
>> The code has been organized such that we have all patches that touch areas
>> outside of drm/xe first for review, and then the actual new driver in a separate
>> commit. The code which is outside of drm/xe is included in this RFC while
>> drm/xe is not due to the size of the commit. The drm/xe is code is available in
>> a public repo listed below.
>> 
>> Xe driver commit:
>> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
>> 
>> Xe kernel repo:
>> https://cgit.freedesktop.org/drm/drm-xe/
>> 
>> There's a lot of work still to happen on Xe but we're very excited about it and
>> wanted to share it early and welcome feedback and discussion.
>> 
>> Cheers,
>> Matthew Brost
>> 
>> [1] https://patchwork.freedesktop.org/series/105879/
>> [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20418
>> 
>> Maarten Lankhorst (12):
>>    drm/amd: Convert amdgpu to use suballocation helper.
>>    drm/radeon: Use the drm suballocation manager implementation.
>>    drm/i915: Remove gem and overlay frontbuffer tracking
>>    drm/i915/display: Neuter frontbuffer tracking harder
>>    drm/i915/display: Add more macros to remove all direct calls to uncore
>>    drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
>>    drm/i915: Rename find_section to find_bdb_section
>>    drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
>>    drm/i915/display: Fix a use-after-free when intel_edp_init_connector
>>      fails
>>    drm/i915/display: Remaining changes to make xe compile
>>    sound/hda: Allow XE as i915 replacement for sound
>>    mei/hdcp: Also enable for XE
>> 
>> Matthew Brost (5):
>>    drm/sched: Convert drm scheduler to use a work queue rather than
>>      kthread
>>    drm/sched: Add generic scheduler message interface
>>    drm/sched: Start run wq before TDR in drm_sched_start
>>    drm/sched: Submit job before starting TDR
>>    drm/sched: Add helper to set TDR timeout
>> 
>> Thomas Hellström (3):
>>    drm/suballoc: Introduce a generic suballocation manager
>>    drm: Add a gpu page-table walker helper
>>    drm/ttm: Don't print error message if eviction was interrupted
>> 
>>   drivers/gpu/drm/Kconfig                       |   5 +
>>   drivers/gpu/drm/Makefile                      |   4 +
>>   drivers/gpu/drm/amd/amdgpu/Kconfig            |   1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  26 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  12 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        |   5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  23 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c        | 320 +-----------------
>>   drivers/gpu/drm/drm_pt_walk.c                 | 159 +++++++++
>>   drivers/gpu/drm/drm_suballoc.c                | 301 ++++++++++++++++
>>   drivers/gpu/drm/i915/Makefile                 |   2 +-
>>   drivers/gpu/drm/i915/display/hsw_ips.c        |   7 +-
>>   drivers/gpu/drm/i915/display/i9xx_plane.c     |   1 +
>>   drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
>>   .../gpu/drm/i915/display/intel_atomic_plane.c |  25 +-
>>   .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
>>   drivers/gpu/drm/i915/display/intel_bios.c     |  71 ++--
>>   drivers/gpu/drm/i915/display/intel_bw.c       |  36 +-
>>   drivers/gpu/drm/i915/display/intel_cdclk.c    |  68 ++--
>>   drivers/gpu/drm/i915/display/intel_color.c    |   1 +
>>   drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
>>   drivers/gpu/drm/i915/display/intel_cursor.c   |  14 +-
>>   drivers/gpu/drm/i915/display/intel_de.h       |  38 +++
>>   drivers/gpu/drm/i915/display/intel_display.c  | 155 +++++++--
>>   drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
>>   .../gpu/drm/i915/display/intel_display_core.h |   5 +-
>>   .../drm/i915/display/intel_display_debugfs.c  |   8 +
>>   .../drm/i915/display/intel_display_power.c    |  40 ++-
>>   .../drm/i915/display/intel_display_power.h    |   6 +
>>   .../i915/display/intel_display_power_map.c    |   7 +
>>   .../i915/display/intel_display_power_well.c   |  24 +-
>>   .../drm/i915/display/intel_display_reg_defs.h |   4 +
>>   .../drm/i915/display/intel_display_trace.h    |   6 +
>>   .../drm/i915/display/intel_display_types.h    |  32 +-
>>   drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
>>   drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
>>   drivers/gpu/drm/i915/display/intel_dp_aux.c   |   6 +
>>   drivers/gpu/drm/i915/display/intel_dpio_phy.c |   9 +-
>>   drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 +
>>   drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
>>   drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
>>   drivers/gpu/drm/i915/display/intel_drrs.c     |   1 +
>>   drivers/gpu/drm/i915/display/intel_dsb.c      | 124 +++++--
>>   drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 +-
>>   drivers/gpu/drm/i915/display/intel_fb.c       | 108 ++++--
>>   drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
>>   drivers/gpu/drm/i915/display/intel_fbc.c      |  49 ++-
>>   drivers/gpu/drm/i915/display/intel_fbdev.c    | 108 +++++-
>>   .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 +-----
>>   .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +---
>>   drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
>>   drivers/gpu/drm/i915/display/intel_hdcp.c     |   9 +-
>>   drivers/gpu/drm/i915/display/intel_hdmi.c     |   1 -
>>   .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
>>   .../drm/i915/display/intel_modeset_setup.c    |  11 +-
>>   drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
>>   drivers/gpu/drm/i915/display/intel_overlay.c  |  14 -
>>   .../gpu/drm/i915/display/intel_pch_display.h  |  16 +
>>   .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
>>   drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
>>   .../drm/i915/display/intel_plane_initial.c    |   3 +-
>>   drivers/gpu/drm/i915/display/intel_psr.c      |   1 +
>>   drivers/gpu/drm/i915/display/intel_sprite.c   |  21 ++
>>   drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
>>   drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
>>   drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
>>   .../drm/i915/display/skl_universal_plane.c    |  52 ++-
>>   drivers/gpu/drm/i915/display/skl_watermark.c  |  25 +-
>>   drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   4 -
>>   drivers/gpu/drm/i915/gem/i915_gem_domain.c    |   7 -
>>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 -
>>   drivers/gpu/drm/i915/gem/i915_gem_object.c    |  25 --
>>   drivers/gpu/drm/i915/gem/i915_gem_object.h    |  22 --
>>   drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   4 -
>>   drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
>>   drivers/gpu/drm/i915/i915_driver.c            |   1 +
>>   drivers/gpu/drm/i915/i915_gem.c               |   8 -
>>   drivers/gpu/drm/i915/i915_gem_gtt.c           |   1 -
>>   drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
>>   drivers/gpu/drm/i915/i915_vma.c               |  12 -
>>   drivers/gpu/drm/radeon/radeon.h               |  55 +--
>>   drivers/gpu/drm/radeon/radeon_ib.c            |  12 +-
>>   drivers/gpu/drm/radeon/radeon_object.h        |  25 +-
>>   drivers/gpu/drm/radeon/radeon_sa.c            | 314 ++---------------
>>   drivers/gpu/drm/radeon/radeon_semaphore.c     |   6 +-
>>   drivers/gpu/drm/scheduler/sched_main.c        | 182 +++++++---
>>   drivers/gpu/drm/ttm/ttm_bo.c                  |   3 +-
>>   drivers/misc/mei/hdcp/Kconfig                 |   2 +-
>>   drivers/misc/mei/hdcp/mei_hdcp.c              |   3 +-
>>   include/drm/drm_pt_walk.h                     | 161 +++++++++
>>   include/drm/drm_suballoc.h                    | 112 ++++++
>>   include/drm/gpu_scheduler.h                   |  41 ++-
>>   sound/hda/hdac_i915.c                         |  17 +-
>>   sound/pci/hda/hda_intel.c                     |  56 +--
>>   sound/soc/intel/avs/core.c                    |  13 +-
>>   sound/soc/sof/intel/hda.c                     |   7 +-
>>   98 files changed, 2076 insertions(+), 1325 deletions(-)
>>   create mode 100644 drivers/gpu/drm/drm_pt_walk.c
>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>   create mode 100644 include/drm/drm_pt_walk.h
>>   create mode 100644 include/drm/drm_suballoc.h
>> 

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
                   ` (22 preceding siblings ...)
  (?)
@ 2023-01-03 12:21 ` Tvrtko Ursulin
  2023-01-05 21:27   ` Matthew Brost
  -1 siblings, 1 reply; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-03 12:21 UTC (permalink / raw)
  To: Matthew Brost, intel-gfx, dri-devel


On 22/12/2022 22:21, Matthew Brost wrote:
> Hello,
> 
> This is a submission for Xe, a new driver for Intel GPUs that supports both
> integrated and discrete platforms starting with Tiger Lake (first platform with
> Intel Xe Architecture). The intention of this new driver is to have a fresh base
> to work from that is unencumbered by older platforms, whilst also taking the
> opportunity to rearchitect our driver to increase sharing across the drm
> subsystem, both leveraging and allowing us to contribute more towards other
> shared components like TTM and drm/scheduler. The memory model is based on VM
> bind which is similar to the i915 implementation. Likewise the execbuf
> implementation for Xe is very similar to execbuf3 in the i915 [1].
> 
> The code is at a stage where it is already functional and has experimental
> support for multiple platforms starting from Tiger Lake, with initial support
> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> implementation will be released publicly early next year. We also have a suite
> of IGTs for XE that will appear on the IGT list shortly.
> 
> It has been built with the assumption of supporting multiple architectures from
> the get-go, right now with tests running both on X86 and ARM hosts. And we
> intend to continue working on it and improving on it as part of the kernel
> community upstream.
> 
> The new Xe driver leverages a lot from i915 and work on i915 continues as we
> ready Xe for production throughout 2023.
> 
> As for display, the intent is to share the display code with the i915 driver so
> that there is maximum reuse there. Currently this is being done by compiling the
> display code twice, but alternatives to that are under consideration and we want
> to have more discussion on what the best final solution will look like over the
> next few months. Right now, work is ongoing in refactoring the display codebase
> to remove as much as possible any unnecessary dependencies on i915 specific data
> structures there..
> 
> We currently have 2 submission backends, execlists and GuC. The execlist is
> meant mostly for testing and is not fully functional while GuC backend is fully
> functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> required and should be placed in /lib/firmware/xe.

What is the plan going forward for the execlists backend? I think it 
would be preferable not to upstream something semi-functional, and so 
carry technical debt in the brand new code base, from the very start. If 
it is there for Tigerlake, which is the starting platform for Xe, could 
Tigerlake be made GuC-only, for instance?

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-02  7:30         ` [Intel-gfx] " Boris Brezillon
  (?)
@ 2023-01-03 13:02         ` Tvrtko Ursulin
  2023-01-03 14:21             ` Boris Brezillon
  2023-01-05 21:43             ` Matthew Brost
  -1 siblings, 2 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-03 13:02 UTC (permalink / raw)
  To: Boris Brezillon, Matthew Brost; +Cc: intel-gfx, dri-devel


On 02/01/2023 07:30, Boris Brezillon wrote:
> On Fri, 30 Dec 2022 12:55:08 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
>> On Fri, 30 Dec 2022 11:20:42 +0100
>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>
>>> Hello Matthew,
>>>
>>> On Thu, 22 Dec 2022 14:21:11 -0800
>>> Matthew Brost <matthew.brost@intel.com> wrote:
>>>    
>>>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>>>> seems a bit odd but let us explain the reasoning below.
>>>>
>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>> guaranteed to be the same as the completion order, even if targeting the
>>>> same hardware engine. This is because in XE we have a firmware scheduler,
>>>> the GuC, which is allowed to reorder, timeslice, and preempt submissions.
>>>> If using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
>>>> TDR falls apart as the TDR expects submission order == completion order.
>>>> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
>>>> problem.
>>>
>>> Oh, that's interesting. I've been trying to solve the same sort of
>>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
>>> scheduling scheme (you give the FW N streams to execute, and it does
>>> the scheduling between those N command streams, the kernel driver
>>> does timeslice scheduling to update the command streams passed to the
>>> FW). I must admit I gave up on using drm_sched at some point, mostly
>>> because the integration with drm_sched was painful, but also because I
>>> felt trying to bend drm_sched to make it interact with a
>>> timeslice-oriented scheduling model wasn't really future proof. Giving
>>> drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
>>> help for a few things (didn't think it through yet), but I feel it's
>>> coming short on other aspects we have to deal with on Arm GPUs.
>>
>> Ok, so I just had a quick look at the Xe driver and how it
>> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
>> have a better understanding of how you get away with using drm_sched
>> while still controlling how scheduling is really done. Here
>> drm_gpu_scheduler is just a dummy abstract that lets you use the
>> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
>> selection is dumb because there's only one entity ever bound to the
>> scheduler (the one that's part of the xe_guc_engine object which also
>> contains the drm_gpu_scheduler instance). I guess the main issue we'd
>> have on Arm is the fact that the stream doesn't necessarily get
>> scheduled when ->run_job() is called, it can be placed in the runnable
>> queue and be picked later by the kernel-side scheduler when a FW slot
>> gets released. That can probably be sorted out by manually disabling the
>> job timer and re-enabling it when the stream gets picked by the
>> scheduler. But my main concern remains, we're basically abusing
>> drm_sched here.
>>
>> For the Arm driver, that means turning the following sequence
>>
>> 1. wait for job deps
>> 2. queue job to ringbuf and push the stream to the runnable
>>     queue (if it wasn't queued already). Wakeup the timeslice scheduler
>>     to re-evaluate (if the stream is not on a FW slot already)
>> 3. stream gets picked by the timeslice scheduler and sent to the FW for
>>     execution
>>
>> into
>>
>> 1. queue job to entity which takes care of waiting for job deps for
>>     us
>> 2. schedule a drm_sched_main iteration
>> 3. the only available entity is picked, and the first job from this
>>     entity is dequeued. ->run_job() is called: the job is queued to the
>>     ringbuf and the stream is pushed to the runnable queue (if it wasn't
>>     queued already). Wakeup the timeslice scheduler to re-evaluate (if
>>     the stream is not on a FW slot already)
>> 4. stream gets picked by the timeslice scheduler and sent to the FW for
>>     execution
>>
>> That's one extra step we don't really need. To sum-up, yes, all the
>> job/entity tracking might be interesting to share/re-use, but I wonder
>> if we couldn't have that without pulling out the scheduling part of
>> drm_sched, or maybe I'm missing something, and there's something in
>> drm_gpu_scheduler you really need.
> 
> On second thought, that's probably an acceptable overhead (not even
> sure the extra step I was mentioning exists in practice, because dep
> fence signaled state is checked as part of the drm_sched_main
> iteration, so that's basically replacing the worker I schedule to
> check job deps), and I like the idea of being able to re-use drm_sched
> dep-tracking without resorting to invasive changes to the existing
> logic, so I'll probably give it a try.

I agree with the concerns and think that how Xe proposes to integrate 
with drm_sched is a problem, or at least significantly inelegant.

AFAICT it proposes to have 1:1 between *userspace* created contexts (per 
context _and_ engine) and drm_sched. I am not sure avoiding invasive 
changes to the shared code is in the spirit of the overall idea; instead, 
the opportunity should be used to look at ways to refactor/improve 
drm_sched.
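
For concreteness, a minimal sketch of the 1:1 pattern under discussion 
(the structure and names are made up for illustration rather than taken 
from the Xe code, and the drm_sched signatures are approximately the 
v6.1-era ones):

   #include <drm/gpu_scheduler.h>

   struct fw_engine {
           struct drm_gpu_scheduler sched; /* one scheduler instance... */
           struct drm_sched_entity entity; /* ...per userspace entity */
   };

   static int fw_engine_init(struct fw_engine *e,
                             const struct drm_sched_backend_ops *ops,
                             struct device *dev)
   {
           struct drm_gpu_scheduler *list[] = { &e->sched };
           int err;

           /* A dedicated scheduler instance, so the firmware is free to
            * reorder across engines without confusing the TDR. */
           err = drm_sched_init(&e->sched, ops, 64, 0, HZ, NULL, NULL,
                                "fw-engine", dev);
           if (err)
                   return err;

           /* The one and only entity ever bound to this scheduler, so
            * run-queue selection degenerates to a no-op. */
           return drm_sched_entity_init(&e->entity,
                                        DRM_SCHED_PRIORITY_NORMAL,
                                        list, 1, NULL);
   }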

Even on the low level, the idea to replace drm_sched threads with 
workers has a few problems.

To start with, the pattern of:

   while (not_stopped) {
	keep picking jobs
   }

Feels fundamentally in disagreement with workers (while it obviously fits 
perfectly with the current kthread design).

Secondly, it probably demands separate workers (not optional); otherwise 
the behaviour of shared workqueues either has the potential to explode the 
number of kernel threads anyway, or adds latency.
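
To spell out the shape mismatch: a kthread can simply block in its loop, 
while a work item has to do a bounded amount of work and re-arm itself. 
A hand-written sketch, where job_ready() and run_one_job() are 
placeholders rather than drm_sched or Xe functions:

   #include <linux/kthread.h>
   #include <linux/wait.h>
   #include <linux/workqueue.h>

   struct sched_ctx {
           wait_queue_head_t wake_up;
           struct workqueue_struct *wq;
           struct work_struct work;
   };

   bool job_ready(struct sched_ctx *s);   /* placeholder */
   bool run_one_job(struct sched_ctx *s); /* placeholder */

   /* kthread model: the loop owns scheduling and can block at will */
   static int sched_thread(void *data)
   {
           struct sched_ctx *s = data;

           while (!kthread_should_stop()) {
                   wait_event_interruptible(s->wake_up,
                                            kthread_should_stop() ||
                                            job_ready(s));
                   while (job_ready(s))
                           run_one_job(s);
           }
           return 0;
   }

   /* worker model: cannot block waiting for the next job without tying
    * up the workqueue, so every pass must explicitly re-queue itself */
   static void sched_work(struct work_struct *w)
   {
           struct sched_ctx *s = container_of(w, struct sched_ctx, work);

           if (run_one_job(s))
                   queue_work(s->wq, &s->work);
   }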

What would be interesting to learn is whether the option of refactoring 
drm_sched to deal with out of order completion was considered and what 
were the conclusions.

Second option perhaps to split out the drm_sched code into parts which 
would lend themselves more to "pick and choose" of its functionalities. 
Specifically, Xe wants frontend dependency tracking, but not any 
scheduling really (neither least-busy drm_sched selection nor FIFO/RQ 
entity picking), so even having all these data structures in memory is a 
waste.

With the first option, the end result could be a drm_sched per engine 
class (hardware view), which I think fits with the GuC model. Give all 
schedulable contexts (entities) to the GuC and then mostly forget about 
them. Timeslicing and re-ordering and all happens transparently to the 
kernel from that point until completion.

Or with the second option you would build on some smaller refactored 
sub-components of drm_sched, by maybe splitting the dependency tracking 
from scheduling (RR/FIFO entity picking code).

Second option is especially a bit vague and I haven't thought about the 
required mechanics, but it just appeared too obvious that the proposed 
design has a bit too much impedance mismatch.

Oh and as a side note, when I went into the drm_sched code base to 
remind myself how things worked, it was quite easy to find FIXME 
comments which suggest the people working on it are unsure of the 
locking design there and such. So perhaps that all needs cleanup too; I 
mean, it would benefit from the refactoring/improving work brainstormed 
above anyway.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-02 11:42     ` [Intel-gfx] " Jani Nikula
@ 2023-01-03 13:56       ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-03 13:56 UTC (permalink / raw)
  To: Jani Nikula
  Cc: Matthew Brost, intel-gfx, dri-devel, Thomas Zimmermann,
	Alyssa Rosenzweig

Hi,

On Mon, 02 Jan 2023 13:42:46 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:

> On Mon, 02 Jan 2023, Thomas Zimmermann <tzimmermann@suse.de> wrote:
> > Hi
> >
> > Am 22.12.22 um 23:21 schrieb Matthew Brost:  
> >> Hello,
> >> 
> >> This is a submission for Xe, a new driver for Intel GPUs that supports both
> >> integrated and discrete platforms starting with Tiger Lake (first platform with
> >> Intel Xe Architecture). The intention of this new driver is to have a fresh base
> >> to work from that is unencumbered by older platforms, whilst also taking the
> >> opportunity to rearchitect our driver to increase sharing across the drm
> >> subsystem, both leveraging and allowing us to contribute more towards other
> >> shared components like TTM and drm/scheduler. The memory model is based on VM
> >> bind which is similar to the i915 implementation. Likewise the execbuf
> >> implementation for Xe is very similar to execbuf3 in the i915 [1].  
> >
> > After Xe has stabilized, will i915 lose the ability to drive this 
> > hardware (and possibly others)?  I'm specifically thinking of the i915 
> > code that requires TTM. Keeping that dependency within Xe only might 
> > benefit DRM as a whole.  
> 
> There's going to be a number of platforms supported by both drivers, and
> from purely a i915 standpoint dropping any currently supported platforms
> or that dependency from i915 would be a regression.
> 
> >> 
> >> The code is at a stage where it is already functional and has experimental
> >> support for multiple platforms starting from Tiger Lake, with initial support
> >> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> >> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> >> implementation will be released publicly early next year. We also have a suite
> >> of IGTs for XE that will appear on the IGT list shortly.
> >> 
> >> It has been built with the assumption of supporting multiple architectures from
> >> the get-go, right now with tests running both on X86 and ARM hosts. And we
> >> intend to continue working on it and improving on it as part of the kernel
> >> community upstream.
> >> 
> >> The new Xe driver leverages a lot from i915 and work on i915 continues as we
> >> ready Xe for production throughout 2023.
> >> 
> >> As for display, the intent is to share the display code with the i915 driver so
> >> that there is maximum reuse there. Currently this is being done by compiling the
> >> display code twice, but alternatives to that are under consideration and we want
> >> to have more discussion on what the best final solution will look like over the
> >> next few months. Right now, work is ongoing in refactoring the display codebase
> >> to remove as much as possible any unnecessary dependencies on i915 specific data
> >> structures there..  
> >
> > Could both drivers reside in a common parent directory and share 
> > something like a DRM Intel helper module with the common code? This 
> > would fit well with the common design of DRM helpers.  
> 
> I think it's too early to tell.
> 
> For one thing, setting that up would be a lot of up front infrastructure
> work. I'm not sure how to even pull that off when Xe is still
> out-of-tree and i915 development plunges on upstream as ever.
> 
> For another, realistically, the overlap between supported platforms is
> going to end at some point, and eventually new platforms are only going
> to be supported with Xe. That's going to open up new possibilities for
> refactoring also the display code. I think it would be premature to lock
> in to a common directory structure or a common helper module at this
> point.
> 
> I'm not saying no to the idea, and we've contemplated it before, but I
> think there are still too many moving parts to decide to go that way.

FWIW, I actually have the same dilemma with the driver for new Mali GPUs
I'm working on. I initially started making it a sub-driver of the
existing panfrost driver (some HW blocks are similar, like the
IOMMU and a few other things, and some SW abstractions can be shared here
and there, like the GEM allocator logic). But I'm now considering
forking the driver (after Alyssa planted the seed :-)), not only
because I want to start from a clean sheet on the uAPI front
(wouldn't be an issue in your case, because you're talking about
sharing helpers, not the driver frontend), but also because any refactor
to panfrost is a potential source of regression for existing users. So,
I tend to agree with Jani here, trying to share code before things have
settled down is likely to cause pain to both Xe and i915
users+developers.

Best Regards,

Boris

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-03 13:02         ` Tvrtko Ursulin
@ 2023-01-03 14:21             ` Boris Brezillon
  2023-01-05 21:43             ` Matthew Brost
  1 sibling, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-03 14:21 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Matthew Brost, intel-gfx, dri-devel

Hi,

On Tue, 3 Jan 2023 13:02:15 +0000
Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:

> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> On Fri, 30 Dec 2022 11:20:42 +0100
> >> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>  
> >>> Hello Matthew,
> >>>
> >>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>> Matthew Brost <matthew.brost@intel.com> wrote:
> >>>      
> >>>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> >>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> >>>> seems a bit odd but let us explain the reasoning below.
> >>>>
> >>>> 1. In XE the submission order from multiple drm_sched_entity is not
> >>>> guaranteed to be the same as the completion order, even if targeting the
> >>>> same hardware engine. This is because in XE we have a firmware scheduler,
> >>>> the GuC, which is allowed to reorder, timeslice, and preempt submissions.
> >>>> If using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> >>>> TDR falls apart as the TDR expects submission order == completion order.
> >>>> Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> >>>> problem.
> >>>
> >>> Oh, that's interesting. I've been trying to solve the same sort of
> >>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> >>> scheduling scheme (you give the FW N streams to execute, and it does
> >>> the scheduling between those N command streams, the kernel driver
> >>> does timeslice scheduling to update the command streams passed to the
> >>> FW). I must admit I gave up on using drm_sched at some point, mostly
> >>> because the integration with drm_sched was painful, but also because I
> >>> felt trying to bend drm_sched to make it interact with a
> >>> timeslice-oriented scheduling model wasn't really future proof. Giving
> >>> drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
> >>> help for a few things (didn't think it through yet), but I feel it's
> >>> coming short on other aspects we have to deal with on Arm GPUs.  
> >>
> >> Ok, so I just had a quick look at the Xe driver and how it
> >> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> >> have a better understanding of how you get away with using drm_sched
> >> while still controlling how scheduling is really done. Here
> >> drm_gpu_scheduler is just a dummy abstract that lets you use the
> >> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> >> selection is dumb because there's only one entity ever bound to the
> >> scheduler (the one that's part of the xe_guc_engine object which also
> >> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> >> have on Arm is the fact that the stream doesn't necessarily get
> >> scheduled when ->run_job() is called, it can be placed in the runnable
> >> queue and be picked later by the kernel-side scheduler when a FW slot
> >> gets released. That can probably be sorted out by manually disabling the
> >> job timer and re-enabling it when the stream gets picked by the
> >> scheduler. But my main concern remains, we're basically abusing
> >> drm_sched here.
> >>
> >> For the Arm driver, that means turning the following sequence
> >>
> >> 1. wait for job deps
> >> 2. queue job to ringbuf and push the stream to the runnable
> >>     queue (if it wasn't queued already). Wakeup the timeslice scheduler
> >>     to re-evaluate (if the stream is not on a FW slot already)
> >> 3. stream gets picked by the timeslice scheduler and sent to the FW for
> >>     execution
> >>
> >> into
> >>
> >> 1. queue job to entity which takes care of waiting for job deps for
> >>     us
> >> 2. schedule a drm_sched_main iteration
> >> 3. the only available entity is picked, and the first job from this
> >>     entity is dequeued. ->run_job() is called: the job is queued to the
> >>     ringbuf and the stream is pushed to the runnable queue (if it wasn't
> >>     queued already). Wakeup the timeslice scheduler to re-evaluate (if
> >>     the stream is not on a FW slot already)
> >> 4. stream gets picked by the timeslice scheduler and sent to the FW for
> >>     execution
> >>
> >> That's one extra step we don't really need. To sum-up, yes, all the
> >> job/entity tracking might be interesting to share/re-use, but I wonder
> >> if we couldn't have that without pulling out the scheduling part of
> >> drm_sched, or maybe I'm missing something, and there's something in
> >> drm_gpu_scheduler you really need.  
> > 
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use drm_sched
> > dep-tracking without resorting to invasive changes to the existing
> > logic, so I'll probably give it a try.  
> 
> I agree with the concerns and think that how Xe proposes to integrate 
> with drm_sched is a problem, or at least significantly inelegant.

Okay, so it looks like I'm not the only one to be bothered by the way Xe
tries to bypass the drm_sched limitations :-).

> 
> AFAICT it proposes to have 1:1 between *userspace* created contexts (per 
> context _and_ engine) and drm_sched. I am not sure avoiding invasive 
> changes to the shared code is in the spirit of the overall idea; instead, 
> the opportunity should be used to look at ways to refactor/improve 
> drm_sched.
> 
> Even on the low level, the idea to replace drm_sched threads with 
> workers has a few problems.
> 
> To start with, the pattern of:
> 
>    while (not_stopped) {
> 	keep picking jobs
>    }
> 
> Feels fundamentally in disagreement with workers (while it obviously fits 
> perfectly with the current kthread design).
> 
> Secondly, it probably demands separate workers (not optional); otherwise 
> the behaviour of shared workqueues either has the potential to explode the 
> number of kernel threads anyway, or adds latency.
> 
> What would be interesting to learn is whether the option of refactoring 
> drm_sched to deal with out of order completion was considered and what 
> were the conclusions.

I might be wrong, but I don't think the fundamental issue here is the
out-of-order completion thing that's mentioned in the commit message.
It just feels like this is a symptom of the impedance mismatch we
have between priority+FIFO-based job scheduling and
priority+timeslice-based queue scheduling (a queue being represented by
a drm_sched_entity in drm_sched).

> 
> Second option perhaps to split out the drm_sched code into parts which 
> would lend themselves more to "pick and choose" of its functionalities. 
> Specifically, Xe wants frontend dependency tracking, but not any 
> scheduling really (neither least-busy drm_sched selection nor FIFO/RQ 
> entity picking), so even having all these data structures in memory is a 
> waste.

Same thing for the panfrost+CSF driver I was mentioning in my previous
emails.

> 
> With the first option, the end result could be a drm_sched per engine 
> class (hardware view), which I think fits with the GuC model. Give all 
> schedulable contexts (entities) to the GuC and then mostly forget about 
> them. Timeslicing and re-ordering and all happens transparently to the 
> kernel from that point until completion.

Yep, that would work. I guess it would mean creating an intermediate
abstraction/interface to schedule entities and then implementing this
interface for the simple HW-engine+job-scheduling case, so that
existing drm_sched users don't see a difference, and new drivers that
need to interface with FW-assisted schedulers can implement the
higher-level entity scheduling interface. Don't know what this
interface would look like though.
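
Purely as a strawman (none of this exists in drm_sched today), the split 
could be an ops table keyed on the entity rather than on the job:

   /* Hypothetical interface: how a backend could take over entity
    * scheduling from drm_sched. */
   struct drm_sched_entity_ops {
           /* entity has runnable jobs: hand it to the scheduler (a
            * kernel run-queue, or a FW slot on CSF-style hardware) */
           void (*entity_runnable)(struct drm_sched_entity *entity);
           /* entity went idle or is being torn down: reclaim the slot */
           void (*entity_idle)(struct drm_sched_entity *entity);
   };

The default implementation would keep today's run-queue + run_job() 
path, while FW-assisted drivers would implement entity_runnable() by 
updating the command-stream/slot state and letting the firmware do the 
actual picking.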

> 
> Or with the second option you would build on some smaller refactored 
> sub-components of drm_sched, by maybe splitting the dependency tracking 
> from scheduling (RR/FIFO entity picking code).

What I've done so far is duplicate the dep-tracking logic in the
driver. It's not that much code, but it would be nice to not have to
duplicate it in the first place...
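
For the record, the duplicated logic is essentially an xarray of fences 
drained before submission, much like what drm_sched keeps per job. A 
simplified sketch (my_job is illustrative, and the wait is synchronous 
for brevity; real code would plug the fences into the submission path 
instead):

   #include <linux/dma-fence.h>
   #include <linux/xarray.h>

   struct my_job {
           /* must be xa_init_flags(&deps, XA_FLAGS_ALLOC)'d at creation */
           struct xarray deps;
   };

   static int my_job_add_dep(struct my_job *job, struct dma_fence *fence)
   {
           u32 id;

           /* take a reference, dropped once the dependency is consumed */
           return xa_alloc(&job->deps, &id, dma_fence_get(fence),
                           xa_limit_32b, GFP_KERNEL);
   }

   static int my_job_wait_deps(struct my_job *job)
   {
           struct dma_fence *fence;
           unsigned long idx;
           long err;

           xa_for_each(&job->deps, idx, fence) {
                   err = dma_fence_wait(fence, true);
                   if (err)
                           return err;
                   dma_fence_put(fence);
                   xa_erase(&job->deps, idx);
           }
           return 0;
   }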

Regards,

Boris


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-03 14:21             ` Boris Brezillon
  0 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-03 14:21 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

Hi,

On Tue, 3 Jan 2023 13:02:15 +0000
Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:

> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> On Fri, 30 Dec 2022 11:20:42 +0100
> >> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>  
> >>> Hello Matthew,
> >>>
> >>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>> Matthew Brost <matthew.brost@intel.com> wrote:
> >>>      
> >>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> >>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> >>>> seems a bit odd but let us explain the reasoning below.
> >>>>
> >>>> 1. In XE the submission order from multiple drm_sched_entity is not
> >>>> guaranteed to be the same completion even if targeting the same hardware
> >>>> engine. This is because in XE we have a firmware scheduler, the GuC,
> >>>> which allowed to reorder, timeslice, and preempt submissions. If a using
> >>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> >>>> apart as the TDR expects submission order == completion order. Using a
> >>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.  
> >>>
> >>> Oh, that's interesting. I've been trying to solve the same sort of
> >>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> >>> scheduling scheme (you give the FW N streams to execute, and it does
> >>> the scheduling between those N command streams, the kernel driver
> >>> does timeslice scheduling to update the command streams passed to the
> >>> FW). I must admit I gave up on using drm_sched at some point, mostly
> >>> because the integration with drm_sched was painful, but also because I
> >>> felt trying to bend drm_sched to make it interact with a
> >>> timeslice-oriented scheduling model wasn't really future proof. Giving
> >>> drm_sched_entity exlusive access to a drm_gpu_scheduler probably might
> >>> help for a few things (didn't think it through yet), but I feel it's
> >>> coming short on other aspects we have to deal with on Arm GPUs.  
> >>
> >> Ok, so I just had a quick look at the Xe driver and how it
> >> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> >> have a better understanding of how you get away with using drm_sched
> >> while still controlling how scheduling is really done. Here
> >> drm_gpu_scheduler is just a dummy abstract that let's you use the
> >> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> >> selection is dumb because there's only one entity ever bound to the
> >> scheduler (the one that's part of the xe_guc_engine object which also
> >> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> >> have on Arm is the fact that the stream doesn't necessarily get
> >> scheduled when ->run_job() is called, it can be placed in the runnable
> >> queue and be picked later by the kernel-side scheduler when a FW slot
> >> gets released. That can probably be sorted out by manually disabling the
> >> job timer and re-enabling it when the stream gets picked by the
> >> scheduler. But my main concern remains, we're basically abusing
> >> drm_sched here.
> >>
> >> For the Arm driver, that means turning the following sequence
> >>
> >> 1. wait for job deps
> >> 2. queue job to ringbuf and push the stream to the runnable
> >>     queue (if it wasn't queued already). Wakeup the timeslice scheduler
> >>     to re-evaluate (if the stream is not on a FW slot already)
> >> 3. stream gets picked by the timeslice scheduler and sent to the FW for
> >>     execution
> >>
> >> into
> >>
> >> 1. queue job to entity which takes care of waiting for job deps for
> >>     us
> >> 2. schedule a drm_sched_main iteration
> >> 3. the only available entity is picked, and the first job from this
> >>     entity is dequeued. ->run_job() is called: the job is queued to the
> >>     ringbuf and the stream is pushed to the runnable queue (if it wasn't
> >>     queued already). Wakeup the timeslice scheduler to re-evaluate (if
> >>     the stream is not on a FW slot already)
> >> 4. stream gets picked by the timeslice scheduler and sent to the FW for
> >>     execution
> >>
> >> That's one extra step we don't really need. To sum-up, yes, all the
> >> job/entity tracking might be interesting to share/re-use, but I wonder
> >> if we couldn't have that without pulling out the scheduling part of
> >> drm_sched, or maybe I'm missing something, and there's something in
> >> drm_gpu_scheduler you really need.  
> > 
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use drm_sched
> > dep-tracking without resorting to invasive changes to the existing
> > logic, so I'll probably give it a try.  
> 
> I agree with the concerns and think that how Xe proposes to integrate 
> with drm_sched is a problem, or at least significantly inelegant.

Okay, so it looks like I'm not the only one to be bothered by the way Xe
tries to bypass the drm_sched limitations :-).

> 
> AFAICT it proposes to have 1:1 between *userspace* created contexts (per 
> context _and_ engine) and drm_sched. I am not sure avoiding invasive 
> changes to the shared code is in the spirit of the overall idea; instead, 
> the opportunity should be used to look at ways to refactor/improve 
> drm_sched.
> 
> Even on the low level, the idea to replace drm_sched threads with 
> workers has a few problems.
> 
> To start with, the pattern of:
> 
>    while (not_stopped) {
> 	keep picking jobs
>    }
> 
> Feels fundamentally in disagreement with workers (while obviously fits 
> perfectly with the current kthread design).
> 
> Secondly, it probably demands separate workers (not optional), otherwise 
> behaviour of shared workqueues has either the potential to explode the 
> number of kernel threads anyway, or to add latency.
> 
> What would be interesting to learn is whether the option of refactoring 
> drm_sched to deal with out of order completion was considered and what 
> were the conclusions.

I might be wrong, but I don't think the fundamental issue here is the
out-of-order completion thing that's mentioned in the commit message.
It just feels like this is a symptom of the impedance mismatch we
have between priority+FIFO-based job scheduling and
priority+timeslice-based queue scheduling (a queue being represented by
a drm_sched_entity in drm_sched).

> 
> Second option perhaps to split out the drm_sched code into parts which 
> would lend themselves more to "pick and choose" of its functionalities. 
> Specifically, Xe wants frontend dependency tracking, but not any 
> scheduling really (neither least-busy drm_sched selection nor FIFO/RQ entity 
> picking), so even having all these data structures in memory is a waste.

Same thing for the panfrost+CSF driver I was mentioning in my previous
emails.

> 
> With the first option then the end result could be drm_sched per engine 
> class (hardware view), which I think fits with the GuC model. Give all 
> schedulable contexts (entities) to the GuC and then mostly forget about 
> them. Timeslicing and re-ordering and all happens transparently to the 
> kernel from that point until completion.

Yep, that would work. I guess it would mean creating an intermediate
abstraction/interface to schedule entities and then implementing this
interface for the simple HW-engine+job-scheduling case, so that
existing drm_sched users don't see a difference, and new drivers that
need to interface with FW-assisted schedulers can implement the
higher-level entity scheduling interface. Don't know what this
interface would look like though.
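
For illustration, a rough sketch of what such a split might look like; every
name below is hypothetical (nothing like this exists in drm_sched today), it
is only meant to make the idea concrete:

  /* Hypothetical sketch: pull entity scheduling behind an ops table so a
   * FW-assisted driver can supply its own policy while reusing
   * drm_sched's job/dependency tracking. */
  struct drm_sched_policy_ops {
          /* Entity gained runnable jobs: consider it for execution,
           * e.g. push it to a runnable queue / FW slot. */
          void (*entity_runnable)(struct drm_sched_entity *entity);

          /* Entity ran out of runnable jobs or was preempted. */
          void (*entity_idle)(struct drm_sched_entity *entity);
  };

The default implementation of these hooks would keep today's RR/FIFO
run-queue behaviour, so existing drm_sched users would see no difference.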

> 
> Or with the second option you would build on some smaller refactored 
> sub-components of drm_sched, by maybe splitting the dependency tracking 
> from scheduling (RR/FIFO entity picking code).

What I've done so far is duplicate the dep-tracking logic in the
driver. It's not that much code, but it would be nice to not have to
duplicate it in the first place...
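
For reference, the dep-tracking being duplicated boils down to a handful of
drm_sched calls in the submit path. A minimal sketch, assuming the drm_sched
API as of this thread and with error unwinding trimmed:

  #include <drm/gpu_scheduler.h>

  /* The scheduler defers ->run_job() until every dependency fence
   * added here has signaled. */
  static int submit_job(struct drm_sched_job *job,
                        struct drm_sched_entity *entity,
                        struct dma_fence *dep, void *owner)
  {
          int ret = drm_sched_job_init(job, entity, owner);

          if (ret)
                  return ret;
          drm_sched_job_arm(job);

          /* add_dependency takes over the fence reference */
          ret = drm_sched_job_add_dependency(job, dma_fence_get(dep));
          if (ret)
                  return ret;

          drm_sched_entity_push_job(job);
          return 0;
  }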

Regards,

Boris


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-03 13:56       ` [Intel-gfx] " Boris Brezillon
@ 2023-01-03 14:41         ` Alyssa Rosenzweig
  -1 siblings, 0 replies; 161+ messages in thread
From: Alyssa Rosenzweig @ 2023-01-03 14:41 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, intel-gfx, dri-devel, Thomas Zimmermann,
	Alyssa Rosenzweig

> > For one thing, setting that up would be a lot of up front infrastructure
> > work. I'm not sure how to even pull that off when Xe is still
> > out-of-tree and i915 development plunges on upstream as ever.
> > 
> > For another, realistically, the overlap between supported platforms is
> > going to end at some point, and eventually new platforms are only going
> > to be supported with Xe. That's going to open up new possibilities for
> > refactoring also the display code. I think it would be premature to lock
> > in to a common directory structure or a common helper module at this
> > point.
> > 
> > I'm not saying no to the idea, and we've contemplated it before, but I
> > think there are still too many moving parts to decide to go that way.
> 
> FWIW, I actually have the same dilemma with the driver for new Mali GPUs
> I'm working on. I initially started making it a sub-driver of the
> existing panfrost driver (some HW blocks are similar, like the
> IOMMU and a few other things, and some SW abstractions can be shared here
> and there, like the GEM allocator logic). But I'm now considering
> forking the driver (after Alyssa planted the seed :-)), not only
> because I want to start from a clean sheet on the uAPI front
> (wouldn't be an issue in your case, because you're talking about
> sharing helpers, not the driver frontend), but also because any refactor
> to panfrost is a potential source of regression for existing users. So,
> I tend to agree with Jani here, trying to share code before things have
> settled down is likely to cause pain to both Xe and i915
> users+developers.

++

I pretend to have never written a kernel driver, so will not comment
there. But Boris and I were previously bitten trying to share code between
our GL and VK drivers, before VK settled down, causing pain for both. I
don't want a kernelside repeat of that (for either Mali or Intel).

I tend to think that, if you're tempted to share a driver frontend
without the backend, that's a sign that there's too much boilerplate for
the frontend and maybe there needs to be more helpers somewhere. For Xe,
that doesn't apply since the hw overlaps between the drivers, but for
Mali, there really is more different than similar and there's an
obvious, acute break between "old Mali" and "new Mali". The shared
"instantiate a DRM driver boilerplate" is pretty trivial, and the MMU
code is as simple as it gets...

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-02  7:30         ` [Intel-gfx] " Boris Brezillon
@ 2023-01-05 19:40           ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-05 19:40 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: intel-gfx, dri-devel

On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> On Fri, 30 Dec 2022 12:55:08 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > On Fri, 30 Dec 2022 11:20:42 +0100
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > 
> > > Hello Matthew,
> > > 
> > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > Matthew Brost <matthew.brost@intel.com> wrote:
> > >   
> > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > seems a bit odd but let us explain the reasoning below.
> > > > 
> > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > guaranteed to match the completion order even when targeting the same hardware
> > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > which is allowed to reorder, timeslice, and preempt submissions. If using a
> > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > apart as the TDR expects submission order == completion order. Using a
> > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > 
> > > Oh, that's interesting. I've been trying to solve the same sort of
> > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > scheduling scheme (you give the FW N streams to execute, and it does
> > > the scheduling between those N command streams, the kernel driver
> > > does timeslice scheduling to update the command streams passed to the
> > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > because the integration with drm_sched was painful, but also because I
> > > felt trying to bend drm_sched to make it interact with a
> > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > help for a few things (didn't think it through yet), but I feel it's
> > > coming short on other aspects we have to deal with on Arm GPUs.  
> > 
> > Ok, so I just had a quick look at the Xe driver and how it
> > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > have a better understanding of how you get away with using drm_sched
> > while still controlling how scheduling is really done. Here
> > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > drm_sched job queuing/dep/tracking mechanism. The whole run-queue

You nailed it here, we use the DRM scheduler for queuing jobs,
dependency tracking and releasing jobs to be scheduled when dependencies
are met, and lastly as a tracking mechanism for in-flight jobs that need to
be cleaned up if an error occurs. It doesn't actually do any scheduling
aside from the most basic level of not overflowing the submission ring
buffer. In this sense, a 1 to 1 relationship between entity and
scheduler fits quite well.
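
(For anyone following along, a rough sketch of that 1:1 pairing; the struct
layout and function names are illustrative rather than the actual Xe code,
and the drm_sched calls assume the API as of this thread:)

  /* One scheduler and one entity per userspace engine: run-queue
   * selection becomes trivial and real scheduling is left to the GuC. */
  struct xe_guc_engine {
          struct drm_gpu_scheduler sched;
          struct drm_sched_entity entity;
          /* ... */
  };

  static int guc_engine_sched_init(struct xe_guc_engine *e,
                                   const struct drm_sched_backend_ops *ops,
                                   u32 hw_submission)
  {
          struct drm_gpu_scheduler *list[] = { &e->sched };
          int ret;

          ret = drm_sched_init(&e->sched, ops, hw_submission, 0,
                               MAX_SCHEDULE_TIMEOUT, NULL, NULL,
                               "xe-engine", NULL);
          if (ret)
                  return ret;

          /* The one and only entity this scheduler ever serves. */
          return drm_sched_entity_init(&e->entity, DRM_SCHED_PRIORITY_NORMAL,
                                       list, 1, NULL);
  }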

FWIW this design was also run by AMD quite a while ago (off the list)
and we didn't get any serious push back. Things can change however...

> > selection is dumb because there's only one entity ever bound to the
> > scheduler (the one that's part of the xe_guc_engine object which also
> > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > have on Arm is the fact that the stream doesn't necessarily get
> > scheduled when ->run_job() is called, it can be placed in the runnable
> > queue and be picked later by the kernel-side scheduler when a FW slot
> > gets released. That can probably be sorted out by manually disabling the
> > job timer and re-enabling it when the stream gets picked by the
> > scheduler. But my main concern remains, we're basically abusing
> > drm_sched here.
> > 

That's a matter of opinion; yes, we are using it slightly differently
than anyone else, but IMO the fact that the DRM scheduler works for the Xe use
case with barely any changes is a testament to its design.

> > For the Arm driver, that means turning the following sequence
> > 
> > 1. wait for job deps
> > 2. queue job to ringbuf and push the stream to the runnable
> >    queue (if it wasn't queued already). Wakeup the timeslice scheduler
> >    to re-evaluate (if the stream is not on a FW slot already)
> > 3. stream gets picked by the timeslice scheduler and sent to the FW for
> >    execution
> > 
> > into
> > 
> > 1. queue job to entity which takes care of waiting for job deps for
> >    us
> > 2. schedule a drm_sched_main iteration
> > 3. the only available entity is picked, and the first job from this
> >    entity is dequeued. ->run_job() is called: the job is queued to the
> >    ringbuf and the stream is pushed to the runnable queue (if it wasn't
> >    queued already). Wakeup the timeslice scheduler to re-evaluate (if
> >    the stream is not on a FW slot already)
> > 4. stream gets picked by the timeslice scheduler and sent to the FW for
> >    execution
> >

Yes, an extra step but you get to use all the nice DRM scheduler
functions for dependency tracking. Also in our case we really want a
single entry point in the backend (the work queue). See also [1], which
helped us seal a bunch of races we had in the i915 by using a single
entry point. All these benefits are why we landed on the DRM scheduler
and it has worked out rather nicely compared to the i915.

[1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
 
> > That's one extra step we don't really need. To sum-up, yes, all the
> > job/entity tracking might be interesting to share/re-use, but I wonder
> > if we couldn't have that without pulling out the scheduling part of
> > drm_sched, or maybe I'm missing something, and there's something in
> > drm_gpu_scheduler you really need.
> 
> On second thought, that's probably an acceptable overhead (not even
> sure the extra step I was mentioning exists in practice, because dep
> fence signaled state is checked as part of the drm_sched_main
> iteration, so that's basically replacing the worker I schedule to
> check job deps), and I like the idea of being able to re-use drm_sched
> dep-tracking without resorting to invasive changes to the existing
> logic, so I'll probably give it a try.

Let me know how this goes.

Matt

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-03 12:21 ` Tvrtko Ursulin
@ 2023-01-05 21:27   ` Matthew Brost
  2023-01-12  9:54       ` Lucas De Marchi
  0 siblings, 1 reply; 161+ messages in thread
From: Matthew Brost @ 2023-01-05 21:27 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 03, 2023 at 12:21:08PM +0000, Tvrtko Ursulin wrote:
> 
> On 22/12/2022 22:21, Matthew Brost wrote:
> > Hello,
> > 
> > This is a submission for Xe, a new driver for Intel GPUs that supports both
> > integrated and discrete platforms starting with Tiger Lake (first platform with
> > Intel Xe Architecture). The intention of this new driver is to have a fresh base
> > to work from that is unencumbered by older platforms, whilst also taking the
> > opportunity to rearchitect our driver to increase sharing across the drm
> > subsystem, both leveraging and allowing us to contribute more towards other
> > shared components like TTM and drm/scheduler. The memory model is based on VM
> > bind which is similar to the i915 implementation. Likewise the execbuf
> > implementation for Xe is very similar to execbuf3 in the i915 [1].
> > 
> > The code is at a stage where it is already functional and has experimental
> > support for multiple platforms starting from Tiger Lake, with initial support
> > implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> > as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> > implementation will be released publicly early next year. We also have a suite
> > of IGTs for XE that will appear on the IGT list shortly.
> > 
> > It has been built with the assumption of supporting multiple architectures from
> > the get-go, right now with tests running both on X86 and ARM hosts. And we
> > intend to continue working on it and improving on it as part of the kernel
> > community upstream.
> > 
> > The new Xe driver leverages a lot from i915 and work on i915 continues as we
> > ready Xe for production throughout 2023.
> > 
> > As for display, the intent is to share the display code with the i915 driver so
> > that there is maximum reuse there. Currently this is being done by compiling the
> > display code twice, but alternatives to that are under consideration and we want
> > to have more discussion on what the best final solution will look like over the
> > next few months. Right now, work is ongoing in refactoring the display codebase
> > to remove as much as possible any unnecessary dependencies on i915 specific data
> > structures there..
> > 
> > We currently have 2 submission backends, execlists and GuC. The execlist is
> > meant mostly for testing and is not fully functional while GuC backend is fully
> > functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> > required and should be placed in /lib/firmware/xe.
> 
> What is the plan going forward for the execlists backend? I think it would
> be preferable not to upstream something semi-functional and thereby carry
> technical debt in the brand new code base from the very start. If it is for
> Tigerlake, which is the starting platform for Xe, could Tigerlake be made
> GuC-only, for instance?
> 

A little background here. In the original PoC written by Jason and Dave,
the execlist backend was the only one present and it was in a semi-working
state. As soon as I and a few others started working on Xe we went
all in on the GuC backend. We left the execlist backend basically in
the state it was in. We left it in place for 2 reasons.

1. Having 2 backends from the start ensured we layered our code
correctly. The layering was a complete disaster in the i915 so we really
wanted to avoid that.
2. The thought was it might be needed for early product bring-up one
day.

As I think about this a bit more, we will likely just delete the execlist
backend before merging this upstream and perhaps just carry 1 large patch
internally with this implementation that we can use as needed. Final
decision TBD though.

Matt

> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-03 13:02         ` Tvrtko Ursulin
@ 2023-01-05 21:43             ` Matthew Brost
  2023-01-05 21:43             ` Matthew Brost
  1 sibling, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-05 21:43 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, Boris Brezillon, dri-devel

On Tue, Jan 03, 2023 at 01:02:15PM +0000, Tvrtko Ursulin wrote:
> 
> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > 
> > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > 
> > > > Hello Matthew,
> > > > 
> > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > seems a bit odd but let us explain the reasoning below.
> > > > > 
> > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > guaranteed to match the completion order even when targeting the same hardware
> > > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > > which is allowed to reorder, timeslice, and preempt submissions. If using a
> > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > > apart as the TDR expects submission order == completion order. Using a
> > > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > > 
> > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > the scheduling between those N command streams, the kernel driver
> > > > does timeslice scheduling to update the command streams passed to the
> > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > because the integration with drm_sched was painful, but also because I
> > > > felt trying to bend drm_sched to make it interact with a
> > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > > help for a few things (didn't think it through yet), but I feel it's
> > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > 
> > > Ok, so I just had a quick look at the Xe driver and how it
> > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > have a better understanding of how you get away with using drm_sched
> > > while still controlling how scheduling is really done. Here
> > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > selection is dumb because there's only one entity ever bound to the
> > > scheduler (the one that's part of the xe_guc_engine object which also
> > > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > > have on Arm is the fact that the stream doesn't necessarily get
> > > scheduled when ->run_job() is called, it can be placed in the runnable
> > > queue and be picked later by the kernel-side scheduler when a FW slot
> > > gets released. That can probably be sorted out by manually disabling the
> > > job timer and re-enabling it when the stream gets picked by the
> > > scheduler. But my main concern remains, we're basically abusing
> > > drm_sched here.
> > > 
> > > For the Arm driver, that means turning the following sequence
> > > 
> > > 1. wait for job deps
> > > 2. queue job to ringbuf and push the stream to the runnable
> > >     queue (if it wasn't queued already). Wakeup the timeslice scheduler
> > >     to re-evaluate (if the stream is not on a FW slot already)
> > > 3. stream gets picked by the timeslice scheduler and sent to the FW for
> > >     execution
> > > 
> > > into
> > > 
> > > 1. queue job to entity which takes care of waiting for job deps for
> > >     us
> > > 2. schedule a drm_sched_main iteration
> > > 3. the only available entity is picked, and the first job from this
> > >     entity is dequeued. ->run_job() is called: the job is queued to the
> > >     ringbuf and the stream is pushed to the runnable queue (if it wasn't
> > >     queued already). Wakeup the timeslice scheduler to re-evaluate (if
> > >     the stream is not on a FW slot already)
> > > 4. stream gets picked by the timeslice scheduler and sent to the FW for
> > >     execution
> > > 
> > > That's one extra step we don't really need. To sum-up, yes, all the
> > > job/entity tracking might be interesting to share/re-use, but I wonder
> > > if we couldn't have that without pulling out the scheduling part of
> > > drm_sched, or maybe I'm missing something, and there's something in
> > > drm_gpu_scheduler you really need.
> > 
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use drm_sched
> > dep-tracking without resorting to invasive changes to the existing
> > logic, so I'll probably give it a try.
> 
> I agree with the concerns and think that how Xe proposes to integrate with
> drm_sched is a problem, or at least significantly inelegant.
>

Inelegant is a matter of opinion; I actually rather like this solution.

BTW this isn't my design; rather, it was Jason's idea.
 
> AFAICT it proposes to have 1:1 between *userspace* created contexts (per
> context _and_ engine) and drm_sched. I am not sure avoiding invasive changes
> to the shared code is in the spirit of the overall idea; instead, the
> opportunity should be used to look at ways to refactor/improve drm_sched.
>

Yes, it is 1:1 between *userspace* engines and drm_sched.

I'm not really prepared to make large changes to the DRM scheduler at the
moment for Xe as they are not really required, nor does Boris seem to think
they will be required for his work either. I am interested to see what Boris
comes up with.

> Even on the low level, the idea to replace drm_sched threads with workers
> has a few problems.
> 
> To start with, the pattern of:
> 
>   while (not_stopped) {
> 	keep picking jobs
>   }
> 
> Feels fundamentally in disagreement with workers (while obviously fits
> perfectly with the current kthread design).
>

The while loop breaks and the worker exits if no jobs are ready.
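
(To make that concrete, a sketch of the worker shape being described; the
work_run field and pick_ready_job() helper are hypothetical, this is not the
actual patch:)

  static void drm_sched_main_work(struct work_struct *w)
  {
          struct drm_gpu_scheduler *sched =
                  container_of(w, struct drm_gpu_scheduler, work_run);
          struct drm_sched_job *job;

          /* Drain whatever is ready, then return; fence bookkeeping
           * trimmed for brevity. */
          while ((job = pick_ready_job(sched)))
                  sched->ops->run_job(job);

          /* Whoever makes a new job ready (a push, a dependency
           * signaling, a completion) queues this work item again. */
  }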

> Secondly, it probably demands separate workers (not optional), otherwise
> behaviour of shared workqueues has either the potential to explode the
> number of kernel threads anyway, or to add latency.
> 

Right now the system_unbound_wq is used, which does have a limit on the
number of threads, right? I do have a FIXME to allow a worker to be
passed in, similar to the TDR.
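
(A sketch of what that FIXME could amount to, with a hypothetical submit_wq
field; somewhere in scheduler init, in place of using system_unbound_wq:)

  /* A dedicated ordered queue isolates this scheduler's worker from
   * unrelated work without costing a kthread per engine. */
  sched->submit_wq = alloc_ordered_workqueue("xe-sched", 0);
  if (!sched->submit_wq)
          return -ENOMEM;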

WRT latency, the 1:1 ratio could actually have lower latency as 2 GPU
schedulers can be pushing jobs into the backend / cleaning up jobs in
parallel.

> What would be interesting to learn is whether the option of refactoring
> drm_sched to deal with out of order completion was considered and what were
> the conclusions.
>

I coded this up a while back when trying to convert the i915 to the DRM
scheduler; it isn't all that hard either. The free flow control on the
ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is really what
sold me on this design.
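
(In code terms that flow control is just the hw_submission argument to
drm_sched_init(); the numbers below are made up:)

  /* 16K ring, 256-byte max job: the scheduler's existing in-flight
   * job limit then caps the ring at 64 jobs, i.e. free flow control. */
  drm_sched_init(sched, ops, SZ_16K / 256 /* = 64 */, 0,
                 MAX_SCHEDULE_TIMEOUT, NULL, NULL, "engine", NULL);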

> Second option perhaps to split out the drm_sched code into parts which would
> lend themselves more to "pick and choose" of its functionalities.
> Specifically, Xe wants frontend dependency tracking, but not any scheduling
> really (neither least-busy drm_sched selection nor FIFO/RQ entity picking), so
> even having all these data structures in memory is a waste.
> 

I don't think "we are wasting memory" is a very good argument for
making intrusive changes to the DRM scheduler.

> With the first option then the end result could be drm_sched per engine
> class (hardware view), which I think fits with the GuC model. Give all
> schedulable contexts (entities) to the GuC and then mostly forget about
> them. Timeslicing and re-ordering and all happens transparently to the
> kernel from that point until completion.
> 

Out-of-order problem still exists here.

> Or with the second option you would build on some smaller refactored
> sub-components of drm_sched, by maybe splitting the dependency tracking from
> scheduling (RR/FIFO entity picking code).
> 
> Second option is especially a bit vague and I haven't thought about the
> required mechanics, but it just appeared too obvious the proposed design has
> a bit too much impedance mismatch.
>

IMO the ROI on this is low and again, let's see what Boris comes up with.

Matt

> Oh and as a side note, when I went into the drm_sched code base to remind
> myself how things worked, it is quite easy to find some FIXME comments which
> suggest people working on it are unsure of the locking design there and such. So
> perhaps that all needs cleanup too; I mean it would benefit from the
> refactoring/improving work brainstormed above anyway.
> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-05 21:43             ` Matthew Brost
  (?)
@ 2023-01-06 23:52             ` Matthew Brost
  2023-01-09 13:46               ` Tvrtko Ursulin
  -1 siblings, 1 reply; 161+ messages in thread
From: Matthew Brost @ 2023-01-06 23:52 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Thu, Jan 05, 2023 at 09:43:41PM +0000, Matthew Brost wrote:
> On Tue, Jan 03, 2023 at 01:02:15PM +0000, Tvrtko Ursulin wrote:
> > 
> > On 02/01/2023 07:30, Boris Brezillon wrote:
> > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > 
> > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > 
> > > > > Hello Matthew,
> > > > > 
> > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > 
> > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > guaranteed to match the completion order even when targeting the same hardware
> > > > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > > > which is allowed to reorder, timeslice, and preempt submissions. If using a
> > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > > > apart as the TDR expects submission order == completion order. Using a
> > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > > > 
> > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > the scheduling between those N command streams, the kernel driver
> > > > > does timeslice scheduling to update the command streams passed to the
> > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > because the integration with drm_sched was painful, but also because I
> > > > > felt trying to bend drm_sched to make it interact with a
> > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > 
> > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > have a better understanding of how you get away with using drm_sched
> > > > while still controlling how scheduling is really done. Here
> > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > selection is dumb because there's only one entity ever bound to the
> > > > scheduler (the one that's part of the xe_guc_engine object which also
> > > > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > > > have on Arm is the fact that the stream doesn't necessarily get
> > > > scheduled when ->run_job() is called, it can be placed in the runnable
> > > > queue and be picked later by the kernel-side scheduler when a FW slot
> > > > gets released. That can probably be sorted out by manually disabling the
> > > > job timer and re-enabling it when the stream gets picked by the
> > > > scheduler. But my main concern remains, we're basically abusing
> > > > drm_sched here.
> > > > 
> > > > For the Arm driver, that means turning the following sequence
> > > > 
> > > > 1. wait for job deps
> > > > 2. queue job to ringbuf and push the stream to the runnable
> > > >     queue (if it wasn't queued already). Wakeup the timeslice scheduler
> > > >     to re-evaluate (if the stream is not on a FW slot already)
> > > > 3. stream gets picked by the timeslice scheduler and sent to the FW for
> > > >     execution
> > > > 
> > > > into
> > > > 
> > > > 1. queue job to entity which takes care of waiting for job deps for
> > > >     us
> > > > 2. schedule a drm_sched_main iteration
> > > > 3. the only available entity is picked, and the first job from this
> > > >     entity is dequeued. ->run_job() is called: the job is queued to the
> > > >     ringbuf and the stream is pushed to the runnable queue (if it wasn't
> > > >     queued already). Wakeup the timeslice scheduler to re-evaluate (if
> > > >     the stream is not on a FW slot already)
> > > > 4. stream gets picked by the timeslice scheduler and sent to the FW for
> > > >     execution
> > > > 
> > > > That's one extra step we don't really need. To sum-up, yes, all the
> > > > job/entity tracking might be interesting to share/re-use, but I wonder
> > > > if we couldn't have that without pulling out the scheduling part of
> > > > drm_sched, or maybe I'm missing something, and there's something in
> > > > drm_gpu_scheduler you really need.
> > > 
> > > On second thought, that's probably an acceptable overhead (not even
> > > sure the extra step I was mentioning exists in practice, because dep
> > > fence signaled state is checked as part of the drm_sched_main
> > > iteration, so that's basically replacing the worker I schedule to
> > > check job deps), and I like the idea of being able to re-use drm_sched
> > > dep-tracking without resorting to invasive changes to the existing
> > > logic, so I'll probably give it a try.
> > 
> > I agree with the concerns and think that how Xe proposes to integrate with
> > drm_sched is a problem, or at least significantly inelegant.
> >
> 
> Inelegant is a matter of opinion; I actually rather like this solution.
> 
> BTW this isn't my design; rather, it was Jason's idea.
>  
> > AFAICT it proposes to have 1:1 between *userspace* created contexts (per
> > context _and_ engine) and drm_sched. I am not sure avoiding invasive changes
> > to the shared code is in the spirit of the overall idea; instead, the
> > opportunity should be used to look at ways to refactor/improve drm_sched.
> >
> 
> Yes, it is 1:1 between *userspace* engines and drm_sched.
> 
> I'm not really prepared to make large changes to the DRM scheduler at the
> moment for Xe as they are not really required, nor does Boris seem to think
> they will be required for his work either. I am interested to see what Boris
> comes up with.
> 
> > Even on the low level, the idea to replace drm_sched threads with workers
> > has a few problems.
> > 
> > To start with, the pattern of:
> > 
> >   while (not_stopped) {
> > 	keep picking jobs
> >   }
> > 
> > Feels fundamentally in disagreement with workers (while obviously fits
> > perfectly with the current kthread design).
> >
> 
> The while loop breaks and the worker exits if no jobs are ready.
> 
> > Secondly, it probably demands separate workers (not optional), otherwise
> > behaviour of shared workqueues has either the potential to explode the
> > number of kernel threads anyway, or to add latency.
> > 
> 
> Right now the system_unbound_wq is used, which does have a limit on the
> number of threads, right? I do have a FIXME to allow a worker to be
> passed in, similar to the TDR.
> 
> WRT latency, the 1:1 ratio could actually have lower latency as 2 GPU
> schedulers can be pushing jobs into the backend / cleaning up jobs in
> parallel.
> 

Thought of one more point here on why in Xe we absolutely want a 1 to
1 ratio between entity and scheduler - the way we implement timeslicing
for preempt fences.

Let me try to explain.

Preempt fences are implemented via the generic messaging interface [1]
with suspend / resume messages. If a suspend message is received too
soon after calling resume (this is per entity) we simply sleep in the
suspend call, thus giving the entity a timeslice. This completely falls
apart with a many to 1 relationship as an entity waiting for a
timeslice now blocks the other entities. Could we work around this? Sure,
but that's just another bunch of code we'd have to add in Xe. Being able
to freely sleep in the backend without affecting other entities is really,
really nice IMO and I bet Xe isn't the only driver that is going to feel
this way.
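
(A rough sketch of that suspend-side timeslicing, with made-up names, fields
and numbers; the point is only that sleeping here is harmless because this
scheduler serves exactly one entity:)

  #define MIN_TIMESLICE_US        1000    /* made-up value */

  /* Hypothetical: e->resume_time is recorded when the resume message
   * is sent. If suspend arrives too soon after, sleep so the entity
   * is guaranteed a minimum timeslice; nothing else is queued behind
   * this scheduler, so the sleep blocks nobody. */
  static void guc_engine_suspend(struct xe_guc_engine *e)
  {
          s64 ran_us = ktime_us_delta(ktime_get(), e->resume_time);

          if (ran_us < MIN_TIMESLICE_US)
                  usleep_range(MIN_TIMESLICE_US - ran_us,
                               MIN_TIMESLICE_US - ran_us + 100);

          /* ...now send the suspend message to the GuC... */
  }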

Last thing I'll say: regardless of how anyone feels about Xe using a 1 to
1 relationship, this patch IMO makes sense, as I hope we can all agree a
workqueue scales better than kthreads.

Matt

[1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 

> > What would be interesting to learn is whether the option of refactoring
> > drm_sched to deal with out of order completion was considered and what were
> > the conclusions.
> >
> 
> I coded this up a while back when trying to convert the i915 to the DRM
> scheduler; it isn't all that hard either. The free flow control on the
> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is really what
> sold me on this design.
> 
> > Second option perhaps to split out the drm_sched code into parts which would
> > lend themselves more to "pick and choose" of its functionalities.
> > Specifically, Xe wants frontend dependency tracking, but not any scheduling
> > really (neither least-busy drm_sched selection nor FIFO/RQ entity picking), so
> > even having all these data structures in memory is a waste.
> > 
> 
> I don't think "we are wasting memory" is a very good argument for
> making intrusive changes to the DRM scheduler.
> 
> > With the first option then the end result could be drm_sched per engine
> > class (hardware view), which I think fits with the GuC model. Give all
> > schedulable contexts (entities) to the GuC and then mostly forget about
> > them. Timeslicing and re-ordering and all happens transparently to the
> > kernel from that point until completion.
> > 
> 
> Out-of-order problem still exists here.
> 
> > Or with the second option you would build on some smaller refactored
> > sub-components of drm_sched, by maybe splitting the dependency tracking from
> > scheduling (RR/FIFO entity picking code).
> > 
> > Second option is especially a bit vague and I haven't thought about the
> > required mechanics, but it just appeared too obvious the proposed design has
> > a bit too much impedance mismatch.
> >
> 
> IMO the ROI on this is low and again, let's see what Boris comes up with.
> 
> Matt
> 
> > Oh and as a side note, when I went into the drm_sched code base to remind
> > myself how things worked, it is quite easy to find some FIXME comments which
> > suggest people working on it are unsure of the locking design there and such. So
> > perhaps that all needs cleanup too; I mean it would benefit from the
> > refactoring/improving work brainstormed above anyway.
> > 
> > Regards,
> > 
> > Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-06 23:52             ` Matthew Brost
@ 2023-01-09 13:46               ` Tvrtko Ursulin
  2023-01-09 17:27                   ` Jason Ekstrand
  0 siblings, 1 reply; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-09 13:46 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel


On 06/01/2023 23:52, Matthew Brost wrote:
> On Thu, Jan 05, 2023 at 09:43:41PM +0000, Matthew Brost wrote:
>> On Tue, Jan 03, 2023 at 01:02:15PM +0000, Tvrtko Ursulin wrote:
>>>
>>> On 02/01/2023 07:30, Boris Brezillon wrote:
>>>> On Fri, 30 Dec 2022 12:55:08 +0100
>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>>
>>>>> On Fri, 30 Dec 2022 11:20:42 +0100
>>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>>>
>>>>>> Hello Matthew,
>>>>>>
>>>>>> On Thu, 22 Dec 2022 14:21:11 -0800
>>>>>> Matthew Brost <matthew.brost@intel.com> wrote:
>>>>>>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
>>>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>>>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>>>
>>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>>>> guaranteed to match the completion order even when targeting the same hardware
>>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>>>> which is allowed to reorder, timeslice, and preempt submissions. If using a
>>>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
>>>>>>> apart as the TDR expects submission order == completion order. Using a
>>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
>>>>>>
>>>>>> Oh, that's interesting. I've been trying to solve the same sort of
>>>>>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
>>>>>> scheduling scheme (you give the FW N streams to execute, and it does
>>>>>> the scheduling between those N command streams, the kernel driver
>>>>>> does timeslice scheduling to update the command streams passed to the
>>>>>> FW). I must admit I gave up on using drm_sched at some point, mostly
>>>>>> because the integration with drm_sched was painful, but also because I
>>>>>> felt trying to bend drm_sched to make it interact with a
>>>>>> timeslice-oriented scheduling model wasn't really future proof. Giving
>>>>>> drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
>>>>>> help for a few things (didn't think it through yet), but I feel it's
>>>>>> coming short on other aspects we have to deal with on Arm GPUs.
>>>>>
>>>>> Ok, so I just had a quick look at the Xe driver and how it
>>>>> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
>>>>> have a better understanding of how you get away with using drm_sched
>>>>> while still controlling how scheduling is really done. Here
>>>>> drm_gpu_scheduler is just a dummy abstraction that lets you use the
>>>>> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
>>>>> selection is dumb because there's only one entity ever bound to the
>>>>> scheduler (the one that's part of the xe_guc_engine object which also
>>>>> contains the drm_gpu_scheduler instance). I guess the main issue we'd
>>>>> have on Arm is the fact that the stream doesn't necessarily get
>>>>> scheduled when ->run_job() is called, it can be placed in the runnable
>>>>> queue and be picked later by the kernel-side scheduler when a FW slot
>>>>> gets released. That can probably be sorted out by manually disabling the
>>>>> job timer and re-enabling it when the stream gets picked by the
>>>>> scheduler. But my main concern remains, we're basically abusing
>>>>> drm_sched here.
>>>>>
>>>>> For the Arm driver, that means turning the following sequence
>>>>>
>>>>> 1. wait for job deps
>>>>> 2. queue job to ringbuf and push the stream to the runnable
>>>>>      queue (if it wasn't queued already). Wakeup the timeslice scheduler
>>>>>      to re-evaluate (if the stream is not on a FW slot already)
>>>>> 3. stream gets picked by the timeslice scheduler and sent to the FW for
>>>>>      execution
>>>>>
>>>>> into
>>>>>
>>>>> 1. queue job to entity which takes care of waiting for job deps for
>>>>>      us
>>>>> 2. schedule a drm_sched_main iteration
>>>>> 3. the only available entity is picked, and the first job from this
>>>>>      entity is dequeued. ->run_job() is called: the job is queued to the
>>>>>      ringbuf and the stream is pushed to the runnable queue (if it wasn't
>>>>>      queued already). Wakeup the timeslice scheduler to re-evaluate (if
>>>>>      the stream is not on a FW slot already)
>>>>> 4. stream gets picked by the timeslice scheduler and sent to the FW for
>>>>>      execution
>>>>>
>>>>> That's one extra step we don't really need. To sum-up, yes, all the
>>>>> job/entity tracking might be interesting to share/re-use, but I wonder
>>>>> if we couldn't have that without pulling out the scheduling part of
>>>>> drm_sched, or maybe I'm missing something, and there's something in
>>>>> drm_gpu_scheduler you really need.
>>>>
>>>> On second thought, that's probably an acceptable overhead (not even
>>>> sure the extra step I was mentioning exists in practice, because dep
>>>> fence signaled state is checked as part of the drm_sched_main
>>>> iteration, so that's basically replacing the worker I schedule to
>>>> check job deps), and I like the idea of being able to re-use drm_sched
>>>> dep-tracking without resorting to invasive changes to the existing
>>>> logic, so I'll probably give it a try.
>>>
>>> I agree with the concerns and think that how Xe proposes to integrate with
>>> drm_sched is a problem, or at least significantly inelegant.
>>>
>>
>> Inelegant is a matter of opinion; I actually rather like this solution.
>>
>> BTW this isn't my design rather this was Jason's idea.
>>   
>>> AFAICT it proposes to have 1:1 between *userspace* created contexts (per
>>> context _and_ engine) and drm_sched. I am not sure avoiding invasive changes
>>> to the shared code is in the spirit of the overall idea and instead the
>>> opportunity should be used to look at ways to refactor/improve drm_sched.
>>>
>>
>> Yes, it is 1:1 *userspace* engines and drm_sched.
>>
>> I'm not really prepared to make large changes to DRM scheduler at the
>> moment for Xe as they are not really required nor does Boris seem to
>> think they will be required for his work either. I am interested to see
>> what Boris comes up with.
>>
>>> Even on the low level, the idea to replace drm_sched threads with workers
>>> has a few problems.
>>>
>>> To start with, the pattern of:
>>>
>>>    while (not_stopped) {
>>> 	keep picking jobs
>>>    }
>>>
>>> Feels fundamentally in disagreement with workers (while it obviously
>>> fits perfectly with the current kthread design).
>>>
>>
>> The while loop breaks and the worker exits if no jobs are ready.
>>
>>> Secondly, it probably demands separate workers (not optional), otherwise
>>> behaviour of shared workqueues has either the potential to explode the
>>> number of kernel threads anyway, or add latency.
>>>
>>
>> Right now the system_unbound_wq is used which does have a limit on the
>> number of threads, right? I do have a FIXME to allow a worker to be
>> passed in similar to TDR.
>>
>> WRT to latency, the 1:1 ratio could actually have lower latency as 2 GPU
>> schedulers can be pushing jobs into the backend / cleaning up jobs in
>> parallel.
>>
> 
> Thought of one more point here on why in Xe we absolutely want a 1 to
> 1 ratio between entity and scheduler - the way we implement timeslicing
> for preempt fences.
> 
> Let me try to explain.
> 
> Preempt fences are implemented via the generic messaging interface [1]
> with suspend / resume messages. If a suspend message is received too
> soon after calling resume (this is per entity) we simply sleep in the
> suspend call, thus giving the entity a timeslice. This completely falls
> apart with a many to 1 relationship as now an entity waiting for a
> timeslice blocks the other entities. Could we work around this? Sure,
> but that's just another bunch of code we'd have to add in Xe. Being able
> to freely sleep in the backend without affecting other entities is
> really, really nice IMO and I bet Xe isn't the only driver that is going
> to feel this way.
> 
> Last thing I'll say: regardless of how anyone feels about Xe using a 1 to
> 1 relationship, this patch IMO makes sense, as I hope we can all agree a
> workqueue scales better than kthreads.

I don't know for sure what will scale better and for what use case, 
combination of CPU cores vs number of GPU engines to keep busy vs other 
system activity. But I wager someone is bound to ask for some numbers to 
make sure the proposal is not negatively affecting any other drivers.

In any case that's a low level question caused by the high level design 
decision. So I'd think first focus on the high level - which is the 1:1 
mapping of entity to scheduler instance proposal.

Fundamentally it will be up to the DRM maintainers and the community to 
bless your approach. And it is important to stress that 1:1 is about 
userspace contexts, which I believe makes it unlike any other current 
scheduler user. It is also important to stress that this effectively does 
not make Xe _really_ use the scheduler that much.

I can only offer my opinion, which is that the two options mentioned in 
this thread (either improve drm scheduler to cope with what is required, 
or split up the code so you can use just the parts of drm_sched which 
you want - which is frontend dependency tracking) shouldn't be so 
readily dismissed, given how I think the idea was for the new driver to 
work less in a silo and more in the community (not do kludges to 
work around stuff because it is thought to be too hard to improve common 
code), but fundamentally, "goto previous paragraph" as far as I am concerned.

Regards,

Tvrtko

P.S. And as a related side note, there are more areas where drm_sched 
could be improved, like for instance priority handling.
Take a look at msm_submitqueue_create / msm_gpu_convert_priority / 
get_sched_entity to see how msm works around the drm_sched hardcoded 
limit of available priority levels, in order to avoid having to leave a 
hw capability unused. I suspect msm would be happier if they could have 
all priority levels equal in terms of whether they apply only at the 
frontend level or completely throughout the pipeline.
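
For the curious, the gist of that msm workaround is something like the 
below (paraphrased from memory rather than copied from the msm code, so 
treat the details as approximate):

  /* Flatten (ring, drm_sched priority) pairs into one linear userspace
   * priority space, so no hw ring/priority combination is left unused;
   * roughly what msm_gpu_convert_priority does. */
  static int convert_priority(unsigned int nr_rings, unsigned int prio,
                              unsigned int *ring_nr,
                              enum drm_sched_priority *sched_prio)
  {
          unsigned int nr_sched_prios = DRM_SCHED_PRIORITY_COUNT;

          *sched_prio = prio % nr_sched_prios;
          *ring_nr = prio / nr_sched_prios;

          if (*ring_nr >= nr_rings)
                  return -EINVAL;

          return 0;
  }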

> [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> 
>>> What would be interesting to learn is whether the option of refactoring
>>> drm_sched to deal with out of order completion was considered and what were
>>> the conclusions.
>>>
>>
>> I coded this up a while back when trying to convert the i915 to the DRM
>> scheduler; it isn't all that hard either. The free flow control on the
>> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is really what
>> sold me on this design.
>>
>>> Second option perhaps to split out the drm_sched code into parts which would
>>> lend themselves more to "pick and choose" of its functionalities.
>>> Specifically, Xe wants frontend dependency tracking, but not any scheduling
>>> really (neither least busy drm_sched, neither FIFO/RQ entity picking), so
>>> even having all these data structures in memory is a waste.
>>>
>>
>> I don't think "we are wasting memory" is a very good argument for
>> making intrusive changes to the DRM scheduler.
>>
>>> With the first option then the end result could be drm_sched per engine
>>> class (hardware view), which I think fits with the GuC model. Give all
>>> schedulable contexts (entities) to the GuC and then mostly forget about
>>> them. Timeslicing and re-ordering and all happens transparently to the
>>> kernel from that point until completion.
>>>
>>
>> Out-of-order problem still exists here.
>>
>>> Or with the second option you would build on some smaller refactored
>>> sub-components of drm_sched, by maybe splitting the dependency tracking from
>>> scheduling (RR/FIFO entity picking code).
>>>
>>> Second option is especially a bit vague and I haven't thought about the
>>> required mechanics, but it just appeared too obvious the proposed design has
>>> a bit too much impedance mismatch.
>>>
>>
>> IMO ROI on this is low and again let's see what Boris comes up with.
>>
>> Matt
>>
>>> Oh and as a side note, when I went into the drm_sched code base to remind
>>> myself how things worked, it is quite easy to find some FIXME comments which
>>> suggest people working on it are unsure of the locking design there and such. So
>>> perhaps that all needs cleanup too, I mean it would benefit from the
>>> refactoring/improving work as brainstormed above anyway.
>>>
>>> Regards,
>>>
>>> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-05 19:40           ` [Intel-gfx] " Matthew Brost
@ 2023-01-09 15:45             ` Jason Ekstrand
  -1 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-09 15:45 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, Boris Brezillon, dri-devel

On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
wrote:

> On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >
> > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > >
> > > > Hello Matthew,
> > > >
> > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > >
> > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > seems a bit odd but let us explain the reasoning below.
> > > > >
> > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > guaranteed to be the completion order even if targeting the same hardware
> > > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > > which is allowed to reorder, timeslice, and preempt submissions. If using a
> > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > > apart as the TDR expects submission order == completion order. Using a
> > > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > >
> > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > the scheduling between those N command streams, the kernel driver
> > > > does timeslice scheduling to update the command streams passed to the
> > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > because the integration with drm_sched was painful, but also because I
> > > > felt trying to bend drm_sched to make it interact with a
> > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
> > > > help for a few things (didn't think it through yet), but I feel it's
> > > > coming short on other aspects we have to deal with on Arm GPUs.
> > >
> > > Ok, so I just had a quick look at the Xe driver and how it
> > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > have a better understanding of how you get away with using drm_sched
> > > while still controlling how scheduling is really done. Here
> > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
>
> You nailed it here, we use the DRM scheduler for queuing jobs,
> dependency tracking and releasing jobs to be scheduled when dependencies
> are met, and lastly a tracking mechanism of in-flight jobs that need to
> be cleaned up if an error occurs. It doesn't actually do any scheduling
> aside from the most basic level of not overflowing the submission ring
> buffer. In this sense, a 1 to 1 relationship between entity and
> scheduler fits quite well.
>

Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
want here and what you need for Arm thanks to the number of FW queues
available. I don't remember the exact number of GuC queues but it's at
least 1k. This puts it in an entirely different class from what you have on
Mali. Roughly, there's about three categories here:

 1. Hardware where the kernel is placing jobs on actual HW rings. This is
old Mali, Intel Haswell and earlier, and probably a bunch of others.
(Intel BDW+ with execlists is a weird case that doesn't fit in this
categorization.)

 2. Hardware (or firmware) with a very limited number of queues where
you're going to have to juggle in the kernel in order to run desktop Linux.

 3. Firmware scheduling with a high queue count. In this case, you don't
want the kernel scheduling anything. Just throw it at the firmware and let
it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
temporarily pause some low-priority contexts and do some juggling or,
frankly, just fail userspace queue creation and tell the user to close some
windows.

The existence of this 2nd class is a bit annoying but it's where we are. I
think it's worth recognizing that Xe and panfrost are in different places
here and will require different designs. For Xe, we really are just using
drm/scheduler as a front-end and the firmware does all the real scheduling.

How do we deal with class 2? That's an interesting question.  We may
eventually want to break that off into a separate discussion and not litter
the Xe thread but let's keep going here for a bit.  I think there are some
pretty reasonable solutions but they're going to look a bit different.

The way I did this for Xe with execlists was to keep the 1:1:1 mapping
between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
ring and then there was a tiny kernel which operated entirely in IRQ
handlers which juggled those execlists by smashing HW registers.  For
Panfrost, I think we want something slightly different but can borrow some
ideas here.  In particular, have the schedulers feed kernel-side SW queues
(they can even be fixed-size if that helps) and then have a kthread which
juggles those and feeds the limited FW queues.  In the case where you have few
enough active contexts to fit them all in FW, I do think it's best to have
them all active in FW and let it schedule. But with only 31, you need to be
able to juggle if you run out.
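
In pseudo-C, that kernel-side juggling layer might look like the below
(every name is invented; this is just to illustrate the shape, not a
real implementation):

  #define NR_FW_SLOTS 31

  struct fw_dev {
          struct my_queue *slot[NR_FW_SLOTS]; /* resident on the FW */
          struct list_head runnable;          /* waiting for a slot */
          spinlock_t lock;
          wait_queue_head_t wait;
  };

  /* drm_sched's ->run_job() would just write the job into the context's
   * SW queue, mark the queue runnable, and kick the juggler. */

  static int juggler(void *data)
  {
          struct fw_dev *fdev = data;

          while (!kthread_should_stop()) {
                  /* wait until a queue is runnable and either a slot is
                   * free or a resident context has used its timeslice;
                   * then evict and install queues accordingly */
          }

          return 0;
  }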

> FWIW this design was also run by AMD quite a while ago (off the list)
> and we didn't get any serious push back. Things can change however...
>

Yup, AMD and NVIDIA both want this, more-or-less.


> > > selection is dumb because there's only one entity ever bound to the
> > > scheduler (the one that's part of the xe_guc_engine object which also
> > > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > > have on Arm is the fact that the stream doesn't necessarily get
> > > scheduled when ->run_job() is called, it can be placed in the runnable
> > > queue and be picked later by the kernel-side scheduler when a FW slot
> > > gets released. That can probably be sorted out by manually disabling
> the
> > > job timer and re-enabling it when the stream gets picked by the
> > > scheduler. But my main concern remains, we're basically abusing
> > > drm_sched here.
> > >
>
> That's a matter of opinion, yes we are using it slightly differently
> than anyone else but IMO the fact the DRM scheduler works for the Xe use
> case with barely any changes is a testament to its design.
>
> > > For the Arm driver, that means turning the following sequence
> > >
> > > 1. wait for job deps
> > > 2. queue job to ringbuf and push the stream to the runnable
> > >    queue (if it wasn't queued already). Wakeup the timeslice scheduler
> > >    to re-evaluate (if the stream is not on a FW slot already)
> > > 3. stream gets picked by the timeslice scheduler and sent to the FW for
> > >    execution
> > >
> > > into
> > >
> > > 1. queue job to entity which takes care of waiting for job deps for
> > >    us
> > > 2. schedule a drm_sched_main iteration
> > > 3. the only available entity is picked, and the first job from this
> > >    entity is dequeued. ->run_job() is called: the job is queued to the
> > >    ringbuf and the stream is pushed to the runnable queue (if it wasn't
> > >    queued already). Wakeup the timeslice scheduler to re-evaluate (if
> > >    the stream is not on a FW slot already)
> > > 4. stream gets picked by the timeslice scheduler and sent to the FW for
> > >    execution
> > >
>
> Yes, an extra step but you get to use all the nice DRM scheduler
> functions for dependency tracking. Also in our case we really want a
> single entry point in the backend (the work queue). Also see [1] which
> helped us seal a bunch of races we had in the i915 by using a single
> entry point. All these benefits are why we landed on the DRM scheduler
> and it has worked out rather nicely compared to the i915.
>
> [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
>
> > > That's one extra step we don't really need. To sum-up, yes, all the
> > > job/entity tracking might be interesting to share/re-use, but I wonder
> > > if we couldn't have that without pulling out the scheduling part of
> > > drm_sched, or maybe I'm missing something, and there's something in
> > > drm_gpu_scheduler you really need.
> >
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use drm_sched
> > dep-tracking without resorting to invasive changes to the existing
> > logic, so I'll probably give it a try.
>
> Let me know how this goes.
>
> Matt
>

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-09 15:45             ` Jason Ekstrand
@ 2023-01-09 17:17               ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-09 17:17 UTC (permalink / raw)
  To: Jason Ekstrand, intel-gfx; +Cc: Matthew Brost, dri-devel

Hi Jason,

On Mon, 9 Jan 2023 09:45:09 -0600
Jason Ekstrand <jason@jlekstrand.net> wrote:

> On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> wrote:
> 
> > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > >  
> > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > >  
> > > > > Hello Matthew,
> > > > >
> > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > >  
> > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > >
> > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > guaranteed to be the completion order even if targeting the same hardware
> > > > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > > > which is allowed to reorder, timeslice, and preempt submissions. If using a
> > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > > > apart as the TDR expects submission order == completion order. Using a
> > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > > >
> > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > the scheduling between those N command streams, the kernel driver
> > > > > does timeslice scheduling to update the command streams passed to the
> > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > because the integration with drm_sched was painful, but also because I
> > > > > felt trying to bend drm_sched to make it interact with a
> > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
> > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > >
> > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > have a better understanding of how you get away with using drm_sched
> > > > while still controlling how scheduling is really done. Here
> > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue  
> >
> > You nailed it here, we use the DRM scheduler for queuing jobs,
> > dependency tracking and releasing jobs to be scheduled when dependencies
> > are met, and lastly a tracking mechanism of in-flight jobs that need to
> > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > aside from the most basic level of not overflowing the submission ring
> > buffer. In this sense, a 1 to 1 relationship between entity and
> > scheduler fits quite well.
> >  
> 
> Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> want here and what you need for Arm thanks to the number of FW queues
> available. I don't remember the exact number of GuC queues but it's at
> least 1k. This puts it in an entirely different class from what you have on
> Mali. Roughly, there's about three categories here:
> 
>  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> old Mali, Intel Haswell and earlier, and probably a bunch of others.
> (Intel BDW+ with execlists is a weird case that doesn't fit in this
> categorization.)
> 
>  2. Hardware (or firmware) with a very limited number of queues where
> you're going to have to juggle in the kernel in order to run desktop Linux.
> 
>  3. Firmware scheduling with a high queue count. In this case, you don't
> want the kernel scheduling anything. Just throw it at the firmware and let
> it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> temporarily pause some low-priority contexts and do some juggling or,
> frankly, just fail userspace queue creation and tell the user to close some
> windows.
> 
> The existence of this 2nd class is a bit annoying but it's where we are. I
> think it's worth recognizing that Xe and panfrost are in different places
> here and will require different designs. For Xe, we really are just using
> drm/scheduler as a front-end and the firmware does all the real scheduling.
> 
> How do we deal with class 2? That's an interesting question.  We may
> eventually want to break that off into a separate discussion and not litter
> the Xe thread but let's keep going here for a bit.  I think there are some
> pretty reasonable solutions but they're going to look a bit different.
> 
> The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> ring and then there was a tiny kernel which operated entirely in IRQ
> handlers which juggled those execlists by smashing HW registers.  For
> Panfrost, I think we want something slightly different but can borrow some
> ideas here.  In particular, have the schedulers feed kernel-side SW queues
> (they can even be fixed-size if that helps) and then have a kthread which
> juggles those feeds the limited FW queues.  In the case where you have few
> enough active contexts to fit them all in FW, I do think it's best to have
> them all active in FW and let it schedule. But with only 31, you need to be
> able to juggle if you run out.

That's more or less what I do right now, except I don't use the
drm_sched front-end to handle deps or queue jobs (at least not yet). The
kernel-side timeslice-based scheduler juggling with runnable queues
(queues with pending jobs that are not yet resident on a FW slot)
uses a dedicated ordered-workqueue instead of a thread, with scheduler
ticks being handled with a delayed-work (tick happening every X
milliseconds when queues are waiting for a slot). It all seems very
HW/FW-specific though, and I think it's a bit premature to try to
generalize that part, but the dep-tracking logic implemented by
drm_sched looked like something I could easily re-use, hence my
interest in Xe's approach.
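
In case it helps the comparison, the tick part is shaped roughly like
this (heavily simplified, and all names are invented for this mail):

  #define TICK_PERIOD_MS 10 /* made up value */

  struct fw_sched {
          struct workqueue_struct *wq; /* alloc_ordered_workqueue() */
          struct delayed_work tick;
  };

  static void sched_tick(struct work_struct *work)
  {
          struct fw_sched *s =
                  container_of(work, struct fw_sched, tick.work);

          /* rotate FW slots between runnable queues here */

          /* keep ticking only while queues are waiting for a slot */
          if (queues_waiting(s)) /* invented helper */
                  mod_delayed_work(s->wq, &s->tick,
                                   msecs_to_jiffies(TICK_PERIOD_MS));
  }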

Regards,

Boris

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-09 13:46               ` Tvrtko Ursulin
@ 2023-01-09 17:27                   ` Jason Ekstrand
  0 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-09 17:27 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Matthew Brost, intel-gfx, dri-devel

On Mon, Jan 9, 2023 at 7:46 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:

>
> On 06/01/2023 23:52, Matthew Brost wrote:
> > On Thu, Jan 05, 2023 at 09:43:41PM +0000, Matthew Brost wrote:
> >> On Tue, Jan 03, 2023 at 01:02:15PM +0000, Tvrtko Ursulin wrote:
> >>>
> >>> On 02/01/2023 07:30, Boris Brezillon wrote:
> >>>> On Fri, 30 Dec 2022 12:55:08 +0100
> >>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>>>
> >>>>> On Fri, 30 Dec 2022 11:20:42 +0100
> >>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>>>>
> >>>>>> Hello Matthew,
> >>>>>>
> >>>>>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>>>>> Matthew Brost <matthew.brost@intel.com> wrote:
> >>>>>>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> >>>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> >>>>>>> seems a bit odd but let us explain the reasoning below.
> >>>>>>>
> >>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
> >>>>>>> guaranteed to be the completion order even if targeting the same hardware
> >>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
> >>>>>>> which is allowed to reorder, timeslice, and preempt submissions. If using a
> >>>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> >>>>>>> apart as the TDR expects submission order == completion order. Using a
> >>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> >>>>>>
> >>>>>> Oh, that's interesting. I've been trying to solve the same sort of
> >>>>>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> >>>>>> scheduling scheme (you give the FW N streams to execute, and it does
> >>>>>> the scheduling between those N command streams, the kernel driver
> >>>>>> does timeslice scheduling to update the command streams passed to the
> >>>>>> FW). I must admit I gave up on using drm_sched at some point, mostly
> >>>>>> because the integration with drm_sched was painful, but also because I
> >>>>>> felt trying to bend drm_sched to make it interact with a
> >>>>>> timeslice-oriented scheduling model wasn't really future proof. Giving
> >>>>>> drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
> >>>>>> help for a few things (didn't think it through yet), but I feel it's
> >>>>>> coming short on other aspects we have to deal with on Arm GPUs.
> >>>>>
> >>>>> Ok, so I just had a quick look at the Xe driver and how it
> >>>>> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> >>>>> have a better understanding of how you get away with using drm_sched
> >>>>> while still controlling how scheduling is really done. Here
> >>>>> drm_gpu_scheduler is just a dummy abstraction that lets you use the
> >>>>> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> >>>>> selection is dumb because there's only one entity ever bound to the
> >>>>> scheduler (the one that's part of the xe_guc_engine object which also
> >>>>> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> >>>>> have on Arm is the fact that the stream doesn't necessarily get
> >>>>> scheduled when ->run_job() is called, it can be placed in the
> runnable
> >>>>> queue and be picked later by the kernel-side scheduler when a FW slot
> >>>>> gets released. That can probably be sorted out by manually disabling
> the
> >>>>> job timer and re-enabling it when the stream gets picked by the
> >>>>> scheduler. But my main concern remains, we're basically abusing
> >>>>> drm_sched here.
> >>>>>
> >>>>> For the Arm driver, that means turning the following sequence
> >>>>>
> >>>>> 1. wait for job deps
> >>>>> 2. queue job to ringbuf and push the stream to the runnable
> >>>>>      queue (if it wasn't queued already). Wakeup the timeslice
> scheduler
> >>>>>      to re-evaluate (if the stream is not on a FW slot already)
> >>>>> 3. stream gets picked by the timeslice scheduler and sent to the FW
> for
> >>>>>      execution
> >>>>>
> >>>>> into
> >>>>>
> >>>>> 1. queue job to entity which takes care of waiting for job deps for
> >>>>>      us
> >>>>> 2. schedule a drm_sched_main iteration
> >>>>> 3. the only available entity is picked, and the first job from this
> >>>>>      entity is dequeued. ->run_job() is called: the job is queued to
> the
> >>>>>      ringbuf and the stream is pushed to the runnable queue (if it
> wasn't
> >>>>>      queued already). Wakeup the timeslice scheduler to re-evaluate
> (if
> >>>>>      the stream is not on a FW slot already)
> >>>>> 4. stream gets picked by the timeslice scheduler and sent to the FW
> for
> >>>>>      execution
> >>>>>
> >>>>> That's one extra step we don't really need. To sum-up, yes, all the
> >>>>> job/entity tracking might be interesting to share/re-use, but I
> wonder
> >>>>> if we couldn't have that without pulling out the scheduling part of
> >>>>> drm_sched, or maybe I'm missing something, and there's something in
> >>>>> drm_gpu_scheduler you really need.
> >>>>
> >>>> On second thought, that's probably an acceptable overhead (not even
> >>>> sure the extra step I was mentioning exists in practice, because dep
> >>>> fence signaled state is checked as part of the drm_sched_main
> >>>> iteration, so that's basically replacing the worker I schedule to
> >>>> check job deps), and I like the idea of being able to re-use drm_sched
> >>>> dep-tracking without resorting to invasive changes to the existing
> >>>> logic, so I'll probably give it a try.
> >>>
> >>> I agree with the concerns and think that how Xe proposes to integrate with
> >>> drm_sched is a problem, or at least significantly inelegant.
> >>>
> >>
> >> Inelegant is a matter of opinion; I actually rather like this solution.
> >>
> >> BTW this isn't my design rather this was Jason's idea.
>

Sure, throw me under the bus, why don't you? :-P  Nah, it's all fine.  It's
my design and I'm happy to defend it or be blamed for it in the history
books as the case may be.


> >>> AFAICT it proposes to have 1:1 between *userspace* created contexts (per
> >>> context _and_ engine) and drm_sched. I am not sure avoiding invasive changes
> >>> to the shared code is in the spirit of the overall idea and instead the
> >>> opportunity should be used to look at ways to refactor/improve drm_sched.
>

Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
really needs to drive a re-factor.  (More on that later.)  There's only one
real issue which is that it fires off potentially a lot of kthreads. Even
that's not that bad given that kthreads are pretty light and you're not
likely to have more kthreads than userspace threads which are much
heavier.  Not ideal, but not the end of the world either.  Definitely
something we can/should optimize but if we went through with Xe without
this patch, it would probably be mostly ok.


> >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >>
> >> I'm not really prepared to make large changes to DRM scheduler at the
> >> moment for Xe as they are not really required nor does Boris seem to
> >> think they will be required for his work either. I am interested to
> >> see what Boris comes up with.
> >>
> >>> Even on the low level, the idea to replace drm_sched threads with workers
> >>> has a few problems.
> >>>
> >>> To start with, the pattern of:
> >>>
> >>>    while (not_stopped) {
> >>>     keep picking jobs
> >>>    }
> >>>
> >>> Feels fundamentally in disagreement with workers (while it obviously
> >>> fits perfectly with the current kthread design).
> >>
> >> The while loop breaks and the worker exits if no jobs are ready.
>

I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work item
which handles all available jobs?
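
I.e. something like the below, where each invocation handles a single
job and re-queues itself, instead of draining everything in one go?
(A sketch with invented names, just to make sure I understand the
suggestion.)

  static void run_one_job(struct work_struct *w)
  {
          struct my_sched *s = container_of(w, struct my_sched, work);
          struct my_job *job = pick_first_ready_job(s); /* invented */

          if (!job)
                  return;

          submit_to_hw(job); /* invented */

          /* one job per invocation: yield back to the workqueue
           * between jobs rather than looping here */
          queue_work(s->wq, &s->work);
  }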


> >>> Secondly, it probably demands separate workers (not optional), otherwise
> >>> behaviour of shared workqueues has either the potential to explode the
> >>> number of kernel threads anyway, or add latency.
> >>>
> >>
> >> Right now the system_unbound_wq is used which does have a limit on the
> >> number of threads, right? I do have a FIXME to allow a worker to be
> >> passed in similar to TDR.
> >>
> >> WRT to latency, the 1:1 ratio could actually have lower latency as 2 GPU
> >> schedulers can be pushing jobs into the backend / cleaning up jobs in
> >> parallel.
> >>
> >
> > Thought of one more point here on why in Xe we absolutely want a 1 to
> > 1 ratio between entity and scheduler - the way we implement timeslicing
> > for preempt fences.
> >
> > Let me try to explain.
> >
> > Preempt fences are implemented via the generic messaging interface [1]
> > with suspend / resume messages. If a suspend message is received too
> > soon after calling resume (this is per entity) we simply sleep in the
> > suspend call, thus giving the entity a timeslice. This completely falls
> > apart with a many to 1 relationship as now an entity waiting for a
> > timeslice blocks the other entities. Could we work around this? Sure,
> > but that's just another bunch of code we'd have to add in Xe. Being
> > able to freely sleep in the backend without affecting other entities is
> > really, really nice IMO and I bet Xe isn't the only driver that is
> > going to feel this way.
> >
> > Last thing I'll say: regardless of how anyone feels about Xe using a 1 to
> > 1 relationship, this patch IMO makes sense, as I hope we can all agree a
> > workqueue scales better than kthreads.
>
> I don't know for sure what will scale better and for what use case,
> combination of CPU cores vs number of GPU engines to keep busy vs other
> system activity. But I wager someone is bound to ask for some numbers to
> make sure the proposal is not negatively affecting any other drivers.
>

Then let them ask.  Waving your hands vaguely in the direction of the rest
of DRM and saying "Uh, someone (not me) might object" is profoundly
unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
think there's someone in particular who might have a useful opinion on
this, throw them in the CC so they don't miss the e-mail thread.

Or are you asking for numbers?  If so, what numbers are you asking for?

Also, If we're talking about a design that might paint us into an
Intel-HW-specific hole, that would be one thing.  But we're not.  We're
talking about switching which kernel threading/task mechanism to use for
what's really a very generic problem.  The core Xe design works without
this patch (just with more kthreads).  If we land this patch or something
like it and get it wrong and it causes a performance problem for someone
down the line, we can revisit it.


> In any case that's a low level question caused by the high level design
> decision. So I'd think first focus on the high level - which is the 1:1
> mapping of entity to scheduler instance proposal.
>
> Fundamentally it will be up to the DRM maintainers and the community to
> bless your approach. And it is important to stress that 1:1 is about
> userspace contexts, which I believe makes it unlike any other current
> scheduler user. It is also important to stress that this effectively
> does not make Xe _really_ use the scheduler that much.
>

I don't think this makes Xe nearly as much of a one-off as you think it
does.  I've already told the Asahi team working on Apple M1/2 hardware to
do it this way and it seems to be a pretty good mapping for them.  I
believe this is roughly the plan for nouveau as well.  It's not the way it
currently works for anyone because most other groups aren't doing FW
scheduling yet.  In the world of FW scheduling and hardware designed to
support userspace direct-to-FW submit, I think the design makes perfect
sense (see below) and I expect we'll see more drivers move in this
direction as those drivers evolve.  (AMD is doing some customish thing for
now with gpu_scheduler on the front-end somehow.  I've not dug into those
details.)


> I can only offer my opinion, which is that the two options mentioned in
> this thread (either improve drm scheduler to cope with what is required,
> or split up the code so you can use just the parts of drm_sched which
> you want - which is frontend dependency tracking) shouldn't be so
> readily dismissed, given how I think the idea was for the new driver to
> work less in a silo and more in the community (not do kludges to work
> around stuff because it is thought to be too hard to improve common
> code), but fundamentally, "goto previous paragraph" as far as I am
> concerned.
>

Meta comment:  It appears as if you're falling into the standard i915 team
trap of having an internal discussion about what the community discussion
might look like instead of actually having the community discussion.  If
you are seriously concerned about interactions with other drivers or
whether or not we're setting common direction, the right way to do that is
to break a patch or two out into a separate RFC series and tag a handful of
driver maintainers.  Trying to predict the questions other people might ask
is pointless. Cc them and ask for their input instead.


> Regards,
>
> Tvrtko
>
> P.S. And as a related side note, there are more areas where drm_sched
> could be improved, like for instance priority handling.
> Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> get_sched_entity to see how msm works around the drm_sched hardcoded
> limit of available priority levels, in order to avoid having to leave a
> hw capability unused. I suspect msm would be happier if they could have
> all priority levels equal in terms of whether they apply only at the
> frontend level or completely throughout the pipeline.
>
> > [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >
> >>> What would be interesting to learn is whether the option of
> >>> refactoring drm_sched to deal with out-of-order completion was
> >>> considered and what were the conclusions.
> >>>
> >>
> >> I coded this up a while back when trying to convert the i915 to the DRM
> >> scheduler; it isn't all that hard either. The free flow control on the
> >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is really what
> >> sold me on this design.
>

You're not the only one to suggest supporting out-of-order completion.
However, it's tricky and breaks a lot of internal assumptions of the
scheduler. It also reduces functionality a bit because it can no longer
automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
yes, it probably could but it becomes a substantially harder problem.)

It also seems like a worse mapping to me.  The goal here is to turn
submissions on a userspace-facing engine/queue into submissions to a FW
queue, sorting out any dma_fence dependencies along the way.  Matt's
description of this as a 1:1 mapping between sched/entity doesn't tell the
whole story. It's a 1:1:1 mapping between xe_engine, gpu_scheduler, and
GuC FW engine.  Why make it a 1:something:1 mapping?  Why is that better?

There are two places where this 1:1:1 mapping is causing problems:

 1. It creates lots of kthreads. This is what this patch is trying to
solve. IDK if it's solving it the best way but that's the goal.

 2. There is a far more limited number of communication queues between the
kernel and GuC for more meta things like pausing and resuming queues,
getting events back from GuC, etc. Unless we're in a weird pressure
scenario, the amount of traffic on this queue should be low so we can
probably just have one per physical device.  The vast majority of kernel ->
GuC communication should be on the individual FW queue rings and maybe
smashing in-memory doorbells.

Doing out-of-order completion sort-of solves problem 1 but does nothing for
problem 2, and actually makes managing FW queues harder because we no longer have
built-in rate limiting.  Seems like a net loss to me.
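
For what it's worth, the "free flow control" referenced above is just this
(sketch, hypothetical values; hw_submission is the real limit argument that
drm_sched_init() takes):

    #include <linux/sizes.h>

    #define MY_FW_RING_SIZE  SZ_16K /* hypothetical FW ring size */
    #define MY_MAX_JOB_BYTES 256    /* hypothetical worst-case job size */

    /*
     * With in-order completion, capping in-flight jobs at
     * ring size / max job size guarantees the fixed-size FW ring can
     * never overflow.
     */
    u32 my_hw_submission_limit = MY_FW_RING_SIZE / MY_MAX_JOB_BYTES;

Out-of-order completion gives up exactly this invariant.
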

> >>> Second option perhaps to split out the drm_sched code into parts
> >>> which would lend themselves more to "pick and choose" of its
> >>> functionalities. Specifically, Xe wants frontend dependency tracking,
> >>> but not any scheduling really (neither least-busy drm_sched selection,
> >>> nor FIFO/RQ entity picking), so even having all these data structures
> >>> in memory is a waste.
> >>>
> >>
> >> I don't think "that we are wasting memory" is a very good argument for
> >> making intrusive changes to the DRM scheduler.
>

Worse than that, I think the "we could split it up" kind-of misses the
point of the way Xe is using drm/scheduler.  It's not just about re-using a
tiny bit of dependency tracking code.  Using the scheduler in this way
provides a clean separation between front-end and back-end.  The job of the
userspace-facing ioctl code is to shove things on the scheduler.  The job
of the run_job callback is to encode the job into the FW queue format,
stick it in the FW queue ring, and maybe smash a doorbell.  Everything else
happens in terms of managing those queues side-band.  The gpu_scheduler
code manages the front-end queues and Xe manages the FW queues via the
Kernel <-> GuC communication rings.  From a high level, this is a really
clean design.  There are potentially some sticky bits around the dual-use
of dma_fence for scheduling and memory management but none of those are
solved by breaking the DRM scheduler into chunks or getting rid of the
1:1:1 mapping.
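
Concretely, the back-end half of that split is small; a hypothetical
run_job for a FW-queue-backed engine is little more than the following
(sketch, invented helper names):

    static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
    {
            struct my_job *job = to_my_job(sched_job);
            struct my_fw_queue *q = job->queue;

            my_fw_ring_emit(q, job);     /* encode into the FW ring format */
            my_fw_ring_publish(q);       /* make the new tail visible */
            my_fw_ring_doorbell(q);      /* smash the doorbell */

            return dma_fence_get(job->hw_fence);
    }
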

If we split it out, we're basically asking the driver to implement a bunch
of kthread or workqueue stuff, all the ring rate-limiting, etc.  It may not
be all that much code but also, why?  To save a few bytes of memory per
engine?  Each engine already has 32K(ish) worth of context state and a
similar size ring to communicate with the FW.  No one is going to notice an
extra CPU data structure.

I'm not seeing a solid argument against the 1:1:1 design here other than
that it doesn't seem like the way DRM scheduler was intended to be used.  I
won't argue that.  It's not.  But it is a fairly natural way to take
advantage of the benefits the DRM scheduler does provide while also mapping
it to hardware that was designed for userspace direct-to-FW submit.

--Jason



> >>> With the first option then the end result could be drm_sched per engine
> >>> class (hardware view), which I think fits with the GuC model. Give all
> >>> schedulable contexts (entities) to the GuC and then mostly forget about
> >>> them. Timeslicing and re-ordering and all happens transparently to the
> >>> kernel from that point until completion.
> >>>
> >>
> >> Out-of-order problem still exists here.
> >>
> >>> Or with the second option you would build on some smaller refactored
> >>> sub-components of drm_sched, by maybe splitting the dependency
> >>> tracking from scheduling (RR/FIFO entity picking code).
> >>>
> >>> Second option is especially a bit vague and I haven't thought about
> >>> the required mechanics, but it just appeared too obvious that the
> >>> proposed design has a bit too much impedance mismatch.
> >>>
> >>
> >> IMO the ROI on this is low and, again, let's see what Boris comes up with.
> >>
> >> Matt
> >>
> >>> Oh and as a side note, when I went into the drm_sched code base to
> >>> remind myself how things worked, it is quite easy to find some FIXME
> >>> comments which suggest people working on it are unsure of the locking
> >>> design there and such. So perhaps that all needs cleanup too; I mean
> >>> it would benefit from refactoring/improving work as brainstormed above
> >>> anyway.
> >>>
> >>> Regards,
> >>>
> >>> Tvrtko
>

[-- Attachment #2: Type: text/html, Size: 25084 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-09 17:17               ` Boris Brezillon
@ 2023-01-09 20:40                 ` Daniel Vetter
  -1 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-01-09 20:40 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> Hi Jason,
> 
> On Mon, 9 Jan 2023 09:45:09 -0600
> Jason Ekstrand <jason@jlekstrand.net> wrote:
> 
> > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > wrote:
> > 
> > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > >  
> > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > >  
> > > > > > Hello Matthew,
> > > > > >
> > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > >  
> > > > > > > In XE, the new Intel GPU driver, a choice has been made to have
> > > > > > > a 1 to 1 mapping between a drm_gpu_scheduler and
> > > > > > > drm_sched_entity. At first this seems a bit odd but let us
> > > > > > > explain the reasoning below.
> > > > > > >
> > > > > > > 1. In XE the submission order from multiple drm_sched_entity is
> > > > > > > not guaranteed to match the completion order, even if targeting
> > > > > > > the same hardware engine. This is because in XE we have a
> > > > > > > firmware scheduler, the GuC, which is allowed to reorder,
> > > > > > > timeslice, and preempt submissions. If using a shared
> > > > > > > drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> > > > > > > falls apart as the TDR expects submission order == completion
> > > > > > > order. Using a dedicated drm_gpu_scheduler per drm_sched_entity
> > > > > > > solves this problem.
> > > > > >
> > > > > > Oh, that's interesting. I've been trying to solve the same sort
> > > > > > of issues to support Arm's new Mali GPU which is relying on a
> > > > > > FW-assisted scheduling scheme (you give the FW N streams to
> > > > > > execute, and it does the scheduling between those N command
> > > > > > streams; the kernel driver does timeslice scheduling to update
> > > > > > the command streams passed to the FW). I must admit I gave up on
> > > > > > using drm_sched at some point, mostly because the integration
> > > > > > with drm_sched was painful, but also because I felt trying to
> > > > > > bend drm_sched to make it interact with a timeslice-oriented
> > > > > > scheduling model wasn't really future proof. Giving
> > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler could
> > > > > > probably help for a few things (didn't think it through yet),
> > > > > > but I feel it's coming short on other aspects we have to deal
> > > > > > with on Arm GPUs.
> > > > >
> > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > have a better understanding of how you get away with using drm_sched
> > > > > while still controlling how scheduling is really done. Here
> > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue  
> > >
> > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > are met, and lastly as a tracking mechanism for in-flight jobs that need to
> > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > aside from the most basic level of not overflowing the submission ring
> > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > scheduler fits quite well.
> > >  
> > 
> > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > want here and what you need for Arm thanks to the number of FW queues
> > available. I don't remember the exact number of GuC queues but it's at
> > least 1k. This puts it in an entirely different class from what you have on
> > Mali. Roughly, there's about three categories here:
> > 
> >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > categorization.)
> > 
> >  2. Hardware (or firmware) with a very limited number of queues where
> > you're going to have to juggle in the kernel in order to run desktop Linux.
> > 
> >  3. Firmware scheduling with a high queue count. In this case, you don't
> > want the kernel scheduling anything. Just throw it at the firmware and let
> > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > temporarily pause some low-priority contexts and do some juggling or,
> > frankly, just fail userspace queue creation and tell the user to close some
> > windows.
> > 
> > The existence of this 2nd class is a bit annoying but it's where we are. I
> > think it's worth recognizing that Xe and panfrost are in different places
> > here and will require different designs. For Xe, we really are just using
> > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > 
> > How do we deal with class 2? That's an interesting question.  We may
> > eventually want to break that off into a separate discussion and not litter
> > the Xe thread but let's keep going here for a bit.  I think there are some
> > pretty reasonable solutions but they're going to look a bit different.
> > 
> > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > ring and then there was a tiny kernel which operated entirely in IRQ
> > handlers which juggled those execlists by smashing HW registers.  For
> > Panfrost, I think we want something slightly different but can borrow some
> > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > (they can even be fixed-size if that helps) and then have a kthread which
> > juggles those and feeds the limited FW queues.  In the case where you have few
> > enough active contexts to fit them all in FW, I do think it's best to have
> > them all active in FW and let it schedule. But with only 31, you need to be
> > able to juggle if you run out.
> 
> That's more or less what I do right now, except I don't use the
> drm_sched front-end to handle deps or queue jobs (at least not yet). The
> kernel-side timeslice-based scheduler juggling with runnable queues
> (queues with pending jobs that are not yet resident on a FW slot)
> uses a dedicated ordered-workqueue instead of a thread, with scheduler
> ticks being handled with a delayed-work (tick happening every X
> milliseconds when queues are waiting for a slot). It all seems very
> HW/FW-specific though, and I think it's a bit premature to try to
> generalize that part, but the dep-tracking logic implemented by
> drm_sched looked like something I could easily re-use, hence my
> interest in Xe's approach.
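
Schematically, the tick machinery described above is just an ordered
workqueue plus a delayed work (sketch, hypothetical names):

    #define MY_TICK_PERIOD_MS 10    /* hypothetical tick period */

    struct my_scheduler {
            struct workqueue_struct *wq;   /* alloc_ordered_workqueue() */
            struct delayed_work tick_work; /* the scheduler tick */
    };

    static void my_sched_tick(struct work_struct *w)
    {
            struct my_scheduler *s =
                    container_of(w, struct my_scheduler, tick_work.work);

            my_rotate_fw_slots(s);  /* evict/insert queues per priority */

            /* Re-arm only while queues are waiting for a FW slot. */
            if (my_queues_waiting_for_slot(s))
                    mod_delayed_work(s->wq, &s->tick_work,
                                     msecs_to_jiffies(MY_TICK_PERIOD_MS));
    }
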

So another option for these few fw queue slots schedulers would be to
treat them as vram and enlist ttm.

Well maybe more enlist ttm and less treat them like vram, but ttm can
handle idr (or xarray or whatever you want) and then help you with all the
pipelining (and the drm_sched then with sorting out dependencies). If you
then also preferentially "evict" low-priority queues you pretty much have
the perfect thing.

Note that GuC with SR-IOV splits up the id space, and together with some
restrictions due to multi-engine contexts, media needs might also require
all of this.

If you're balking at the idea of enlisting ttm just for fw queue
management, amdgpu has a shoddy version of id allocation for their vm/tlb
index allocation. Might be worth it to instead lift that into some sched
helper code.
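
Such a helper could be little more than an xarray with an eviction policy
on top; a sketch (invented names, eviction left to the caller):

    #include <linux/xarray.h>

    struct my_fw_slot_pool {
            struct xarray slots;    /* init with XA_FLAGS_ALLOC */
            u32 num_slots;          /* e.g. 31 FW slots in the Mali case */
    };

    /*
     * Returns 0 and a slot id on success, -EBUSY when every slot is in
     * use, in which case the caller evicts a low-priority resident queue
     * and retries.
     */
    static int my_fw_slot_get(struct my_fw_slot_pool *pool, void *queue,
                              u32 *id)
    {
            return xa_alloc(&pool->slots, id, queue,
                            XA_LIMIT(0, pool->num_slots - 1), GFP_KERNEL);
    }
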

Either way there are two IMO rather solid approaches available to sort this
out. And once you have that, then there shouldn't be any big difference in
driver design between fw with de facto unlimited queue ids, and those with
severe restrictions in the number of queues.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-09 17:27                   ` Jason Ekstrand
@ 2023-01-10 11:28                     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-10 11:28 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Matthew Brost, intel-gfx, dri-devel



On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]

>      >>> AFAICT it proposes to have 1:1 between *userspace* created
>     contexts (per
>      >>> context _and_ engine) and drm_sched. I am not sure avoiding
>     invasive changes
>      >>> to the shared code is in the spirit of the overall idea and instead
>      >>> opportunity should be used to look at way to refactor/improve
>     drm_sched.
> 
> 
> Maybe?  I'm not convinced that what Xe is doing is an abuse at all or 
> really needs to drive a re-factor.  (More on that later.)  There's only 
> one real issue which is that it fires off potentially a lot of kthreads. 
> Even that's not that bad given that kthreads are pretty light and you're 
> not likely to have more kthreads than userspace threads which are much 
> heavier.  Not ideal, but not the end of the world either.  Definitely 
> something we can/should optimize but if we went through with Xe without 
> this patch, it would probably be mostly ok.
> 
>      >> Yes, it is 1:1 *userspace* engines and drm_sched.
>      >>
>      >> I'm not really prepared to make large changes to DRM scheduler
>     at the
>      >> moment for Xe as they are not really required nor does Boris
>     seem they
>      >> will be required for his work either. I am interested to see
>     what Boris
>      >> comes up with.
>      >>
>      >>> Even on the low level, the idea to replace drm_sched threads
>     with workers
>      >>> has a few problems.
>      >>>
>      >>> To start with, the pattern of:
>      >>>
>      >>>    while (not_stopped) {
>      >>>     keep picking jobs
>      >>>    }
>      >>>
>      >>> Feels fundamentally in disagreement with workers (while
>     obviously fits
>      >>> perfectly with the current kthread design).
>      >>
>      >> The while loop breaks and worker exits if no jobs are ready.
> 
> 
> I'm not very familiar with workqueues. What are you saying would fit 
> better? One scheduling job per work item rather than one big work item 
> which handles all available jobs?

Yes and no; it indeed IMO does not fit to have a work item which is 
potentially unbounded in runtime. But that conceptual mismatch is a bit 
moot, because it is a worst-case / theoretical concern, and I think the 
objection is really due to more fundamental concerns.

If we have to go back to the low level side of things, I've picked this 
random spot to consolidate what I have already mentioned and perhaps expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many 
cycles. That means maintaining just enough concurrency to prevent work 
processing from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued 
work items (more interesting when mixing up with the system_unbound_wq), 
in the proposed design the number of instantiated worker threads does 
not correspond to the number of user threads (as you have elsewhere 
stated), but pessimistically to the number of active user contexts. That 
is the number which drives the maximum number of not-runnable jobs that 
can become runnable at once, and hence spawn that many work items, and 
in turn unbound worker threads.

Several problems there.

It is fundamentally pointless to have potentially that many more threads 
than the number of CPU cores - it simply creates a scheduling storm.

Unbound workers have no CPU / cache locality either, and no connection 
with the CPU scheduler to optimize scheduling patterns. This may matter 
on both large and small systems. Whereas the current design allows the 
CPU scheduler to notice that a userspace thread keeps waking up the same 
drm scheduler kernel thread, and so keep them on the same CPU, the 
unbound workers lose that ability, and so a 2nd CPU might be getting 
woken up from deep sleep for every submission.

Hence, apart from being a bit of an impedance mismatch, the proposal has 
the potential to change performance and power patterns on both large 
and small machines.
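
(To make "regulating concurrency level is on the users" concrete - a
sketch, not something from the patch itself: a driver could funnel all
submission work items through one dedicated workqueue capped at the CPU
count, instead of using system_unbound_wq:)

static struct workqueue_struct *submit_wq;

static int submit_wq_init(void)
{
        /* At most num_online_cpus() submission work items run at once,
         * no matter how many user contexts exist. */
        submit_wq = alloc_workqueue("drm-sched-submit", WQ_UNBOUND,
                                    num_online_cpus());
        return submit_wq ? 0 : -ENOMEM;
}

/* Jobs would then be queued with queue_work(submit_wq, ...) rather
 * than on system_unbound_wq. */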

>      >>> Secondly, it probably demands separate workers (not optional),
>     otherwise
>      >>> behaviour of shared workqueues has either the potential to
>     explode number
>      >>> kernel threads anyway, or add latency.
>      >>>
>      >>
>      >> Right now the system_unbound_wq is used which does have a limit
>     on the
>      >> number of threads, right? I do have a FIXME to allow a worker to be
>      >> passed in similar to TDR.
>      >>
>      >> WRT to latency, the 1:1 ratio could actually have lower latency
>     as 2 GPU
>      >> schedulers can be pushing jobs into the backend / cleaning up
>     jobs in
>      >> parallel.
>      >>
>      >
>      > Thought of one more point here where why in Xe we absolutely want
>     a 1 to
>      > 1 ratio between entity and scheduler - the way we implement
>     timeslicing
>      > for preempt fences.
>      >
>      > Let me try to explain.
>      >
>      > Preempt fences are implemented via the generic messaging
>     interface [1]
>      > with suspend / resume messages. If a suspend message is received too
>      > soon after calling resume (this is per entity) we simply sleep in the
>      > suspend call thus giving the entity a timeslice. This completely
>     falls
>      > apart with a many to 1 relationship as now an entity waiting for a
>      > timeslice blocks the other entities. Could we work around this,
>     sure but
>      > just another bunch of code we'd have to add in Xe. Being able to
>     freely sleep
>      > in backend without affecting other entities is really, really
>     nice IMO
>      > and I bet Xe isn't the only driver that is going to feel this way.
>      >
>      > Last thing I'll say regardless of how anyone feels about Xe using
>     a 1 to
>      > 1 relationship this patch IMO makes sense as I hope we can all
>     agree a
>      > workqueue scales better than kthreads.
> 
>     I don't know for sure what will scale better and for what use case,
>     combination of CPU cores vs number of GPU engines to keep busy vs other
>     system activity. But I wager someone is bound to ask for some
>     numbers to
>     make sure proposal is not negatively affecting any other drivers.
> 
> 
> Then let them ask.  Waving your hands vaguely in the direction of the 
> rest of DRM and saying "Uh, someone (not me) might object" is profoundly 
> unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you 
> think there's someone in particular who might have a useful opinion on 
> this, throw them in the CC so they don't miss the e-mail thread.
> 
> Or are you asking for numbers?  If so, what numbers are you asking for?

It was a heads up to the Xe team in case people weren't appreciating how 
the proposed change has the potential to influence power and performance 
across the board. And nothing in the follow-up discussion made me think 
it was considered, so I don't think it was redundant to raise it.

In my experience it is typical that such core changes come with some 
numbers. Which, in the case of the drm scheduler, is tricky and probably 
requires explicitly asking everyone to test (rather than counting on 
"don't miss the email thread"). Real products can fail to ship due to 
ten mW here or there. Like suddenly an extra core prevented from getting 
into deep sleep.

If that was "profoundly unhelpful" so be it.

> Also, If we're talking about a design that might paint us into an 
> Intel-HW-specific hole, that would be one thing.  But we're not.  We're 
> talking about switching which kernel threading/task mechanism to use for 
> what's really a very generic problem.  The core Xe design works without 
> this patch (just with more kthreads).  If we land this patch or 
> something like it and get it wrong and it causes a performance problem 
> for someone down the line, we can revisit it.

For some definition of "it works" - I really wouldn't suggest shipping a 
kthread per user context at any point.

>     In any case that's a low level question caused by the high level design
>     decision. So I'd think first focus on the high level - which is the 1:1
>     mapping of entity to scheduler instance proposal.
> 
>     Fundamentally it will be up to the DRM maintainers and the community to
>     bless your approach. And it is important to stress 1:1 is about
>     userspace contexts, so I believe unlike any other current scheduler
>     user. And also important to stress this effectively does not make Xe
>     _really_ use the scheduler that much.
> 
> 
> I don't think this makes Xe nearly as much of a one-off as you think it 
> does.  I've already told the Asahi team working on Apple M1/2 hardware 
> to do it this way and it seems to be a pretty good mapping for them. I 
> believe this is roughly the plan for nouveau as well.  It's not the way 
> it currently works for anyone because most other groups aren't doing FW 
> scheduling yet.  In the world of FW scheduling and hardware designed to 
> support userspace direct-to-FW submit, I think the design makes perfect 
> sense (see below) and I expect we'll see more drivers move in this 
> direction as those drivers evolve.  (AMD is doing some customish thing 
> for how with gpu_scheduler on the front-end somehow. I've not dug into 
> those details.)
> 
>     I can only offer my opinion, which is that the two options mentioned in
>     this thread (either improve drm scheduler to cope with what is
>     required,
>     or split up the code so you can use just the parts of drm_sched which
>     you want - which is frontend dependency tracking) shouldn't be so
>     readily dismissed, given how I think the idea was for the new driver to
>     work less in a silo and more in the community (not do kludges to
>     workaround stuff because it is thought to be too hard to improve common
>     code), but fundamentally, "goto previous paragraph" for what I am
>     concerned.
> 
> 
> Meta comment:  It appears as if you're falling into the standard i915 
> team trap of having an internal discussion about what the community 
> discussion might look like instead of actually having the community 
> discussion.  If you are seriously concerned about interactions with 
>     other drivers or with setting common direction, the right way to 
> do that is to break a patch or two out into a separate RFC series and 
> tag a handful of driver maintainers.  Trying to predict the questions 
>     other people might ask is pointless. Cc them and ask for their input 
> instead.

I don't follow you here. It's not an internal discussion - I am raising 
my concerns on the design publicly. Or am I supposed to write a patch to 
show something before being allowed to comment on an RFC series?

It is "drm/sched: Convert drm scheduler to use a work queue rather than 
kthread" which should have Cc-ed _everyone_ who use drm scheduler.

> 
>     Regards,
> 
>     Tvrtko
> 
>     P.S. And as a related side note, there are more areas where drm_sched
>     could be improved, like for instance priority handling.
>     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
>     get_sched_entity to see how msm works around the drm_sched hardcoded
>     limit of available priority levels, in order to avoid having to leave a
>     hw capability unused. I suspect msm would be happier if they could have
>     all priority levels equal in terms of whether they apply only at the
>     frontend level or completely throughout the pipeline.
> 
>      > [1]
>     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
>     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
>      >
>      >>> What would be interesting to learn is whether the option of
>     refactoring
>      >>> drm_sched to deal with out of order completion was considered
>     and what were
>      >>> the conclusions.
>      >>>
>      >>
>      >> I coded this up a while back when trying to convert the i915 to
>     the DRM
>      >> scheduler it isn't all that hard either. The free flow control
>     on the
>      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
>     really what
>      >> sold me on the this design.
> 
> 
> You're not the only one to suggest supporting out-of-order completion. 
> However, it's tricky and breaks a lot of internal assumptions of the 
> scheduler. It also reduces functionality a bit because it can no longer 
> automatically rate-limit HW/FW queues which are often fixed-size.  (Ok, 
> yes, it probably could but it becomes a substantially harder problem.)
> 
> It also seems like a worse mapping to me.  The goal here is to turn 
> submissions on a userspace-facing engine/queue into submissions to a FW 
> queue, sorting out any dma_fence dependencies.  Matt's 
> description of saying this is a 1:1 mapping between sched/entity doesn't 
> tell the whole story. It's a 1:1:1 mapping between xe_engine, 
> gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?  
> Why is that better?

As I have stated before, what I think would fit well for Xe is one 
drm_scheduler per engine class. In specific terms, on our current 
hardware, that is one drm scheduler instance each for render, compute, 
blitter, video and video enhance. Userspace contexts remain scheduler 
entities.

That way you avoid the whole kthread/kworker story and you have it 
actually use the entity picking code in the scheduler, which may be 
useful when the backend is congested.

Yes, you have to solve the out-of-order problem, so in my mind that is 
something to discuss: what the problem actually is (just TDR?), how 
tricky it is, and why.

And yes you lose the handy LRCA ring buffer size management so you'd 
have to make those entities not runnable in some other way.
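
(Roughly the shape being suggested - a hedged sketch with invented xe_*
names, error handling and the drm_sched_init() setup omitted:)

enum xe_engine_class {
        XE_CLASS_RENDER, XE_CLASS_COMPUTE, XE_CLASS_COPY,
        XE_CLASS_VIDEO, XE_CLASS_VIDEO_ENHANCE, XE_CLASS_COUNT,
};

/* One scheduler instance per engine class for the whole device... */
static struct drm_gpu_scheduler class_sched[XE_CLASS_COUNT];

/* ...while each userspace context stays a plain entity on it. */
static int xe_context_entity_init(struct xe_context *ctx,
                                  enum xe_engine_class class)
{
        struct drm_gpu_scheduler *list[] = { &class_sched[class] };

        return drm_sched_entity_init(&ctx->entity,
                                     DRM_SCHED_PRIORITY_NORMAL,
                                     list, 1, NULL);
}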

Regarding the argument you raise below - would any of that make the 
frontend / backend separation worse, and why? Do you think it is less 
natural? If neither is true, then all that remains is that the extra 
work to support out-of-order completion of entities appears to have been 
discounted in favour of an easy but IMO inelegant option.

> There are two places where this 1:1:1 mapping is causing problems:
> 
>   1. It creates lots of kthreads. This is what this patch is trying to 
> solve. IDK if it's solving it the best way but that's the goal.
> 
>   2. There are a far more limited number of communication queues between 
> the kernel and GuC for more meta things like pausing and resuming 
> queues, getting events back from GuC, etc. Unless we're in a weird 
> pressure scenario, the amount of traffic on this queue should be low so 
> we can probably just have one per physical device.  The vast majority of 
> kernel -> GuC communication should be on the individual FW queue rings 
> and maybe smashing in-memory doorbells.

I don't follow your terminology here. I suppose you are talking about 
global GuC CT and context ringbuffers. If so then isn't "far more 
limited" actually one?

Regards,

Tvrtko

> Doing out-of-order completion sort-of solves 1 but does nothing for 
> 2 and actually makes managing FW queues harder because we no longer have 
> built-in rate limiting.  Seems like a net loss to me.
> 
>      >>> Second option perhaps to split out the drm_sched code into
>     parts which would
>      >>> lend themselves more to "pick and choose" of its functionalities.
>      >>> Specifically, Xe wants frontend dependency tracking, but not
>     any scheduling
>      >>> really (neither least busy drm_sched, neither FIFO/RQ entity
>     picking), so
>      >>> even having all these data structures in memory is a waste.
>      >>>
>      >>
>      >> I don't think that we are wasting memory is a very good argument for
>      >> making intrusive changes to the DRM scheduler.
> 
> 
> Worse than that, I think the "we could split it up" kind-of misses the 
> point of the way Xe is using drm/scheduler.  It's not just about 
> re-using a tiny bit of dependency tracking code.  Using the scheduler in 
> this way provides a clean separation between front-end and back-end.  
> The job of the userspace-facing ioctl code is to shove things on the 
> scheduler.  The job of the run_job callback is to encode the job into 
> the FW queue format, stick it in the FW queue ring, and maybe smash a 
> doorbell.  Everything else happens in terms of managing those queues 
> side-band.  The gpu_scheduler code manages the front-end queues and Xe 
> manages the FW queues via the Kernel <-> GuC communication rings.  From 
> a high level, this is a really clean design.  There are potentially some 
> sticky bits around the dual-use of dma_fence for scheduling and memory 
> management but none of those are solved by breaking the DRM scheduler 
> into chunks or getting rid of the 1:1:1 mapping.
> 
> If we split it out, we're basically asking the driver to implement a 
> bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.  
> It may not be all that much code but also, why?  To save a few bytes of 
> memory per engine?  Each engine already has 32K(ish) worth of context 
> state and a similar size ring to communicate with the FW.  No one is 
> going to notice an extra CPU data structure.
> 
> I'm not seeing a solid argument against the 1:1:1 design here other than 
> that it doesn't seem like the way DRM scheduler was intended to be 
> used.  I won't argue that.  It's not.  But it is a fairly natural way to 
> take advantage of the benefits the DRM scheduler does provide while also 
> mapping it to hardware that was designed for userspace direct-to-FW submit.
> 
> --Jason
> 
>      >>> With the first option then the end result could be drm_sched
>     per engine
>      >>> class (hardware view), which I think fits with the GuC model.
>     Give all
>      >>> schedulable contexts (entities) to the GuC and then mostly
>     forget about
>      >>> them. Timeslicing and re-ordering and all happens transparently
>     to the
>      >>> kernel from that point until completion.
>      >>>
>      >>
>      >> Out-of-order problem still exists here.
>      >>
>      >>> Or with the second option you would build on some smaller
>     refactored
>      >>> sub-components of drm_sched, by maybe splitting the dependency
>     tracking from
>      >>> scheduling (RR/FIFO entity picking code).
>      >>>
>      >>> Second option is especially a bit vague and I haven't thought
>     about the
>      >>> required mechanics, but it just appeared too obvious the
>     proposed design has
>      >>> a bit too much impedance mismatch.
>      >>>
>      >>
>      >> IMO ROI on this is low and again let's see what Boris comes up with.
>      >>
>      >> Matt
>      >>
>      >>> Oh and as a side note, when I went into the drm_sched code base
>     to remind
>      >>> myself how things worked, it is quite easy to find some FIXME
>     comments which
>      >>> suggest people working on it are unsure of locking design there 
>     and such. So
>      >>> perhaps that all needs cleanup too, I mean would benefit from
>      >>> refactoring/improving work as brainstormed above anyway.
>      >>>
>      >>> Regards,
>      >>>
>      >>> Tvrtko
> 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 11:28                     ` Tvrtko Ursulin
@ 2023-01-10 12:19                       ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-10 12:19 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Matthew Brost, intel-gfx, dri-devel


On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> 
> 
> On 09/01/2023 17:27, Jason Ekstrand wrote:
> 
> [snip]
> 
>>      >>> AFAICT it proposes to have 1:1 between *userspace* created
>>     contexts (per
>>      >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>     invasive changes
>>      >>> to the shared code is in the spirit of the overall idea and 
>> instead
>>      >>> opportunity should be used to look at way to refactor/improve
>>     drm_sched.
>>
>>
>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all or 
>> really needs to drive a re-factor.  (More on that later.)  There's 
>> only one real issue which is that it fires off potentially a lot of 
>> kthreads. Even that's not that bad given that kthreads are pretty 
>> light and you're not likely to have more kthreads than userspace 
>> threads which are much heavier.  Not ideal, but not the end of the 
>> world either.  Definitely something we can/should optimize but if we 
>> went through with Xe without this patch, it would probably be mostly ok.
>>
>>      >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>      >>
>>      >> I'm not really prepared to make large changes to DRM scheduler
>>     at the
>>      >> moment for Xe as they are not really required nor does Boris
>>     seem they
>>      >> will be required for his work either. I am interested to see
>>     what Boris
>>      >> comes up with.
>>      >>
>>      >>> Even on the low level, the idea to replace drm_sched threads
>>     with workers
>>      >>> has a few problems.
>>      >>>
>>      >>> To start with, the pattern of:
>>      >>>
>>      >>>    while (not_stopped) {
>>      >>>     keep picking jobs
>>      >>>    }
>>      >>>
>>      >>> Feels fundamentally in disagreement with workers (while
>>     obviously fits
>>      >>> perfectly with the current kthread design).
>>      >>
>>      >> The while loop breaks and worker exits if no jobs are ready.
>>
>>
>> I'm not very familiar with workqueues. What are you saying would fit 
>> better? One scheduling job per work item rather than one big work item 
>> which handles all available jobs?
> 
> Yes and no; it indeed IMO does not fit to have a work item which is 
> potentially unbounded in runtime. But that conceptual mismatch is a bit 
> moot, because it is a worst-case / theoretical concern, and I think the 
> objection is really due to more fundamental concerns.
> 
> If we have to go back to the low level side of things, I've picked this 
> random spot to consolidate what I have already mentioned and perhaps 
> expand.
> 
> To start with, let me pull out some thoughts from workqueue.rst:
> 
> """
> Generally, work items are not expected to hog a CPU and consume many 
> cycles. That means maintaining just enough concurrency to prevent work 
> processing from stalling should be optimal.
> """
> 
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
> 
> Given the unbound queues will be spawned on demand to service all queued 
> work items (more interesting when mixing up with the system_unbound_wq), 
> in the proposed design the number of instantiated worker threads does 
> not correspond to the number of user threads (as you have elsewhere 
> stated), but pessimistically to the number of active user contexts. That 
> is the number which drives the maximum number of not-runnable jobs that 
> can become runnable at once, and hence spawn that many work items, and 
> in turn unbound worker threads.
> 
> Several problems there.
> 
> It is fundamentally pointless to have potentially that many more threads 
> than the number of CPU cores - it simply creates a scheduling storm.

To make matters worse, if I follow the code correctly, all these 
per-user-context worker threads / work items end up contending on the 
same lock or circular buffer, both of which are one instance per GPU:

guc_engine_run_job
  -> submit_engine
     a) wq_item_append
         -> wq_wait_for_space
           -> msleep
     b) xe_guc_ct_send
         -> guc_ct_send
           -> mutex_lock(&ct->lock);
           -> later a potential msleep in h2g_has_room
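
(Schematically, with invented names - every per-context work item
funnels into the same per-GPU lock, so the extra worker threads mostly
just queue up behind it:)

static void context_submit_worker(struct work_struct *w)
{
        struct ctx *ctx = container_of(w, struct ctx, submit_work);
        struct gpu *gpu = ctx->gpu;

        mutex_lock(&gpu->ct_lock);      /* one instance per GPU */
        while (!ring_has_space(gpu))    /* hypothetical helper */
                msleep(1);
        write_to_ring(gpu, ctx);        /* hypothetical helper */
        mutex_unlock(&gpu->ct_lock);
}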

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2023-01-10 12:33   ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-10 12:33 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

+Frank, who's also working on the pvr uAPI.

Hi,

On Thu, 22 Dec 2022 14:21:07 -0800
Matthew Brost <matthew.brost@intel.com> wrote:

> The code has been organized such that we have all patches that touch areas
> outside of drm/xe first for review, and then the actual new driver in a separate
> commit. The code which is outside of drm/xe is included in this RFC while
> drm/xe is not due to the size of the commit. The drm/xe code is available in
> a public repo listed below.
> 
> Xe driver commit:
> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
> 
> Xe kernel repo:
> https://cgit.freedesktop.org/drm/drm-xe/

Sorry to hijack this thread, again, but I'm currently working on the
pancsf uAPI, and I was wondering how DRM maintainers/developers felt
about the new direction taken by the Xe driver on some aspects of their
uAPI (to decide if I should copy these patterns or go the old way):

- plan for ioctl extensions through '__u64 extensions;' fields (the
  vulkan way, basically; see the sketch after this list)
- turning GETPARAM into DEV_QUERY, which can return more than a 64-bit
  integer at a time
- having ioctls taking sub-operations instead of one ioctl per
  operation (I'm referring to VM_BIND here, which handles map, unmap,
  restart, ... through a single entry point)
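
For context, the extension-chaining pattern in the first item looks
roughly like this (a sketch modelled on the i915-style user-extension
struct; the names are illustrative, not the final Xe uAPI):

struct xe_user_extension {
	__u64 next_extension;	/* user pointer to the next one, 0 ends chain */
	__u32 name;		/* identifies which extension this is */
	__u32 pad;
};

struct xe_example_ioctl_args {
	__u64 extensions;	/* chain of xe_user_extension, 0 if none */
	/* ... fixed arguments of the ioctl ... */
};

New features then become new extension structs appended to the chain
rather than new ioctl versions.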

Regards,

Boris

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 11:28                     ` Tvrtko Ursulin
@ 2023-01-10 14:08                       ` Jason Ekstrand
  -1 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-10 14:08 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Matthew Brost, intel-gfx, dri-devel

[-- Attachment #1: Type: text/plain, Size: 26300 bytes --]

On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin <
tvrtko.ursulin@linux.intel.com> wrote:

>
>
> On 09/01/2023 17:27, Jason Ekstrand wrote:
>
> [snip]
>
> >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> >      >>> contexts (per context _and_ engine) and drm_sched. I am not
> >      >>> sure avoiding invasive changes to the shared code is in the
> >      >>> spirit of the overall idea and instead the opportunity should
> >      >>> be used to look at ways to refactor/improve drm_sched.
> >
> >
> > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > really needs to drive a re-factor.  (More on that later.)  There's only
> > one real issue which is that it fires off potentially a lot of kthreads.
> > Even that's not that bad given that kthreads are pretty light and you're
> > not likely to have more kthreads than userspace threads which are much
> > heavier.  Not ideal, but not the end of the world either.  Definitely
> > something we can/should optimize but if we went through with Xe without
> > this patch, it would probably be mostly ok.
> >
> >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >>
> >      >> I'm not really prepared to make large changes to DRM scheduler
> >     at the
> >      >> moment for Xe as they are not really required nor does Boris
> >     seem they
> >      >> will be required for his work either. I am interested to see
> >     what Boris
> >      >> comes up with.
> >      >>
> >      >>> Even on the low level, the idea to replace drm_sched threads
> >     with workers
> >      >>> has a few problems.
> >      >>>
> >      >>> To start with, the pattern of:
> >      >>>
> >      >>>    while (not_stopped) {
> >      >>>     keep picking jobs
> >      >>>    }
> >      >>>
> >      >>> Feels fundamentally in disagreement with workers (while
> >     obviously fits
> >      >>> perfectly with the current kthread design).
> >      >>
> >      >> The while loop breaks and the worker exits if no jobs are ready.
> >
> >
> > I'm not very familiar with workqueues. What are you saying would fit
> > better? One scheduling job per work item rather than one big work item
> > which handles all available jobs?
>
> Yes and no, it indeed IMO does not fit to have a work item which is
> potentially unbounded in runtime. But it is a bit of a moot conceptual
> mismatch because it is a worst case / theoretical one, and I think due
> to more fundamental concerns.
>
> If we have to go back to the low level side of things, I've picked this
> random spot to consolidate what I have already mentioned and perhaps
> expand.
>
> To start with, let me pull out some thoughts from workqueue.rst:
>
> """
> Generally, work items are not expected to hog a CPU and consume many
> cycles. That means maintaining just enough concurrency to prevent work
> processing from stalling should be optimal.
> """
>
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
>
> Given the unbound queues will be spawned on demand to service all queued
> work items (more interesting when mixing up with the system_unbound_wq),
> in the proposed design the number of instantiated worker threads does
> not correspond to the number of user threads (as you have elsewhere
> stated), but pessimistically to the number of active user contexts.


Those are pretty much the same in practice.  Rather, the number of user
threads is typically an upper bound on the number of contexts.  Yes, a
single user thread could have a bunch of contexts but basically nothing
does that except IGT.  In real-world usage, it's at most one context per
user thread.


> That
> is the number which drives the maximum number of not-runnable jobs that
> can become runnable at once, and hence spawn that many work items, and
> in turn unbound worker threads.
>
> Several problems there.
>
> It is fundamentally pointless to have potentially that many more threads
> than the number of CPU cores - it simply creates a scheduling storm.
>
> Unbound workers have no CPU / cache locality either and no connection
> with the CPU scheduler to optimize scheduling patterns. This may matter
> either on large systems or on small ones. Whereas the current design
> allows for the scheduler to notice that a userspace CPU thread keeps
> waking up the same drm scheduler kernel thread, and so it can keep them
> on the same CPU, the unbound workers lose that ability and so a 2nd CPU
> might be getting woken up from low sleep for every submission.
>
> Hence, apart from being a bit of an impedance mismatch, the proposal has
> the potential to change performance and power patterns on both large
> and small machines.
>

Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
deferred kwork does appear to mismatch somewhat with what the scheduler
needs or at least how it's worked in the past.  How much impact will that
mismatch have?  Unclear.


> >      >>> Secondly, it probably demands separate workers (not optional),
> >      >>> otherwise behaviour of shared workqueues has either the
> >      >>> potential to explode the number of kernel threads anyway, or to
> >      >>> add latency.
> >      >>>
> >      >>
> >      >> Right now the system_unbound_wq is used which does have a limit
> >      >> on the number of threads, right? I do have a FIXME to allow a
> >      >> worker to be passed in, similar to TDR.
> >      >>
> >      >> WRT latency, the 1:1 ratio could actually have lower latency
> >      >> as 2 GPU schedulers can be pushing jobs into the backend /
> >      >> cleaning up jobs in parallel.
> >      >>
> >      >
> >      > Thought of one more point here on why in Xe we absolutely want
> >      > a 1 to 1 ratio between entity and scheduler - the way we
> >      > implement timeslicing for preempt fences.
> >      >
> >      > Let me try to explain.
> >      >
> >      > Preempt fences are implemented via the generic messaging
> >      > interface [1] with suspend / resume messages. If a suspend
> >      > message is received too soon after calling resume (this is per
> >      > entity) we simply sleep in the suspend call, thus giving the
> >      > entity a timeslice. This completely falls apart with a many to 1
> >      > relationship as now an entity waiting for a timeslice blocks the
> >      > other entities. Could we work around this, sure, but that's just
> >      > another bunch of code we'd have to add in Xe. Being able to
> >      > freely sleep in the backend without affecting other entities is
> >      > really, really nice IMO and I bet Xe isn't the only driver that
> >      > is going to feel this way.
> >      >
> >      > Last thing I'll say: regardless of how anyone feels about Xe
> >      > using a 1 to 1 relationship, this patch IMO makes sense as I hope
> >      > we can all agree a workqueue scales better than kthreads.
> >
> >     I don't know for sure what will scale better and for what use case,
> >     combination of CPU cores vs number of GPU engines to keep busy vs
> >     other system activity. But I wager someone is bound to ask for some
> >     numbers to make sure the proposal is not negatively affecting any
> >     other drivers.
> >
> >
> > Then let them ask.  Waving your hands vaguely in the direction of the
> > rest of DRM and saying "Uh, someone (not me) might object" is profoundly
> > unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
> > think there's someone in particular who might have a useful opinion on
> > this, throw them in the CC so they don't miss the e-mail thread.
> >
> > Or are you asking for numbers?  If so, what numbers are you asking for?
>
> It was a heads up to the Xe team in case people weren't appreciating how
> the proposed change has the potential to influence power and performance
> across the board. And nothing in the follow up discussion made me think
> it was considered, so I don't think it was redundant to raise it.
>
> In my experience it is typical that such core changes come with some
> numbers. Which in the case of drm scheduler is tricky and probably
> requires explicitly asking everyone to test (rather than counting on
> "don't miss the email thread"). Real products can fail to ship due to
> ten mW here or there. Like suddenly an extra core prevented from
> getting into deep sleep.
>
> If that was "profoundly unhelpful" so be it.
>

With your above explanation, it makes more sense what you're asking.  It's
still not something Matt is likely to be able to provide on his own.  We
need to tag some other folks and ask them to test it out.  We could play
around a bit with it on Xe but it's not exactly production grade yet and is
going to hit this differently from most.  Likely candidates are probably
AMD and Freedreno.


> > Also, if we're talking about a design that might paint us into an
> > Intel-HW-specific hole, that would be one thing.  But we're not.  We're
> > talking about switching which kernel threading/task mechanism to use for
> > what's really a very generic problem.  The core Xe design works without
> > this patch (just with more kthreads).  If we land this patch or
> > something like it and get it wrong and it causes a performance problem
> > for someone down the line, we can revisit it.
>
> For some definition of "it works" - I really wouldn't suggest shipping a
> kthread per user context at any point.
>

You have yet to elaborate on why. What resources is it consuming that's
going to be a problem? Are you anticipating CPU affinity problems? Or does
it just seem wasteful?

I think I largely agree that it's probably unnecessary/wasteful but
reducing the number of kthreads seems like a tractable problem to solve
regardless of where we put the gpu_scheduler object.  Is this the right
solution?  Maybe not.  It was also proposed at one point that we could
split the scheduler into two pieces: A scheduler which owns the kthread,
and a back-end which targets some HW ring thing where you can have multiple
back-ends per scheduler.  That's certainly more invasive from a DRM
scheduler internal API PoV but would solve the kthread problem in a way
that's more similar to what we have now.
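
A rough sketch of that split, with invented type layouts purely for
illustration (not an actual DRM scheduler proposal):

/* Hypothetical front-end: owns the one kthread, keeps dependency
 * tracking and entity picking, and fans out to N back-ends.
 */
struct drm_gpu_scheduler_split {
	struct task_struct *thread;	/* the single kthread, as today */
	struct list_head backends;	/* multiple back-ends per scheduler */
};

/* Hypothetical back-end: targets one HW/FW ring. */
struct drm_sched_backend {
	struct list_head link;		/* on drm_gpu_scheduler_split.backends */
	void (*submit)(struct drm_sched_backend *be,
		       struct drm_sched_job *job);
};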


> >     In any case that's a low level question caused by the high level
> >     design decision. So I'd think first focus on the high level - which
> >     is the 1:1 mapping of entity to scheduler instance proposal.
> >
> >     Fundamentally it will be up to the DRM maintainers and the community
> >     to bless your approach. And it is important to stress 1:1 is about
> >     userspace contexts, so I believe unlike any other current scheduler
> >     user. And also important to stress this effectively does not make Xe
> >     _really_ use the scheduler that much.
> >
> >
> > I don't think this makes Xe nearly as much of a one-off as you think it
> > does.  I've already told the Asahi team working on Apple M1/2 hardware
> > to do it this way and it seems to be a pretty good mapping for them. I
> > believe this is roughly the plan for nouveau as well.  It's not the way
> > it currently works for anyone because most other groups aren't doing FW
> > scheduling yet.  In the world of FW scheduling and hardware designed to
> > support userspace direct-to-FW submit, I think the design makes perfect
> > sense (see below) and I expect we'll see more drivers move in this
> > direction as those drivers evolve.  (AMD is doing some customish thing
> > with gpu_scheduler on the front-end somehow. I've not dug into
> > those details.)
> >
> >     I can only offer my opinion, which is that the two options mentioned
> >     in this thread (either improve drm scheduler to cope with what is
> >     required, or split up the code so you can use just the parts of
> >     drm_sched which you want - which is frontend dependency tracking)
> >     shouldn't be so readily dismissed, given how I think the idea was for
> >     the new driver to work less in a silo and more in the community (not
> >     do kludges to workaround stuff because it is thought to be too hard
> >     to improve common code), but fundamentally, "goto previous paragraph"
> >     for what I am concerned.
> >
> >
> > Meta comment:  It appears as if you're falling into the standard i915
> > team trap of having an internal discussion about what the community
> > discussion might look like instead of actually having the community
> > discussion.  If you are seriously concerned about interactions with
> > other drivers or about setting common direction, the right way to
> > do that is to break a patch or two out into a separate RFC series and
> > tag a handful of driver maintainers.  Trying to predict the questions
> > other people might ask is pointless. Cc them and ask for their input
> > instead.
>
> I don't follow you here. It's not an internal discussion - I am raising
> my concerns on the design publicly. Am I supposed to write a patch to
> show something before I am allowed to comment on a RFC series?
>

I may have misread your tone a bit.  It felt a bit like too many
discussions I've had in the past where people are trying to predict what
others will say instead of just asking them.  Reading it again, I was
probably jumping to conclusions a bit.  Sorry about that.


> It is "drm/sched: Convert drm scheduler to use a work queue rather than
> kthread" which should have Cc-ed _everyone_ who use drm scheduler.
>

Yeah, it probably should have.  I think that's mostly what I've been trying
to say.


> >
> >     Regards,
> >
> >     Tvrtko
> >
> >     P.S. And as a related side note, there are more areas where drm_sched
> >     could be improved, like for instance priority handling.
> >     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> >     get_sched_entity to see how msm works around the drm_sched hardcoded
> >     limit of available priority levels, in order to avoid having to
> >     leave a hw capability unused. I suspect msm would be happier if they
> >     could have all priority levels equal in terms of whether they apply
> >     only at the frontend level or completely throughout the pipeline.
> >
> >      > [1]
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >
> >      >
> >      >>> What would be interesting to learn is whether the option of
> >     refactoring
> >      >>> drm_sched to deal with out of order completion was considered
> >     and what were
> >      >>> the conclusions.
> >      >>>
> >      >>
> >      >> I coded this up a while back when trying to convert the i915 to
> >      >> the DRM scheduler; it isn't all that hard either. The free flow
> >      >> control on the ring (e.g. set job limit == SIZE OF RING / MAX
> >      >> JOB SIZE) is really what sold me on this design.
> >
> >
> > You're not the only one to suggest supporting out-of-order completion.
> > However, it's tricky and breaks a lot of internal assumptions of the
> > scheduler. It also reduces functionality a bit because it can no longer
> > automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
> > yes, it probably could but it becomes a substantially harder problem.)
> >
> > It also seems like a worse mapping to me.  The goal here is to turn
> > submissions on a userspace-facing engine/queue into submissions to a FW
> > queue, sorting out any dma_fence dependencies.  Matt's
> > description of this as a 1:1 mapping between sched/entity doesn't
> > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
> > Why is that better?
>
> As I have stated before, what I think would fit well for Xe is one
> drm_scheduler per engine class. In specific terms on our current
> hardware, one drm scheduler instance for render, compute, blitter, video
> and video enhance. Userspace contexts remain scheduler entities.
>

And this is where we fairly strongly disagree.  More in a bit.


> That way you avoid the whole kthread/kworker story and you have it
> actually use the entity picking code in the scheduler, which may be
> useful when the backend is congested.
>

What back-end congestion are you referring to here?  Running out of FW
queue IDs?  Something else?


> Yes you have to solve the out of order problem so in my mind that is
> something to discuss. What the problem actually is (just TDR?), how
> tricky and why etc.
>
> And yes you lose the handy LRCA ring buffer size management so you'd
> have to make those entities not runnable in some other way.
>
> Regarding the argument you raise below - would any of that make the
> frontend / backend separation worse and why? Do you think it is less
> natural? If neither is true then all that remains is that the extra
> work to support out of order completion of entities appears to have been
> discounted in favour of an easy but IMO inelegant option.
>

Broadly speaking, the kernel needs to stop thinking about GPU scheduling in
terms of scheduling jobs and start thinking in terms of scheduling
contexts/engines.  There is still some need for scheduling individual jobs
but that is only for the purpose of delaying them as needed to resolve
dma_fence dependencies.  Once dependencies are resolved, they get shoved
onto the context/engine queue and from there the kernel only really manages
whole contexts/engines.  This is a major architectural shift, entirely
different from the way i915 scheduling works.  It's also different from the
historical usage of DRM scheduler which I think is why this all looks a bit
funny.

To justify this architectural shift, let's look at where we're headed.  In
the glorious future...

 1. Userspace submits directly to firmware queues.  The kernel has no
visibility whatsoever into individual jobs.  At most it can pause/resume FW
contexts as needed to handle eviction and memory management.

 2. Because of 1, apart from handing out the FW queue IDs at the beginning,
the kernel can't really juggle them that much.  Depending on FW design, it
may be able to pause a client, give its IDs to another, and then resume it
later when IDs free up.  What it's not doing is juggling IDs on a
job-by-job basis like i915 currently is.

 3. Long-running compute jobs may not complete for days.  This means that
memory management needs to happen in terms of pause/resume of entire
contexts/engines using the memory rather than based on waiting for
individual jobs to complete or pausing individual jobs until the memory is
available.

 4. Synchronization happens via userspace memory fences (UMF) and the
kernel is mostly unaware of most dependencies and when a context/engine is
or is not runnable.  Instead, it keeps as many of them minimally active
(memory is available, even if it's in system RAM) as possible and lets the
FW sort out dependencies.  (There may need to be some facility for sleeping
a context until a memory change similar to futex() or poll() for userspace
threads.  There are some details TBD.)
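
For readers unfamiliar with UMFs, the primitive itself is tiny; a
sketch in plain C11, not any driver's actual uAPI (the futex()-like
kernel wait mentioned above would replace the busy-wait):

#include <stdatomic.h>
#include <stdint.h>

/* A userspace memory fence: a 64-bit seqno in GPU- and CPU-visible memory. */
struct umf {
	_Atomic uint64_t *addr;
};

static void umf_signal(struct umf *f, uint64_t value)
{
	atomic_store_explicit(f->addr, value, memory_order_release);
}

static void umf_wait(struct umf *f, uint64_t value)
{
	while (atomic_load_explicit(f->addr, memory_order_acquire) < value)
		;	/* real code would sleep here, futex()-style */
}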

Are there potential problems that will need to be solved here?  Yes.  Is it
a good design?  Well, Microsoft has been living in this future for half a
decade or better and it's working quite well for them.  It's also the way
all modern game consoles work.  It really is just Linux that's stuck with
the same old job model we've had since the monumental shift to DRI2.

To that end, one of the core goals of the Xe project was to make the driver
internally behave as close to the above model as possible while keeping the
old-school job model as a very thin layer on top.  As the broader ecosystem
problems (window-system support for UMF, for instance) are solved, that
layer can be peeled back.  The core driver will already be ready for it.

To that end, the point of the DRM scheduler in Xe isn't to schedule jobs.
It's to resolve syncobj and dma-buf implicit sync dependencies and stuff
jobs into their respective context/engine queue once they're ready.  All
the actual scheduling happens in firmware and any scheduling the kernel
does to deal with contention, oversubscriptions, too many contexts, etc. is
between contexts/engines, not individual jobs.  Sure, the individual job
visibility is nice, but if we design around it, we'll never get to the
glorious future.
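
In code terms, the run_job back-end of such a design collapses to
something like this (a sketch; to_fw_queue and the fw_* helpers are
invented, and flow control falls out of sizing the scheduler's job
limit to the FW ring as quoted earlier in the thread):

/* Dependencies were already resolved by drm_sched, so run_job just
 * encodes the job into the FW queue and notifies the firmware.
 */
static struct dma_fence *fw_queue_run_job(struct drm_sched_job *sched_job)
{
	struct fw_queue *q = to_fw_queue(sched_job);	/* invented */

	fw_ring_write(q->ring, sched_job);	/* encode into FW format */
	fw_doorbell_ring(q->doorbell);		/* maybe smash a doorbell */

	/* invented: the fence the FW signals when the job completes */
	return fw_job_hw_fence(sched_job);
}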

I really need to turn the above (with a bit more detail) into a blog
post.... Maybe I'll do that this week.

In any case, I hope that provides more insight into why Xe is designed the
way it is and why I'm pushing back so hard on trying to make it more of a
"classic" driver as far as scheduling is concerned.  Are there potential
problems here?  Yes, that's why Xe has been labeled a prototype.  Are such
radical changes necessary to get to said glorious future?  Yes, I think
they are.  Will it be worth it?  I believe so.

> > There are two places where this 1:1:1 mapping is causing problems:
> >
> >   1. It creates lots of kthreads. This is what this patch is trying to
> > solve. IDK if it's solving it the best way but that's the goal.
> >
> >   2. There are a far more limited number of communication queues between
> > the kernel and GuC for more meta things like pausing and resuming
> > queues, getting events back from GuC, etc. Unless we're in a weird
> > pressure scenario, the amount of traffic on this queue should be low so
> > we can probably just have one per physical device.  The vast majority of
> > kernel -> GuC communication should be on the individual FW queue rings
> > and maybe smashing in-memory doorbells.
>
> I don't follow your terminology here. I suppose you are talking about
> global GuC CT and context ringbuffers. If so then isn't "far more
> limited" actually one?
>

I thought there could be more than one but I think in practice it's just
the one.

--Jason



> Regards,
>
> Tvrtko
>
> > Doing out-of-order completion sort-of solves 1 but does nothing for
> > 2 and actually makes managing FW queues harder because we no longer have
> > built-in rate limiting.  Seems like a net loss to me.
> >
> >      >>> Second option perhaps to split out the drm_sched code into
> >     parts which would
> >      >>> lend themselves more to "pick and choose" of its
> functionalities.
> >      >>> Specifically, Xe wants frontend dependency tracking, but not
> >     any scheduling
> >      >>> really (neither least busy drm_sched, neither FIFO/RQ entity
> >     picking), so
> >      >>> even having all these data structures in memory is a waste.
> >      >>>
> >      >>
> >      >> I don't think "we are wasting memory" is a very good argument
> >      >> for making intrusive changes to the DRM scheduler.
> >
> >
> > Worse than that, I think the "we could split it up" kind-of misses the
> > point of the way Xe is using drm/scheduler.  It's not just about
> > re-using a tiny bit of dependency tracking code.  Using the scheduler in
> > this way provides a clean separation between front-end and back-end.
> > The job of the userspace-facing ioctl code is to shove things on the
> > scheduler.  The job of the run_job callback is to encode the job into
> > the FW queue format, stick it in the FW queue ring, and maybe smash a
> > doorbell.  Everything else happens in terms of managing those queues
> > side-band.  The gpu_scheduler code manages the front-end queues and Xe
> > manages the FW queues via the Kernel <-> GuC communication rings.  From
> > a high level, this is a really clean design.  There are potentially some
> > sticky bits around the dual-use of dma_fence for scheduling and memory
> > management but none of those are solved by breaking the DRM scheduler
> > into chunks or getting rid of the 1:1:1 mapping.
> >
> > If we split it out, we're basically asking the driver to implement a
> > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.
> > It may not be all that much code but also, why?  To save a few bytes of
> > memory per engine?  Each engine already has 32K(ish) worth of context
> > state and a similar size ring to communicate with the FW.  No one is
> > going to notice an extra CPU data structure.
> >
> > I'm not seeing a solid argument against the 1:1:1 design here other than
> > that it doesn't seem like the way DRM scheduler was intended to be
> > used.  I won't argue that.  It's not.  But it is a fairly natural way to
> > take advantage of the benefits the DRM scheduler does provide while also
> > mapping it to hardware that was designed for userspace direct-to-FW
> submit.
> >
> > --Jason
> >
> >      >>> With the first option then the end result could be drm_sched
> >     per engine
> >      >>> class (hardware view), which I think fits with the GuC model.
> >     Give all
> >      >>> schedulable contexts (entities) to the GuC and then mostly
> >     forget about
> >      >>> them. Timeslicing and re-ordering and all happens transparently
> >     to the
> >      >>> kernel from that point until completion.
> >      >>>
> >      >>
> >      >> Out-of-order problem still exists here.
> >      >>
> >      >>> Or with the second option you would build on some smaller
> >     refactored
> >      >>> sub-components of drm_sched, by maybe splitting the dependency
> >     tracking from
> >      >>> scheduling (RR/FIFO entity picking code).
> >      >>>
> >      >>> Second option is especially a bit vague and I haven't thought
> >     about the
> >      >>> required mechanics, but it just appeared too obvious the
> >     proposed design has
> >      >>> a bit too much impedance mismatch.
> >      >>>
> >      >>
> >      >> IMO ROI on this is low and again let's see what Boris comes up
> >      >> with.
> >      >>
> >      >> Matt
> >      >>
> >      >>> Oh and as a side note, when I went into the drm_sched code base
> >      >>> to remind myself how things worked, it is quite easy to find
> >      >>> some FIXME comments which suggest people working on it are
> >      >>> unsure of the locking design there and such. So perhaps that all
> >      >>> needs cleanup too; I mean it would benefit from
> >      >>> refactoring/improving work as brainstormed above anyway.
> >      >>>
> >      >>> Regards,
> >      >>>
> >      >>> Tvrtko
> >
>

[-- Attachment #2: Type: text/html, Size: 33105 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-10 14:08                       ` Jason Ekstrand
  0 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-10 14:08 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

[-- Attachment #1: Type: text/plain, Size: 26300 bytes --]

On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin <
tvrtko.ursulin@linux.intel.com> wrote:

>
>
> On 09/01/2023 17:27, Jason Ekstrand wrote:
>
> [snip]
>
> >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> >     contexts (per
> >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> >     invasive changes
> >      >>> to the shared code is in the spirit of the overall idea and
> instead
> >      >>> opportunity should be used to look at way to refactor/improve
> >     drm_sched.
> >
> >
> > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > really needs to drive a re-factor.  (More on that later.)  There's only
> > one real issue which is that it fires off potentially a lot of kthreads.
> > Even that's not that bad given that kthreads are pretty light and you're
> > not likely to have more kthreads than userspace threads which are much
> > heavier.  Not ideal, but not the end of the world either.  Definitely
> > something we can/should optimize but if we went through with Xe without
> > this patch, it would probably be mostly ok.
> >
> >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >>
> >      >> I'm not really prepared to make large changes to DRM scheduler
> >     at the
> >      >> moment for Xe as they are not really required nor does Boris
> >     seem they
> >      >> will be required for his work either. I am interested to see
> >     what Boris
> >      >> comes up with.
> >      >>
> >      >>> Even on the low level, the idea to replace drm_sched threads
> >     with workers
> >      >>> has a few problems.
> >      >>>
> >      >>> To start with, the pattern of:
> >      >>>
> >      >>>    while (not_stopped) {
> >      >>>     keep picking jobs
> >      >>>    }
> >      >>>
> >      >>> Feels fundamentally in disagreement with workers (while
> >     obviously fits
> >      >>> perfectly with the current kthread design).
> >      >>
> >      >> The while loop breaks and worker exists if no jobs are ready.
> >
> >
> > I'm not very familiar with workqueues. What are you saying would fit
> > better? One scheduling job per work item rather than one big work item
> > which handles all available jobs?
>
> Yes and no, it indeed IMO does not fit to have a work item which is
> potentially unbound in runtime. But it is a bit moot conceptual mismatch
> because it is a worst case / theoretical, and I think due more
> fundamental concerns.
>
> If we have to go back to the low level side of things, I've picked this
> random spot to consolidate what I have already mentioned and perhaps
> expand.
>
> To start with, let me pull out some thoughts from workqueue.rst:
>
> """
> Generally, work items are not expected to hog a CPU and consume many
> cycles. That means maintaining just enough concurrency to prevent work
> processing from stalling should be optimal.
> """
>
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
>
> Given the unbound queues will be spawned on demand to service all queued
> work items (more interesting when mixing up with the system_unbound_wq),
> in the proposed design the number of instantiated worker threads does
> not correspond to the number of user threads (as you have elsewhere
> stated), but pessimistically to the number of active user contexts.


Those are pretty much the same in practice.  Rather, user threads is
typically an upper bound on the number of contexts.  Yes, a single user
thread could have a bunch of contexts but basically nothing does that
except IGT.  In real-world usage, it's at most one context per user thread.


> That
> is the number which drives the maximum number of not-runnable jobs that
> can become runnable at once, and hence spawn that many work items, and
> in turn unbound worker threads.
>
> Several problems there.
>
> It is fundamentally pointless to have potentially that many more threads
> than the number of CPU cores - it simply creates a scheduling storm.
>
> Unbound workers have no CPU / cache locality either and no connection
> with the CPU scheduler to optimize scheduling patterns. This may matter
> either on large systems or on small ones. Whereas the current design
> allows for scheduler to notice userspace CPU thread keeps waking up the
> same drm scheduler kernel thread, and so it can keep them on the same
> CPU, the unbound workers lose that ability and so 2nd CPU might be
> getting woken up from low sleep for every submission.
>
> Hence, apart from being a bit of a impedance mismatch, the proposal has
> the potential to change performance and power patterns and both large
> and small machines.
>

Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
deferred kwork does appear to mismatch somewhat with what the scheduler
needs or at least how it's worked in the past.  How much impact will that
mismatch have?  Unclear.


> >      >>> Secondly, it probably demands separate workers (not optional),
> >     otherwise
> >      >>> behaviour of shared workqueues has either the potential to
> >     explode number
> >      >>> kernel threads anyway, or add latency.
> >      >>>
> >      >>
> >      >> Right now the system_unbound_wq is used which does have a limit
> >     on the
> >      >> number of threads, right? I do have a FIXME to allow a worker to
> be
> >      >> passed in similar to TDR.
> >      >>
> >      >> WRT to latency, the 1:1 ratio could actually have lower latency
> >     as 2 GPU
> >      >> schedulers can be pushing jobs into the backend / cleaning up
> >     jobs in
> >      >> parallel.
> >      >>
> >      >
> >      > Thought of one more point here where why in Xe we absolutely want
> >     a 1 to
> >      > 1 ratio between entity and scheduler - the way we implement
> >     timeslicing
> >      > for preempt fences.
> >      >
> >      > Let me try to explain.
> >      >
> >      > Preempt fences are implemented via the generic messaging
> >     interface [1]
> >      > with suspend / resume messages. If a suspend messages is received
> to
> >      > soon after calling resume (this is per entity) we simply sleep in
> the
> >      > suspend call thus giving the entity a timeslice. This completely
> >     falls
> >      > apart with a many to 1 relationship as now a entity waiting for a
> >      > timeslice blocks the other entities. Could we work aroudn this,
> >     sure but
> >      > just another bunch of code we'd have to add in Xe. Being to
> >     freely sleep
> >      > in backend without affecting other entities is really, really
> >     nice IMO
> >      > and I bet Xe isn't the only driver that is going to feel this way.
> >      >
> >      > Last thing I'll say regardless of how anyone feels about Xe using
> >     a 1 to
> >      > 1 relationship this patch IMO makes sense as I hope we can all
> >     agree a
> >      > workqueue scales better than kthreads.
> >
> >     I don't know for sure what will scale better and for what use case,
> >     combination of CPU cores vs number of GPU engines to keep busy vs
> other
> >     system activity. But I wager someone is bound to ask for some
> >     numbers to
> >     make sure proposal is not negatively affecting any other drivers.
> >
> >
> > Then let them ask.  Waving your hands vaguely in the direction of the
> > rest of DRM and saying "Uh, someone (not me) might object" is profoundly
> > unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
> > think there's someone in particular who might have a useful opinion on
> > this, throw them in the CC so they don't miss the e-mail thread.
> >
> > Or are you asking for numbers?  If so, what numbers are you asking for?
>
> It was a heads up to the Xe team in case people weren't appreciating how
> the proposed change has the potential influence power and performance
> across the board. And nothing in the follow up discussion made me think
> it was considered so I don't think it was redundant to raise it.
>
> In my experience it is typical that such core changes come with some
> numbers. Which is in case of drm scheduler is tricky and probably
> requires explicitly asking everyone to test (rather than count on "don't
> miss the email thread"). Real products can fail to ship due ten mW here
> or there. Like suddenly an extra core prevented from getting into deep
> sleep.
>
> If that was "profoundly unhelpful" so be it.
>

With your above explanation, it makes more sense what you're asking.  It's
still not something Matt is likely to be able to provide on his own.  We
need to tag some other folks and ask them to test it out.  We could play
around a bit with it on Xe but it's not exactly production grade yet and is
going to hit this differently from most.  Likely candidates are probably
AMD and Freedreno.


> > Also, If we're talking about a design that might paint us into an
> > Intel-HW-specific hole, that would be one thing.  But we're not.  We're
> > talking about switching which kernel threading/task mechanism to use for
> > what's really a very generic problem.  The core Xe design works without
> > this patch (just with more kthreads).  If we land this patch or
> > something like it and get it wrong and it causes a performance problem
> > for someone down the line, we can revisit it.
>
> For some definition of "it works" - I really wouldn't suggest shipping a
> kthread per user context at any point.
>

You have yet to elaborate on why. What resources is it consuming that's
going to be a problem? Are you anticipating CPU affinity problems? Or does
it just seem wasteful?

I think I largely agree that it's probably unnecessary/wasteful but
reducing the number of kthreads seems like a tractable problem to solve
regardless of where we put the gpu_scheduler object.  Is this the right
solution?  Maybe not.  It was also proposed at one point that we could
split the scheduler into two pieces: A scheduler which owns the kthread,
and a back-end which targets some HW ring thing where you can have multiple
back-ends per scheduler.  That's certainly more invasive from a DRM
scheduler internal API PoV but would solve the kthread problem in a way
that's more similar to what we have now.


> >     In any case that's a low level question caused by the high level
> design
> >     decision. So I'd think first focus on the high level - which is the
> 1:1
> >     mapping of entity to scheduler instance proposal.
> >
> >     Fundamentally it will be up to the DRM maintainers and the community
> to
> >     bless your approach. And it is important to stress 1:1 is about
> >     userspace contexts, so I believe unlike any other current scheduler
> >     user. And also important to stress this effectively does not make Xe
> >     _really_ use the scheduler that much.
> >
> >
> > I don't think this makes Xe nearly as much of a one-off as you think it
> > does.  I've already told the Asahi team working on Apple M1/2 hardware
> > to do it this way and it seems to be a pretty good mapping for them. I
> > believe this is roughly the plan for nouveau as well.  It's not the way
> > it currently works for anyone because most other groups aren't doing FW
> > scheduling yet.  In the world of FW scheduling and hardware designed to
> > support userspace direct-to-FW submit, I think the design makes perfect
> > sense (see below) and I expect we'll see more drivers move in this
> > direction as those drivers evolve.  (AMD is doing some customish thing
> > for how with gpu_scheduler on the front-end somehow. I've not dug into
> > those details.)
> >
> >     I can only offer my opinion, which is that the two options mentioned
> in
> >     this thread (either improve drm scheduler to cope with what is
> >     required,
> >     or split up the code so you can use just the parts of drm_sched which
> >     you want - which is frontend dependency tracking) shouldn't be so
> >     readily dismissed, given how I think the idea was for the new driver
> to
> >     work less in a silo and more in the community (not do kludges to
> >     workaround stuff because it is thought to be too hard to improve
> common
> >     code), but fundamentally, "goto previous paragraph" for what I am
> >     concerned.
> >
> >
> > Meta comment:  It appears as if you're falling into the standard i915
> > team trap of having an internal discussion about what the community
> > discussion might look like instead of actually having the community
> > discussion.  If you are seriously concerned about interactions with
> > other drivers or whether or setting common direction, the right way to
> > do that is to break a patch or two out into a separate RFC series and
> > tag a handful of driver maintainers.  Trying to predict the questions
> > other people might ask is pointless. Cc them and asking for their input
> > instead.
>
> I don't follow you here. It's not an internal discussion - I am raising
> my concerns on the design publicly. I am supposed to write a patch to
> show something, but am allowed to comment on a RFC series?
>

I may have misread your tone a bit.  It felt a bit like too many
discussions I've had in the past where people are trying to predict what
others will say instead of just asking them.  Reading it again, I was
probably jumping to conclusions a bit.  Sorry about that.


> It is "drm/sched: Convert drm scheduler to use a work queue rather than
> kthread" which should have Cc-ed _everyone_ who use drm scheduler.
>

Yeah, it probably should have.  I think that's mostly what I've been trying
to say.


> >
> >     Regards,
> >
> >     Tvrtko
> >
> >     P.S. And as a related side note, there are more areas where drm_sched
> >     could be improved, like for instance priority handling.
> >     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> >     get_sched_entity to see how msm works around the drm_sched hardcoded
> >     limit of available priority levels, in order to avoid having to
> leave a
> >     hw capability unused. I suspect msm would be happier if they could
> have
> >     all priority levels equal in terms of whether they apply only at the
> >     frontend level or completely throughout the pipeline.
> >
> >      > [1]
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >
> >      >
> >      >>> What would be interesting to learn is whether the option of
> >     refactoring
> >      >>> drm_sched to deal with out of order completion was considered
> >     and what were
> >      >>> the conclusions.
> >      >>>
> >      >>
> >      >> I coded this up a while back when trying to convert the i915 to
> >     the DRM
> >      >> scheduler it isn't all that hard either. The free flow control
> >     on the
> >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
> >     really what
> >      >> sold me on the this design.
> >
> >
> > You're not the only one to suggest supporting out-of-order completion.
> > However, it's tricky and breaks a lot of internal assumptions of the
> > scheduler. It also reduces functionality a bit because it can no longer
> > automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
> > yes, it probably could but it becomes a substantially harder problem.)
> >
> > It also seems like a worse mapping to me.  The goal here is to turn
> > submissions on a userspace-facing engine/queue into submissions to a FW
> > queue submissions, sorting out any dma_fence dependencies.  Matt's
> > description of saying this is a 1:1 mapping between sched/entity doesn't
> > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
> > Why is that better?
>
> As I have stated before, what I think what would fit well for Xe is one
> drm_scheduler per engine class. In specific terms on our current
> hardware, one drm scheduler instance for render, compute, blitter, video
> and video enhance. Userspace contexts remain scheduler entities.
>

And this is where we fairly strongly disagree.  More in a bit.


> That way you avoid the whole kthread/kworker story and you have it
> actually use the entity picking code in the scheduler, which may be
> useful when the backend is congested.
>

What back-end congestion are you referring to here?  Running out of FW
queue IDs?  Something else?


> Yes you have to solve the out of order problem so in my mind that is
> something to discuss. What the problem actually is (just TDR?), how
> tricky and why etc.
>
> And yes you lose the handy LRCA ring buffer size management so you'd
> have to make those entities not runnable in some other way.
>
> Regarding the argument you raise below - would any of that make the
> frontend / backend separation worse and why? Do you think it is less
> natural? If neither is true then all remains is that it appears extra
> work to support out of order completion of entities has been discounted
> in favour of an easy but IMO inelegant option.
>

Broadly speaking, the kernel needs to stop thinking about GPU scheduling in
terms of scheduling jobs and start thinking in terms of scheduling
contexts/engines.  There is still some need for scheduling individual jobs
but that is only for the purpose of delaying them as needed to resolve
dma_fence dependencies.  Once dependencies are resolved, they get shoved
onto the context/engine queue and from there the kernel only really manages
whole contexts/engines.  This is a major architectural shift, entirely
different from the way i915 scheduling works.  It's also different from the
historical usage of DRM scheduler which I think is why this all looks a bit
funny.

To justify this architectural shift, let's look at where we're headed.  In
the glorious future...

 1. Userspace submits directly to firmware queues.  The kernel has no
visibility whatsoever into individual jobs.  At most it can pause/resume FW
contexts as needed to handle eviction and memory management.

 2. Because of 1, apart from handing out the FW queue IDs at the beginning,
the kernel can't really juggle them that much.  Depending on FW design, it
may be able to pause a client, give its IDs to another, and then resume it
later when IDs free up.  What it's not doing is juggling IDs on a
job-by-job basis like i915 currently is.

 3. Long-running compute jobs may not complete for days.  This means that
memory management needs to happen in terms of pause/resume of entire
contexts/engines using the memory rather than based on waiting for
individual jobs to complete or pausing individual jobs until the memory is
available.

 4. Synchronization happens via userspace memory fences (UMF) and the
kernel is mostly unaware of most dependencies and when a context/engine is
or is not runnable.  Instead, it keeps as many of them minimally active
(memory is available, even if it's in system RAM) as possible and lets the
FW sort out dependencies.  (There may need to be some facility for sleeping
a context until a memory change similar to futex() or poll() for userspace
threads.  There are some details TBD.)

Are there potential problems that will need to be solved here?  Yes.  Is it
a good design?  Well, Microsoft has been living in this future for half a
decade or better and it's working quite well for them.  It's also the way
all modern game consoles work.  It really is just Linux that's stuck with
the same old job model we've had since the monumental shift to DRI2.

To that end, one of the core goals of the Xe project was to make the driver
internally behave as close to the above model as possible while keeping the
old-school job model as a very thin layer on top.  As the broader ecosystem
problems (window-system support for UMF, for instance) are solved, that
layer can be peeled back.  The core driver will already be ready for it.

To that end, the point of the DRM scheduler in Xe isn't to schedule jobs.
It's to resolve syncobj and dma-buf implicit sync dependencies and stuff
jobs into their respective context/engine queue once they're ready.  All
the actual scheduling happens in firmware and any scheduling the kernel
does to deal with contention, oversubscriptions, too many contexts, etc. is
between contexts/engines, not individual jobs.  Sure, the individual job
visibility is nice, but if we design around it, we'll never get to the
glorious future.

I really need to turn the above (with a bit more detail) into a blog
post.... Maybe I'll do that this week.

In any case, I hope that provides more insight into why Xe is designed the
way it is and why I'm pushing back so hard on trying to make it more of a
"classic" driver as far as scheduling is concerned.  Are there potential
problems here?  Yes, that's why Xe has been labeled a prototype.  Are such
radical changes necessary to get to said glorious future?  Yes, I think
they are.  Will it be worth it?  I believe so.

> There are two places where this 1:1:1 mapping is causing problems:
> >
> >   1. It creates lots of kthreads. This is what this patch is trying to
> > solve. IDK if it's solving it the best way but that's the goal.
> >
> >   2. There are a far more limited number of communication queues between
> > the kernel and GuC for more meta things like pausing and resuming
> > queues, getting events back from GuC, etc. Unless we're in a weird
> > pressure scenario, the amount of traffic on this queue should be low so
> > we can probably just have one per physical device.  The vast majority of
> > kernel -> GuC communication should be on the individual FW queue rings
> > and maybe smashing in-memory doorbells.
>
> I don't follow your terminology here. I suppose you are talking about
> global GuC CT and context ringbuffers. If so then isn't "far more
> limited" actually one?
>

I thought there could be more than one but I think in practice it's just
the one.

--Jason



> Regards,
>
> Tvrtko
>
> > Doing out-of-order completion sort-of solves the 1 but does nothing for
> > 2 and actually makes managing FW queues harder because we no longer have
> > built-in rate limiting.  Seems like a net loss to me.
> >
> >      >>> Second option perhaps to split out the drm_sched code into
> >     parts which would
> >      >>> lend themselves more to "pick and choose" of its
> functionalities.
> >      >>> Specifically, Xe wants frontend dependency tracking, but not
> >     any scheduling
> >      >>> really (neither least busy drm_sched, neither FIFO/RQ entity
> >     picking), so
> >      >>> even having all these data structures in memory is a waste.
> >      >>>
> >      >>
> >      >> I don't think that we are wasting memory is a very good argument
> for
> >      >> making intrusive changes to the DRM scheduler.
> >
> >
> > Worse than that, I think the "we could split it up" kind-of misses the
> > point of the way Xe is using drm/scheduler.  It's not just about
> > re-using a tiny bit of dependency tracking code.  Using the scheduler in
> > this way provides a clean separation between front-end and back-end.
> > The job of the userspace-facing ioctl code is to shove things on the
> > scheduler.  The job of the run_job callback is to encode the job into
> > the FW queue format, stick it in the FW queue ring, and maybe smash a
> > doorbell.  Everything else happens in terms of managing those queues
> > side-band.  The gpu_scheduler code manages the front-end queues and Xe
> > manages the FW queues via the Kernel <-> GuC communication rings.  From
> > a high level, this is a really clean design.  There are potentially some
> > sticky bits around the dual-use of dma_fence for scheduling and memory
> > management but none of those are solved by breaking the DRM scheduler
> > into chunks or getting rid of the 1:1:1 mapping.
> >
> > If we split it out, we're basically asking the driver to implement a
> > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.
> > It may not be all that much code but also, why?  To save a few bytes of
> > memory per engine?  Each engine already has 32K(ish) worth of context
> > state and a similar size ring to communicate with the FW.  No one is
> > going to notice an extra CPU data structure.
> >
> > I'm not seeing a solid argument against the 1:1:1 design here other than
> > that it doesn't seem like the way DRM scheduler was intended to be
> > used.  I won't argue that.  It's not.  But it is a fairly natural way to
> > take advantage of the benefits the DRM scheduler does provide while also
> > mapping it to hardware that was designed for userspace direct-to-FW
> submit.
> >
> > --Jason
> >
> >      >>> With the first option then the end result could be drm_sched
> >     per engine
> >      >>> class (hardware view), which I think fits with the GuC model.
> >     Give all
> >      >>> schedulable contexts (entities) to the GuC and then mostly
> >     forget about
> >      >>> them. Timeslicing and re-ordering and all happens transparently
> >     to the
> >      >>> kernel from that point until completion.
> >      >>>
> >      >>
> >      >> Out-of-order problem still exists here.
> >      >>
> >      >>> Or with the second option you would build on some smaller
> >     refactored
> >      >>> sub-components of drm_sched, by maybe splitting the dependency
> >     tracking from
> >      >>> scheduling (RR/FIFO entity picking code).
> >      >>>
> >      >>> Second option is especially a bit vague and I haven't thought
> >     about the
> >      >>> required mechanics, but it just appeared too obvious the
> >     proposed design has
> >      >>> a bit too much impedance mismatch.
> >      >>>
> >      >>
> >      >> IMO ROI on this is low and again lets see what Boris comes up
> with.
> >      >>
> >      >> Matt
> >      >>
> >      >>> Oh and as a side note, when I went into the drm_sched code base
> >     to remind
> >      >>> myself how things worked, it is quite easy to find some FIXME
> >     comments which
> >      >>> suggest people working on it are unsure of locking design there
> >     and such. So
> >      >>> perhaps that all needs cleanup too, I mean would benefit from
> >      >>> refactoring/improving work as brainstormed above anyway.
> >      >>>
> >      >>> Regards,
> >      >>>
> >      >>> Tvrtko
> >
>

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 12:19                       ` Tvrtko Ursulin
@ 2023-01-10 15:55                         ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-10 15:55 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, Jason Ekstrand

On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> > >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> > >     contexts (per
> > >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > >     invasive changes
> > >      >>> to the shared code is in the spirit of the overall idea and
> > > instead
> > >      >>> opportunity should be used to look at way to refactor/improve
> > >     drm_sched.
> > > 
> > > 
> > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > or really needs to drive a re-factor.  (More on that later.) 
> > > There's only one real issue which is that it fires off potentially a
> > > lot of kthreads. Even that's not that bad given that kthreads are
> > > pretty light and you're not likely to have more kthreads than
> > > userspace threads which are much heavier.  Not ideal, but not the
> > > end of the world either.  Definitely something we can/should
> > > optimize but if we went through with Xe without this patch, it would
> > > probably be mostly ok.
> > > 
> > >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >      >>
> > >      >> I'm not really prepared to make large changes to DRM scheduler
> > >     at the
> > >      >> moment for Xe as they are not really required nor does Boris
> > >     seem they
> > >      >> will be required for his work either. I am interested to see
> > >     what Boris
> > >      >> comes up with.
> > >      >>
> > >      >>> Even on the low level, the idea to replace drm_sched threads
> > >     with workers
> > >      >>> has a few problems.
> > >      >>>
> > >      >>> To start with, the pattern of:
> > >      >>>
> > >      >>>    while (not_stopped) {
> > >      >>>     keep picking jobs
> > >      >>>    }
> > >      >>>
> > >      >>> Feels fundamentally in disagreement with workers (while
> > >     obviously fits
> > >      >>> perfectly with the current kthread design).
> > >      >>
> > >      >> The while loop breaks and the worker exits if no jobs are ready.
> > > 
> > > 
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work
> > > item which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > because it is a worst case / theoretical, and I think due more
> > fundamental concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps
> > expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many
> > cycles. That means maintaining just enough concurrency to prevent work
> > processing from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all queued
> > work items (more interesting when mixing up with the system_unbound_wq),
> > in the proposed design the number of instantiated worker threads does
> > not correspond to the number of user threads (as you have elsewhere
> > stated), but pessimistically to the number of active user contexts. That
> > is the number which drives the maximum number of not-runnable jobs that
> > can become runnable at once, and hence spawn that many work items, and
> > in turn unbound worker threads.
> > 
> > Several problems there.
> > 
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
> 
> To make matters worse, if I follow the code correctly, all these per user
> context worker thread / work items end up contending on the same lock or
> circular buffer, both are one instance per GPU:
> 
> guc_engine_run_job
>  -> submit_engine
>     a) wq_item_append
>         -> wq_wait_for_space
>           -> msleep

a) is dedicated per xe_engine.

Also, you missed the step of programming the ring, which is likewise dedicated per xe_engine.
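
To illustrate, here is a minimal sketch of the per-engine ring flow
control that wq_wait_for_space implies, where the job limit falls out
of ring size / max job size. All names and sizes are made up for the
example; this is not the actual Xe code:

#include <linux/types.h>

/* Made-up sizes, for illustration only. */
#define WQ_RING_SIZE	0x1000	/* per-engine FW work queue ring */
#define MAX_JOB_SIZE	64	/* worst-case size of one wq item */

/* Per-engine ring state: head is advanced by the FW as items are
 * consumed, tail by the kernel as items are appended, so flow
 * control is plain ring-space accounting. */
struct wq_sketch {
	u32 head;
	u32 tail;
};

static u32 wq_space(const struct wq_sketch *wq)
{
	return WQ_RING_SIZE - ((wq->tail - wq->head) & (WQ_RING_SIZE - 1));
}

static bool wq_has_space(const struct wq_sketch *wq)
{
	/* Effectively caps in-flight jobs at WQ_RING_SIZE / MAX_JOB_SIZE. */
	return wq_space(wq) >= MAX_JOB_SIZE;
}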

>     b) xe_guc_ct_send
>         -> guc_ct_send
>           -> mutex_lock(&ct->lock);
>           -> later a potential msleep in h2g_has_room

Technically there is one instance per GT, not per GPU. Yes, this is
shared, but in practice there will always be space in the CT channel,
so contention on the lock should be rare.
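
For anyone following along, a sketch of that sharing, with illustrative
names only: one CT channel, and hence one lock, per GT, while the
per-engine paths above never take it until the final send.

#include <linux/mutex.h>

/* Illustrative: the only cross-engine serialisation point in the
 * submission path is the per-GT CT channel used for the H2G send. */
struct guc_ct_sketch {
	struct mutex lock;	/* serialises H2G messages on this GT */
};

struct gt_sketch {
	struct guc_ct_sketch ct;	/* one per GT, shared by engines */
};

static void ct_send_sketch(struct guc_ct_sketch *ct /* , msg... */)
{
	mutex_lock(&ct->lock);
	/* ... wait for H2G ring space (rarely needed in practice),
	 * then write the message into the CT buffer ... */
	mutex_unlock(&ct->lock);
}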

I haven't read your rather long reply yet, but also FWIW using a
workqueue was suggested by AMD (original authors of the DRM scheduler)
when we ran this design by them.

Matt 

> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 11:28                     ` Tvrtko Ursulin
@ 2023-01-10 16:39                       ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-10 16:39 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, Jason Ekstrand

On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 09/01/2023 17:27, Jason Ekstrand wrote:
> 
> [snip]
> 
> >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> >     contexts (per
> >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> >     invasive changes
> >      >>> to the shared code is in the spirit of the overall idea and instead
> >      >>> opportunity should be used to look at way to refactor/improve
> >     drm_sched.
> > 
> > 
> > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > really needs to drive a re-factor.  (More on that later.)  There's only
> > one real issue which is that it fires off potentially a lot of kthreads.
> > Even that's not that bad given that kthreads are pretty light and you're
> > not likely to have more kthreads than userspace threads which are much
> > heavier.  Not ideal, but not the end of the world either.  Definitely
> > something we can/should optimize but if we went through with Xe without
> > this patch, it would probably be mostly ok.
> > 
> >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >>
> >      >> I'm not really prepared to make large changes to DRM scheduler
> >     at the
> >      >> moment for Xe as they are not really required nor does Boris
> >     seem they
> >      >> will be required for his work either. I am interested to see
> >     what Boris
> >      >> comes up with.
> >      >>
> >      >>> Even on the low level, the idea to replace drm_sched threads
> >     with workers
> >      >>> has a few problems.
> >      >>>
> >      >>> To start with, the pattern of:
> >      >>>
> >      >>>    while (not_stopped) {
> >      >>>     keep picking jobs
> >      >>>    }
> >      >>>
> >      >>> Feels fundamentally in disagreement with workers (while
> >     obviously fits
> >      >>> perfectly with the current kthread design).
> >      >>
> >      >> The while loop breaks and the worker exits if no jobs are ready.
> > 
> > 
> > I'm not very familiar with workqueues. What are you saying would fit
> > better? One scheduling job per work item rather than one big work item
> > which handles all available jobs?
> 
> Yes and no, it indeed IMO does not fit to have a work item which is
> potentially unbound in runtime. But it is a bit moot conceptual mismatch
> because it is a worst case / theoretical, and I think due more fundamental
> concerns.
> 
> If we have to go back to the low level side of things, I've picked this
> random spot to consolidate what I have already mentioned and perhaps expand.
> 
> To start with, let me pull out some thoughts from workqueue.rst:
> 
> """
> Generally, work items are not expected to hog a CPU and consume many cycles.
> That means maintaining just enough concurrency to prevent work processing
> from stalling should be optimal.
> """
> 
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
> 
> Given the unbound queues will be spawned on demand to service all queued
> work items (more interesting when mixing up with the system_unbound_wq), in
> the proposed design the number of instantiated worker threads does not
> correspond to the number of user threads (as you have elsewhere stated), but
> pessimistically to the number of active user contexts. That is the number
> which drives the maximum number of not-runnable jobs that can become
> runnable at once, and hence spawn that many work items, and in turn unbound
> worker threads.
> 
> Several problems there.
> 
> It is fundamentally pointless to have potentially that many more threads
> than the number of CPU cores - it simply creates a scheduling storm.
> 

We can use a different work queue if this is an issue; there is a FIXME
which indicates we should allow the user to pass in the work queue.
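
To make the FIXME concrete, here is a rough sketch (illustrative names
only, not an actual patch) of accepting an optional caller-supplied
workqueue at init time and falling back to the system unbound one:

#include <linux/workqueue.h>

struct sched_sketch {
	struct work_struct submit_work;
	struct workqueue_struct *submit_wq;
};

static void sched_sketch_submit(struct work_struct *w)
{
	/* pick ready jobs, call run_job() on each, then return;
	 * the worker goes idle when no jobs are ready */
}

/* @submit_wq: optional; pass a dedicated (e.g. ordered) workqueue to
 * bound concurrency, or NULL to use system_unbound_wq. */
static void sched_sketch_init(struct sched_sketch *s,
			      struct workqueue_struct *submit_wq)
{
	s->submit_wq = submit_wq ?: system_unbound_wq;
	INIT_WORK(&s->submit_work, sched_sketch_submit);
}

static void sched_sketch_wakeup(struct sched_sketch *s)
{
	queue_work(s->submit_wq, &s->submit_work);
}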

> Unbound workers have no CPU / cache locality either and no connection with
> the CPU scheduler to optimize scheduling patterns. This may matter either on
> large systems or on small ones. Whereas the current design allows for
> scheduler to notice userspace CPU thread keeps waking up the same drm
> scheduler kernel thread, and so it can keep them on the same CPU, the
> unbound workers lose that ability and so 2nd CPU might be getting woken up
> from low sleep for every submission.
>

I guess I don't understand kthread vs. workqueue scheduling internals.
 
> Hence, apart from being a bit of an impedance mismatch, the proposal has the
> potential to change performance and power patterns on both large and small
> machines.
>

We are going to have to test this out, I suppose, and play around to see
if this design has any real-world impacts. As Jason said, yeah, we will
probably need a bit of help here from others. Will CC relevant parties
on the next rev.
 
> >      >>> Secondly, it probably demands separate workers (not optional),
> >     otherwise
> >      >>> behaviour of shared workqueues has either the potential to
> >     explode number
> >      >>> kernel threads anyway, or add latency.
> >      >>>
> >      >>
> >      >> Right now the system_unbound_wq is used which does have a limit
> >     on the
> >      >> number of threads, right? I do have a FIXME to allow a worker to be
> >      >> passed in similar to TDR.
> >      >>
> >      >> WRT to latency, the 1:1 ratio could actually have lower latency
> >     as 2 GPU
> >      >> schedulers can be pushing jobs into the backend / cleaning up
> >     jobs in
> >      >> parallel.
> >      >>
> >      >
> >      > Thought of one more point here where why in Xe we absolutely want
> >     a 1 to
> >      > 1 ratio between entity and scheduler - the way we implement
> >     timeslicing
> >      > for preempt fences.
> >      >
> >      > Let me try to explain.
> >      >
> >      > Preempt fences are implemented via the generic messaging
> >     interface [1]
> >      > with suspend / resume messages. If a suspend message is received too
> >      > soon after calling resume (this is per entity) we simply sleep in the
> >      > suspend call thus giving the entity a timeslice. This completely
> >     falls
> >      > apart with a many to 1 relationship as now an entity waiting for a
> >      > timeslice blocks the other entities. Could we work around this,
> >     sure but
> >      > just another bunch of code we'd have to add in Xe. Being able to
> >     freely sleep
> >      > in backend without affecting other entities is really, really
> >     nice IMO
> >      > and I bet Xe isn't the only driver that is going to feel this way.
> >      >
> >      > Last thing I'll say regardless of how anyone feels about Xe using
> >     a 1 to
> >      > 1 relationship this patch IMO makes sense as I hope we can all
> >     agree a
> >      > workqueue scales better than kthreads.
> > 
> >     I don't know for sure what will scale better and for what use case,
> >     combination of CPU cores vs number of GPU engines to keep busy vs other
> >     system activity. But I wager someone is bound to ask for some
> >     numbers to
> >     make sure proposal is not negatively affecting any other drivers.
> > 
> > 
> > Then let them ask.  Waving your hands vaguely in the direction of the
> > rest of DRM and saying "Uh, someone (not me) might object" is profoundly
> > unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
> > think there's someone in particular who might have a useful opinion on
> > this, throw them in the CC so they don't miss the e-mail thread.
> > 
> > Or are you asking for numbers?  If so, what numbers are you asking for?
> 
> It was a heads up to the Xe team in case people weren't appreciating how the
> proposed change has the potential to influence power and performance across the
> board. And nothing in the follow up discussion made me think it was
> considered so I don't think it was redundant to raise it.
> 
> In my experience it is typical that such core changes come with some
> numbers. Which is in case of drm scheduler is tricky and probably requires
> explicitly asking everyone to test (rather than count on "don't miss the
> email thread"). Real products can fail to ship due ten mW here or there.
> Like suddenly an extra core prevented from getting into deep sleep.
> 
> If that was "profoundly unhelpful" so be it.
> 
> > Also, If we're talking about a design that might paint us into an
> > Intel-HW-specific hole, that would be one thing.  But we're not.  We're
> > talking about switching which kernel threading/task mechanism to use for
> > what's really a very generic problem.  The core Xe design works without
> > this patch (just with more kthreads).  If we land this patch or
> > something like it and get it wrong and it causes a performance problem
> > for someone down the line, we can revisit it.
> 
> For some definition of "it works" - I really wouldn't suggest shipping a
> kthread per user context at any point.
>

Yeah, this is why using a workqueue rather than a kthread was suggested
to me by AMD. I should've put a Suggested-by on the commit message; I
need to dig through my emails and figure out who exactly suggested this.
 
> >     In any case that's a low level question caused by the high level design
> >     decision. So I'd think first focus on the high level - which is the 1:1
> >     mapping of entity to scheduler instance proposal.
> > 
> >     Fundamentally it will be up to the DRM maintainers and the community to
> >     bless your approach. And it is important to stress 1:1 is about
> >     userspace contexts, so I believe unlike any other current scheduler
> >     user. And also important to stress this effectively does not make Xe
> >     _really_ use the scheduler that much.
> > 
> > 
> > I don't think this makes Xe nearly as much of a one-off as you think it
> > does.  I've already told the Asahi team working on Apple M1/2 hardware
> > to do it this way and it seems to be a pretty good mapping for them. I
> > believe this is roughly the plan for nouveau as well.  It's not the way
> > it currently works for anyone because most other groups aren't doing FW
> > scheduling yet.  In the world of FW scheduling and hardware designed to
> > support userspace direct-to-FW submit, I think the design makes perfect
> > sense (see below) and I expect we'll see more drivers move in this
> > direction as those drivers evolve.  (AMD is doing some customish thing
> > for how with gpu_scheduler on the front-end somehow. I've not dug into
> > those details.)
> > 
> >     I can only offer my opinion, which is that the two options mentioned in
> >     this thread (either improve drm scheduler to cope with what is
> >     required,
> >     or split up the code so you can use just the parts of drm_sched which
> >     you want - which is frontend dependency tracking) shouldn't be so
> >     readily dismissed, given how I think the idea was for the new driver to
> >     work less in a silo and more in the community (not do kludges to
> >     workaround stuff because it is thought to be too hard to improve common
> >     code), but fundamentally, "goto previous paragraph" for what I am
> >     concerned.
> > 
> > 
> > Meta comment:  It appears as if you're falling into the standard i915
> > team trap of having an internal discussion about what the community
> > discussion might look like instead of actually having the community
> > discussion.  If you are seriously concerned about interactions with
> > other drivers or whether or setting common direction, the right way to
> > do that is to break a patch or two out into a separate RFC series and
> > tag a handful of driver maintainers.  Trying to predict the questions
> > other people might ask is pointless. Cc them and asking for their input
> > instead.
> 
> I don't follow you here. It's not an internal discussion - I am raising my
> concerns on the design publicly. Am I supposed to write a patch to show
> something, or am I allowed to comment on an RFC series?
> 
> It is "drm/sched: Convert drm scheduler to use a work queue rather than
> kthread" which should have Cc-ed _everyone_ who use drm scheduler.
>

Yeah, will do on the next rev.
 
> > 
> >     Regards,
> > 
> >     Tvrtko
> > 
> >     P.S. And as a related side note, there are more areas where drm_sched
> >     could be improved, like for instance priority handling.
> >     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> >     get_sched_entity to see how msm works around the drm_sched hardcoded
> >     limit of available priority levels, in order to avoid having to leave a
> >     hw capability unused. I suspect msm would be happier if they could have
> >     all priority levels equal in terms of whether they apply only at the
> >     frontend level or completely throughout the pipeline.
> > 
> >      > [1]
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> >      >
> >      >>> What would be interesting to learn is whether the option of
> >     refactoring
> >      >>> drm_sched to deal with out of order completion was considered
> >     and what were
> >      >>> the conclusions.
> >      >>>
> >      >>
> >      >> I coded this up a while back when trying to convert the i915 to
> >     the DRM
> >      >> scheduler it isn't all that hard either. The free flow control
> >     on the
> >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
> >     really what
> >      >> sold me on the this design.
> > 
> > 
> > You're not the only one to suggest supporting out-of-order completion.
> > However, it's tricky and breaks a lot of internal assumptions of the
> > scheduler. It also reduces functionality a bit because it can no longer
> > automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
> > yes, it probably could but it becomes a substantially harder problem.)
> > 
> > It also seems like a worse mapping to me.  The goal here is to turn
> > submissions on a userspace-facing engine/queue into submissions to a FW
> > queue submissions, sorting out any dma_fence dependencies.  Matt's
> > description of saying this is a 1:1 mapping between sched/entity doesn't
> > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
> > Why is that better?
> 
> As I have stated before, what I think would fit well for Xe is one
> drm_scheduler per engine class. In specific terms on our current hardware,
> one drm scheduler instance for render, compute, blitter, video and video
> enhance. Userspace contexts remain scheduler entities.
>

I disagree.
 
> That way you avoid the whole kthread/kworker story and you have it actually
> use the entity picking code in the scheduler, which may be useful when the
> backend is congested.
>

In practice the backend shouldn't be congested, but if it is, a mutex
provides fairness, probably better than a shared scheduler would. Also,
what you are suggesting doesn't make sense at all, as the congestion is
per-GT; if anything, we should use 1 scheduler per GT, not per engine
class.
 
> Yes you have to solve the out of order problem so in my mind that is
> something to discuss. What the problem actually is (just TDR?), how tricky
> and why etc.
>

Cleanup of jobs, TDR, replaying jobs, etc. It has a decent amount of
impact.
 
> And yes you lose the handy LRCA ring buffer size management so you'd have to
> make those entities not runnable in some other way.
>

We also lose our preempt fence implementation. Again, I don't see how
the design you are suggesting is a win.
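
For reference, the timeslicing trick being defended here, sketched with
made-up names and values (the real implementation is in the generic
messaging interface patch linked earlier in the thread). The point is
the msleep: it is only harmless because the worker serves exactly one
entity.

#include <linux/jiffies.h>
#include <linux/delay.h>

#define MIN_TIMESLICE_MS 1	/* illustrative value */

struct entity_sketch {
	unsigned long last_resume;	/* jiffies at last resume message */
};

static void handle_resume_msg(struct entity_sketch *e)
{
	e->last_resume = jiffies;
	/* ... ask GuC to schedule the context ... */
}

static void handle_suspend_msg(struct entity_sketch *e)
{
	unsigned long end = e->last_resume +
			    msecs_to_jiffies(MIN_TIMESLICE_MS);

	/* Suspend arrived too soon after resume: sleep to give the
	 * entity a timeslice.  Only safe because this worker serves
	 * exactly one entity (the 1:1 relationship). */
	if (time_before(jiffies, end))
		msleep(jiffies_to_msecs(end - jiffies));
	/* ... ask GuC to deschedule the context ... */
}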
 
> Regarding the argument you raise below - would any of that make the frontend
> / backend separation worse and why? Do you think it is less natural? If
> neither is true then all remains is that it appears extra work to support
> out of order completion of entities has been discounted in favour of an easy
> but IMO inelegant option.
> 
> > There are two places where this 1:1:1 mapping is causing problems:
> > 
> >   1. It creates lots of kthreads. This is what this patch is trying to
> > solve. IDK if it's solving it the best way but that's the goal.
> > 
> >   2. There are a far more limited number of communication queues between
> > the kernel and GuC for more meta things like pausing and resuming
> > queues, getting events back from GuC, etc. Unless we're in a weird
> > pressure scenario, the amount of traffic on this queue should be low so
> > we can probably just have one per physical device.  The vast majority of
> > kernel -> GuC communication should be on the individual FW queue rings
> > and maybe smashing in-memory doorbells.
> 
> I don't follow your terminology here. I suppose you are talking about global
> GuC CT and context ringbuffers. If so then isn't "far more limited" actually
> one?
> 

We have 1 GuC CT per GT.

Matt

> Regards,
> 
> Tvrtko
> 
> > Doing out-of-order completion sort-of solves the 1 but does nothing for
> > 2 and actually makes managing FW queues harder because we no longer have
> > built-in rate limiting.  Seems like a net loss to me.
> > 
> >      >>> Second option perhaps to split out the drm_sched code into
> >     parts which would
> >      >>> lend themselves more to "pick and choose" of its functionalities.
> >      >>> Specifically, Xe wants frontend dependency tracking, but not
> >     any scheduling
> >      >>> really (neither least busy drm_sched, neither FIFO/RQ entity
> >     picking), so
> >      >>> even having all these data structures in memory is a waste.
> >      >>>
> >      >>
> >      >> I don't think that we are wasting memory is a very good argument for
> >      >> making intrusive changes to the DRM scheduler.
> > 
> > 
> > Worse than that, I think the "we could split it up" kind-of misses the
> > point of the way Xe is using drm/scheduler.  It's not just about
> > re-using a tiny bit of dependency tracking code.  Using the scheduler in
> > this way provides a clean separation between front-end and back-end.
> > The job of the userspace-facing ioctl code is to shove things on the
> > scheduler.  The job of the run_job callback is to encode the job into
> > the FW queue format, stick it in the FW queue ring, and maybe smash a
> > doorbell.  Everything else happens in terms of managing those queues
> > side-band.  The gpu_scheduler code manages the front-end queues and Xe
> > manages the FW queues via the Kernel <-> GuC communication rings.  From
> > a high level, this is a really clean design.  There are potentially some
> > sticky bits around the dual-use of dma_fence for scheduling and memory
> > management but none of those are solved by breaking the DRM scheduler
> > into chunks or getting rid of the 1:1:1 mapping.
> > 
> > If we split it out, we're basically asking the driver to implement a
> > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.
> > It may not be all that much code but also, why?  To save a few bytes of
> > memory per engine?  Each engine already has 32K(ish) worth of context
> > state and a similar size ring to communicate with the FW.  No one is
> > going to notice an extra CPU data structure.
> > 
> > I'm not seeing a solid argument against the 1:1:1 design here other than
> > that it doesn't seem like the way DRM scheduler was intended to be
> > used.  I won't argue that.  It's not.  But it is a fairly natural way to
> > take advantage of the benefits the DRM scheduler does provide while also
> > mapping it to hardware that was designed for userspace direct-to-FW
> > submit.
> > 
> > --Jason
> > 
> >      >>> With the first option then the end result could be drm_sched
> >     per engine
> >      >>> class (hardware view), which I think fits with the GuC model.
> >     Give all
> >      >>> schedulable contexts (entities) to the GuC and then mostly
> >     forget about
> >      >>> them. Timeslicing and re-ordering and all happens transparently
> >     to the
> >      >>> kernel from that point until completion.
> >      >>>
> >      >>
> >      >> Out-of-order problem still exists here.
> >      >>
> >      >>> Or with the second option you would build on some smaller
> >     refactored
> >      >>> sub-components of drm_sched, by maybe splitting the dependency
> >     tracking from
> >      >>> scheduling (RR/FIFO entity picking code).
> >      >>>
> >      >>> Second option is especially a bit vague and I haven't thought
> >     about the
> >      >>> required mechanics, but it just appeared too obvious the
> >     proposed design has
> >      >>> a bit too much impedance mismatch.
> >      >>>
> >      >>
> >      >> IMO ROI on this is low and again lets see what Boris comes up with.
> >      >>
> >      >> Matt
> >      >>
> >      >>> Oh and as a side note, when I went into the drm_sched code base
> >     to remind
> >      >>> myself how things worked, it is quite easy to find some FIXME
> >     comments which
> >      >>> suggest people working on it are unsure of locking design there
> >     and such. So
> >      >>> perhaps that all needs cleanup too, I mean would benefit from
> >      >>> refactoring/improving work as brainstormed above anyway.
> >      >>>
> >      >>> Regards,
> >      >>>
> >      >>> Tvrtko
> > 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 15:55                         ` Matthew Brost
@ 2023-01-10 16:50                           ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-10 16:50 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, Jason Ekstrand


On 10/01/2023 15:55, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>>
>> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>
>>> [snip]
>>>
>>>>       >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>      contexts (per
>>>>       >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>>>      invasive changes
>>>>       >>> to the shared code is in the spirit of the overall idea and
>>>> instead
>>>>       >>> opportunity should be used to look at way to refactor/improve
>>>>      drm_sched.
>>>>
>>>>
>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all
>>>> or really needs to drive a re-factor.  (More on that later.)
>>>> There's only one real issue which is that it fires off potentially a
>>>> lot of kthreads. Even that's not that bad given that kthreads are
>>>> pretty light and you're not likely to have more kthreads than
>>>> userspace threads which are much heavier.  Not ideal, but not the
>>>> end of the world either.  Definitely something we can/should
>>>> optimize but if we went through with Xe without this patch, it would
>>>> probably be mostly ok.
>>>>
>>>>       >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>       >>
>>>>       >> I'm not really prepared to make large changes to DRM scheduler
>>>>      at the
>>>>       >> moment for Xe as they are not really required nor does Boris
>>>>      seem they
>>>>       >> will be required for his work either. I am interested to see
>>>>      what Boris
>>>>       >> comes up with.
>>>>       >>
>>>>       >>> Even on the low level, the idea to replace drm_sched threads
>>>>      with workers
>>>>       >>> has a few problems.
>>>>       >>>
>>>>       >>> To start with, the pattern of:
>>>>       >>>
>>>>       >>>    while (not_stopped) {
>>>>       >>>     keep picking jobs
>>>>       >>>    }
>>>>       >>>
>>>>       >>> Feels fundamentally in disagreement with workers (while
>>>>      obviously fits
>>>>       >>> perfectly with the current kthread design).
>>>>       >>
>>>>       >> The while loop breaks and worker exits if no jobs are ready.
>>>>
>>>>
>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>> better? One scheduling job per work item rather than one big work
>>>> item which handles all available jobs?
>>>
>>> Yes and no, it indeed IMO does not fit to have a work item which is
>>> potentially unbound in runtime. But it is a bit of a moot conceptual
>>> mismatch because it is a worst case / theoretical, and I think due to
>>> more fundamental concerns.
>>>
>>> If we have to go back to the low level side of things, I've picked this
>>> random spot to consolidate what I have already mentioned and perhaps
>>> expand.
>>>
>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>
>>> """
>>> Generally, work items are not expected to hog a CPU and consume many
>>> cycles. That means maintaining just enough concurrency to prevent work
>>> processing from stalling should be optimal.
>>> """
>>>
>>> For unbound queues:
>>> """
>>> The responsibility of regulating concurrency level is on the users.
>>> """
>>>
>>> Given the unbound queues will be spawned on demand to service all queued
>>> work items (more interesting when mixing up with the system_unbound_wq),
>>> in the proposed design the number of instantiated worker threads does
>>> not correspond to the number of user threads (as you have elsewhere
>>> stated), but pessimistically to the number of active user contexts. That
>>> is the number which drives the maximum number of not-runnable jobs that
>>> can become runnable at once, and hence spawn that many work items, and
>>> in turn unbound worker threads.
>>>
>>> Several problems there.
>>>
>>> It is fundamentally pointless to have potentially that many more threads
>>> than the number of CPU cores - it simply creates a scheduling storm.
>>
>> To make matters worse, if I follow the code correctly, all these per user
>> context worker thread / work items end up contending on the same lock or
>> circular buffer, both are one instance per GPU:
>>
>> guc_engine_run_job
>>   -> submit_engine
>>      a) wq_item_append
>>          -> wq_wait_for_space
>>            -> msleep
> 
> a) is dedicated per xe_engine

Hah true, what is it for then? I thought throttling the LRCA ring is done via:

   drm_sched_init(&ge->sched, &drm_sched_ops,
		 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,

Is there something more to throttle other than the ring? It is 
throttling something using msleeps..
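
For reference, my reading of that snippet is plain ring flow control,
which boils down to roughly this (illustrative helper, not the actual Xe
code):

   static u32 job_limit_sketch(u32 ring_size, u32 max_job_bytes)
   {
           /* drm_sched keeps at most this many jobs in flight, so the
            * LRC ring can never be overrun by construction. */
           return ring_size / max_job_bytes;
   }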

> Also you missed the step of programming the ring which is dedicated per xe_engine

I was trying to quickly find places which serialize on something in the 
backend, ringbuffer emission did not seem to do that but maybe I missed 
something.

> 
>>      b) xe_guc_ct_send
>>          -> guc_ct_send
>>            -> mutex_lock(&ct->lock);
>>            -> later a potential msleep in h2g_has_room
> 
> Technically there is 1 instance per GT not GPU, yes this is shared but
> in practice there will always be space in the CT channel so contention
> on the lock should be rare.

Yeah I used the term GPU to be more understandable to an outside audience.

I am somewhat disappointed that the Xe opportunity hasn't been used to 
improve upon the CT communication bottlenecks. I mean those backoff 
sleeps and lock contention. I wish there would be a single thread in 
charge of the CT channel and internal users (other parts of the driver) 
would be able to send their requests to it in a more efficient manner, 
with less lock contention and centralized backoff.
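
Something along the lines of a single consumer kthread owning the ring
and producers only queueing requests, say (made-up names other than
h2g_has_room, a sketch of the idea rather than a design proposal):

   struct ct_req {
           struct llist_node node;
           u32 len;
           u32 msg[];
   };

   static int ct_dispatcher(void *data)
   {
           struct ct_channel *ct = data;           /* made-up type */

           while (!kthread_should_stop()) {
                   struct llist_node *batch = llist_del_all(&ct->pending);
                   struct ct_req *req, *tmp;

                   /* llist returns newest-first; a real version would
                    * reverse the batch to preserve submission order. */
                   llist_for_each_entry_safe(req, tmp, batch, node) {
                           while (!h2g_has_room(ct, req->len))
                                   msleep(1);      /* backoff in ONE place */
                           h2g_write(ct, req->msg, req->len); /* made-up */
                           kfree(req);
                   }

                   wait_event_interruptible(ct->wait,
                                            !llist_empty(&ct->pending) ||
                                            kthread_should_stop());
           }

           return 0;
   }

Producers would then just llist_add() their request and wake the thread,
rather than each taking the channel mutex and sleeping on backoff
themselves.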

> I haven't read your rather long reply yet, but also FWIW using a
> workqueue was suggested by AMD (original authors of the DRM scheduler)
> when we ran this design by them.

Commit message says nothing about that. ;)

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 16:50                           ` Tvrtko Ursulin
@ 2023-01-10 19:01                             ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-10 19:01 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 15:55, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > >       >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > >      contexts (per
> > > > >       >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > >      invasive changes
> > > > >       >>> to the shared code is in the spirit of the overall idea and
> > > > > instead
> > > > >       >>> opportunity should be used to look at way to refactor/improve
> > > > >      drm_sched.
> > > > > 
> > > > > 
> > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > There's only one real issue which is that it fires off potentially a
> > > > > lot of kthreads. Even that's not that bad given that kthreads are
> > > > > pretty light and you're not likely to have more kthreads than
> > > > > userspace threads which are much heavier.  Not ideal, but not the
> > > > > end of the world either.  Definitely something we can/should
> > > > > optimize but if we went through with Xe without this patch, it would
> > > > > probably be mostly ok.
> > > > > 
> > > > >       >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > >       >>
> > > > >       >> I'm not really prepared to make large changes to DRM scheduler
> > > > >      at the
> > > > >       >> moment for Xe as they are not really required nor does Boris
> > > > >      seem they
> > > > >       >> will be required for his work either. I am interested to see
> > > > >      what Boris
> > > > >       >> comes up with.
> > > > >       >>
> > > > >       >>> Even on the low level, the idea to replace drm_sched threads
> > > > >      with workers
> > > > >       >>> has a few problems.
> > > > >       >>>
> > > > >       >>> To start with, the pattern of:
> > > > >       >>>
> > > > >       >>>    while (not_stopped) {
> > > > >       >>>     keep picking jobs
> > > > >       >>>    }
> > > > >       >>>
> > > > >       >>> Feels fundamentally in disagreement with workers (while
> > > > >      obviously fits
> > > > >       >>> perfectly with the current kthread design).
> > > > >       >>
> > > > >       >> The while loop breaks and worker exits if no jobs are ready.
> > > > > 
> > > > > 
> > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > better? One scheduling job per work item rather than one big work
> > > > > item which handles all available jobs?
> > > > 
> > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > potentially unbound in runtime. But it is a bit of a moot conceptual
> > > > mismatch because it is a worst case / theoretical, and I think due to
> > > > more fundamental concerns.
> > > > 
> > > > If we have to go back to the low level side of things, I've picked this
> > > > random spot to consolidate what I have already mentioned and perhaps
> > > > expand.
> > > > 
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > 
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume many
> > > > cycles. That means maintaining just enough concurrency to prevent work
> > > > processing from stalling should be optimal.
> > > > """
> > > > 
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the users.
> > > > """
> > > > 
> > > > Given the unbound queues will be spawned on demand to service all queued
> > > > work items (more interesting when mixing up with the system_unbound_wq),
> > > > in the proposed design the number of instantiated worker threads does
> > > > not correspond to the number of user threads (as you have elsewhere
> > > > stated), but pessimistically to the number of active user contexts. That
> > > > is the number which drives the maximum number of not-runnable jobs that
> > > > can become runnable at once, and hence spawn that many work items, and
> > > > in turn unbound worker threads.
> > > > 
> > > > Several problems there.
> > > > 
> > > > It is fundamentally pointless to have potentially that many more threads
> > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > 
> > > To make matters worse, if I follow the code correctly, all these per user
> > > context worker thread / work items end up contending on the same lock or
> > > circular buffer, both are one instance per GPU:
> > > 
> > > guc_engine_run_job
> > >   -> submit_engine
> > >      a) wq_item_append
> > >          -> wq_wait_for_space
> > >            -> msleep
> > 
> > a) is dedicated per xe_engine
> 
> Hah true, what is it for then? I thought throttling the LRCA ring is done via:
> 

This is a per guc_id 'work queue' which is used for parallel submission
(e.g. multiple LRC tail values need to be written atomically by the GuC).
Again in practice there should always be space.
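
For the curious, the item is roughly of this shape (field names made up
here; this is not the actual GuC ABI):

   struct wq_item_sketch {
           u32 header;          /* item type / length                  */
           u32 guc_id;          /* context the item applies to         */
           u32 lrc_tail[4];     /* new ring tail for each parallel LRC;
                                 * the GuC consumes the whole item
                                 * atomically                          */
   };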

>   drm_sched_init(&ge->sched, &drm_sched_ops,
> 		 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,
> 
> Is there something more to throttle other than the ring? It is throttling
> something using msleeps..
> 
> > Also you missed the step of programming the ring which is dedicated per xe_engine
> 
> I was trying to quickly find places which serialize on something in the
> backend, ringbuffer emission did not seem to do that but maybe I missed
> something.
>

xe_ring_ops vfunc emit_job is called to write the ring.
 
> > 
> > >      b) xe_guc_ct_send
> > >          -> guc_ct_send
> > >            -> mutex_lock(&ct->lock);
> > >            -> later a potential msleep in h2g_has_room
> > 
> > Technically there is 1 instance per GT not GPU, yes this is shared but
> > in practice there will always be space in the CT channel so contention
> > on the lock should be rare.
> 
> Yeah I used the term GPU to be more understandable to an outside audience.
> 
> I am somewhat disappointed that the Xe opportunity hasn't been used to
> improve upon the CT communication bottlenecks. I mean those backoff sleeps
> and lock contention. I wish there would be a single thread in charge of the
> CT channel and internal users (other parts of the driver) would be able to
> send their requests to it in a more efficient manner, with less lock
> contention and centralized backoff.
>

Well the CT backend was more or less a complete rewrite. Mutexes
actually work rather well to ensure fairness compared to the spin locks
used in the i915. This code was pretty heavily reviewed by Daniel and
both of us landed on a big mutex for all of the CT code, compared to the
3 or 4 spin locks used in the i915.
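
The send path boils down to roughly this shape (simplified; helper names
other than h2g_has_room are made up):

   static int ct_send_sketch(struct ct_channel *ct, const u32 *msg, u32 len)
   {
           int ret;

           mutex_lock(&ct->lock);          /* one lock, whole channel */
           while (!h2g_has_room(ct, len))
                   msleep(1);              /* rare in practice */
           ret = h2g_write(ct, msg, len);  /* made-up helper */
           mutex_unlock(&ct->lock);

           return ret;
   }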
 
> > I haven't read your rather long reply yet, but also FWIW using a
> > workqueue was suggested by AMD (original authors of the DRM scheduler)
> > when we ran this design by them.
> 
> Commit message says nothing about that. ;)
>

Yea I missed that, will fix in the next rev. Just dug through my emails
and Christian suggested a work queue and Andrey also gave some input on
the DRM scheduler design.

Also in the next rev I will likely update the run_wq to be passed in by
the user.

Matt

> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 16:39                       ` Matthew Brost
@ 2023-01-11  1:13                         ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11  1:13 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, Jason Ekstrand, dri-devel

On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> > >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> > >     contexts (per
> > >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > >     invasive changes
> > >      >>> to the shared code is in the spirit of the overall idea and instead
> > >      >>> opportunity should be used to look at way to refactor/improve
> > >     drm_sched.
> > > 
> > > 
> > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > really needs to drive a re-factor.  (More on that later.)  There's only
> > > one real issue which is that it fires off potentially a lot of kthreads.
> > > Even that's not that bad given that kthreads are pretty light and you're
> > > not likely to have more kthreads than userspace threads which are much
> > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > something we can/should optimize but if we went through with Xe without
> > > this patch, it would probably be mostly ok.
> > > 
> > >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >      >>
> > >      >> I'm not really prepared to make large changes to DRM scheduler
> > >     at the
> > >      >> moment for Xe as they are not really required nor does Boris
> > >     seem they
> > >      >> will be required for his work either. I am interested to see
> > >     what Boris
> > >      >> comes up with.
> > >      >>
> > >      >>> Even on the low level, the idea to replace drm_sched threads
> > >     with workers
> > >      >>> has a few problems.
> > >      >>>
> > >      >>> To start with, the pattern of:
> > >      >>>
> > >      >>>    while (not_stopped) {
> > >      >>>     keep picking jobs
> > >      >>>    }
> > >      >>>
> > >      >>> Feels fundamentally in disagreement with workers (while
> > >     obviously fits
> > >      >>> perfectly with the current kthread design).
> > >      >>
> > >      >> The while loop breaks and worker exits if no jobs are ready.
> > > 
> > > 
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work item
> > > which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit of a moot conceptual
> > mismatch because it is a worst case / theoretical, and I think due to more
> > fundamental concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many cycles.
> > That means maintaining just enough concurrency to prevent work processing
> > from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all queued
> > work items (more interesting when mixing up with the system_unbound_wq), in
> > the proposed design the number of instantiated worker threads does not
> > correspond to the number of user threads (as you have elsewhere stated), but
> > pessimistically to the number of active user contexts. That is the number
> > which drives the maximum number of not-runnable jobs that can become
> > runnable at once, and hence spawn that many work items, and in turn unbound
> > worker threads.
> > 
> > Several problems there.
> > 
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
> > 
> 
> We can use a different work queue if this is an issue; I have a FIXME
> which indicates we should allow the user to pass in the work queue.
> 
> > Unbound workers have no CPU / cache locality either and no connection with
> > the CPU scheduler to optimize scheduling patterns. This may matter either on
> > large systems or on small ones. Whereas the current design allows for
> > scheduler to notice userspace CPU thread keeps waking up the same drm
> > scheduler kernel thread, and so it can keep them on the same CPU, the
> > unbound workers lose that ability and so 2nd CPU might be getting woken up
> > from low sleep for every submission.
> >
> 
> I guess I don't understand kthread vs. workqueue scheduling internals.
>  

Looked into this and we are not using unbound workers; rather we are just
using the system_wq, which is indeed bound. Again we can change this so a
user can pass in a workqueue too. After doing a bit of research, bound
workers allow the scheduler to use locality to avoid the exact problem
you are describing.
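
To spell out the difference (real workqueue API, trivial example; in
real code you would obviously pick one queue, not both):

   static void submit_work_fn(struct work_struct *w)
   {
           /* push one job to the backend */
   }
   static DECLARE_WORK(submit_work, submit_work_fn);

   static int example(void)
   {
           struct workqueue_struct *wq;

           /* Bound (what the patch does today): system_wq runs the
            * work on the CPU that queued it, keeping cache locality
            * with the submitting user thread. */
           queue_work(system_wq, &submit_work);

           /* Unbound alternative: workers may run on any CPU, with
            * concurrency managed by the workqueue core instead. */
           wq = alloc_workqueue("xe-submit", WQ_UNBOUND, 0);
           if (!wq)
                   return -ENOMEM;
           queue_work(wq, &submit_work);

           return 0;
   }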

TL;DR I'm not buying any of these arguments although it is possible I am
missing something.

Matt 

> > Hence, apart from being a bit of an impedance mismatch, the proposal has the
> > potential to change performance and power patterns on both large and small
> > machines.
> >
> 
> We are going to have to test this out I suppose and play around to see
> if this design has any real world impacts. As Jason said, yea probably
> will need a bit of help here from others. Will CC relevant parties on
> next rev. 
>  
> > >      >>> Secondly, it probably demands separate workers (not optional),
> > >     otherwise
> > >      >>> behaviour of shared workqueues has either the potential to
> > >     explode number
> > >      >>> kernel threads anyway, or add latency.
> > >      >>>
> > >      >>
> > >      >> Right now the system_unbound_wq is used which does have a limit
> > >     on the
> > >      >> number of threads, right? I do have a FIXME to allow a worker to be
> > >      >> passed in similar to TDR.
> > >      >>
> > >      >> WRT to latency, the 1:1 ratio could actually have lower latency
> > >     as 2 GPU
> > >      >> schedulers can be pushing jobs into the backend / cleaning up
> > >     jobs in
> > >      >> parallel.
> > >      >>
> > >      >
> > >      > Thought of one more point here on why in Xe we absolutely want
> > >     a 1 to
> > >      > 1 ratio between entity and scheduler - the way we implement
> > >     timeslicing
> > >      > for preempt fences.
> > >      >
> > >      > Let me try to explain.
> > >      >
> > >      > Preempt fences are implemented via the generic messaging
> > >     interface [1]
> > >      > with suspend / resume messages. If a suspend message is received too
> > >      > soon after calling resume (this is per entity) we simply sleep in the
> > >      > suspend call thus giving the entity a timeslice. This completely
> > >     falls
> > >      > apart with a many to 1 relationship as now an entity waiting for a
> > >      > timeslice blocks the other entities. Could we work around this,
> > >     sure but
> > >      > just another bunch of code we'd have to add in Xe. Being able to
> > >     freely sleep
> > >      > in the backend without affecting other entities is really, really
> > >     nice IMO
> > >      > and I bet Xe isn't the only driver that is going to feel this way.
> > >      >
> > >      > Last thing I'll say: regardless of how anyone feels about Xe using
> > >     a 1 to
> > >      > 1 relationship, this patch IMO makes sense, as I hope we can all
> > >     agree a
> > >      > workqueue scales better than kthreads.
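
To illustrate the trick being described (made-up names, heavily
simplified):

   static void engine_suspend_sketch(struct engine_sketch *e)
   {
           s64 ms = ktime_ms_delta(ktime_get(), e->resume_time);

           /* Owning the worker outright means we can simply sleep
            * here to give the context a minimum timeslice, without
            * blocking any other entity. */
           if (ms < MIN_TIMESLICE_MS)              /* made-up constant */
                   msleep(MIN_TIMESLICE_MS - ms);

           send_suspend_to_guc(e);                 /* made-up helper */
   }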
> > > 
> > >     I don't know for sure what will scale better and for what use case,
> > >     combination of CPU cores vs number of GPU engines to keep busy vs other
> > >     system activity. But I wager someone is bound to ask for some
> > >     numbers to
> > >     make sure proposal is not negatively affecting any other drivers.
> > > 
> > > 
> > > Then let them ask.  Waving your hands vaguely in the direction of the
> > > rest of DRM and saying "Uh, someone (not me) might object" is profoundly
> > > unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
> > > think there's someone in particular who might have a useful opinion on
> > > this, throw them in the CC so they don't miss the e-mail thread.
> > > 
> > > Or are you asking for numbers?  If so, what numbers are you asking for?
> > 
> > It was a heads up to the Xe team in case people weren't appreciating how the
> > proposed change has the potential to influence power and performance across the
> > board. And nothing in the follow up discussion made me think it was
> > considered so I don't think it was redundant to raise it.
> > 
> > In my experience it is typical that such core changes come with some
> > numbers. Which is in case of drm scheduler is tricky and probably requires
> > explicitly asking everyone to test (rather than count on "don't miss the
> > email thread"). Real products can fail to ship due to ten mW here or there.
> > Like suddenly an extra core prevented from getting into deep sleep.
> > 
> > If that was "profoundly unhelpful" so be it.
> > 
> > > Also, If we're talking about a design that might paint us into an
> > > Intel-HW-specific hole, that would be one thing.  But we're not.  We're
> > > talking about switching which kernel threading/task mechanism to use for
> > > what's really a very generic problem.  The core Xe design works without
> > > this patch (just with more kthreads).  If we land this patch or
> > > something like it and get it wrong and it causes a performance problem
> > > for someone down the line, we can revisit it.
> > 
> > For some definition of "it works" - I really wouldn't suggest shipping a
> > kthread per user context at any point.
> >
> 
> Yea, this is why using a workqueue rather than a kthread was suggested
> to me by AMD. I should've put a suggested by on the commit message, need
> to dig through my emails and figure out who exactly suggested this.
>  
> > >     In any case that's a low level question caused by the high level design
> > >     decision. So I'd think first focus on the high level - which is the 1:1
> > >     mapping of entity to scheduler instance proposal.
> > > 
> > >     Fundamentally it will be up to the DRM maintainers and the community to
> > >     bless your approach. And it is important to stress 1:1 is about
> > >     userspace contexts, so I believe unlike any other current scheduler
> > >     user. And also important to stress this effectively does not make Xe
> > >     _really_ use the scheduler that much.
> > > 
> > > 
> > > I don't think this makes Xe nearly as much of a one-off as you think it
> > > does.  I've already told the Asahi team working on Apple M1/2 hardware
> > > to do it this way and it seems to be a pretty good mapping for them. I
> > > believe this is roughly the plan for nouveau as well.  It's not the way
> > > it currently works for anyone because most other groups aren't doing FW
> > > scheduling yet.  In the world of FW scheduling and hardware designed to
> > > support userspace direct-to-FW submit, I think the design makes perfect
> > > sense (see below) and I expect we'll see more drivers move in this
> > > direction as those drivers evolve.  (AMD is doing some customish thing
> > > with gpu_scheduler on the front-end somehow. I've not dug into
> > > those details.)
> > > 
> > >     I can only offer my opinion, which is that the two options mentioned in
> > >     this thread (either improve drm scheduler to cope with what is
> > >     required,
> > >     or split up the code so you can use just the parts of drm_sched which
> > >     you want - which is frontend dependency tracking) shouldn't be so
> > >     readily dismissed, given how I think the idea was for the new driver to
> > >     work less in a silo and more in the community (not do kludges to
> > >     workaround stuff because it is thought to be too hard to improve common
> > >     code), but fundamentally, "goto previous paragraph" for what I am
> > >     concerned.
> > > 
> > > 
> > > Meta comment:  It appears as if you're falling into the standard i915
> > > team trap of having an internal discussion about what the community
> > > discussion might look like instead of actually having the community
> > > discussion.  If you are seriously concerned about interactions with
> > > other drivers or whether or setting common direction, the right way to
> > > do that is to break a patch or two out into a separate RFC series and
> > > tag a handful of driver maintainers.  Trying to predict the questions
> > > other people might ask is pointless. Cc them and asking for their input
> > > instead.
> > 
> > I don't follow you here. It's not an internal discussion - I am raising my
> > concerns on the design publicly. Am I supposed to write a patch to show
> > something, but not allowed to comment on a RFC series?
> > 
> > It is "drm/sched: Convert drm scheduler to use a work queue rather than
> > kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> >
> 
> Yea, will do on next rev.
>  
> > > 
> > >     Regards,
> > > 
> > >     Tvrtko
> > > 
> > >     P.S. And as a related side note, there are more areas where drm_sched
> > >     could be improved, like for instance priority handling.
> > >     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> > >     get_sched_entity to see how msm works around the drm_sched hardcoded
> > >     limit of available priority levels, in order to avoid having to leave a
> > >     hw capability unused. I suspect msm would be happier if they could have
> > >     all priority levels equal in terms of whether they apply only at the
> > >     frontend level or completely throughout the pipeline.
> > > 
> > >      > [1]
> > >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> > >      >
> > >      >>> What would be interesting to learn is whether the option of
> > >     refactoring
> > >      >>> drm_sched to deal with out of order completion was considered
> > >     and what were
> > >      >>> the conclusions.
> > >      >>>
> > >      >>
> > >      >> I coded this up a while back when trying to convert the i915 to
> > >     the DRM
> > >      >> scheduler; it isn't all that hard either. The free flow control
> > >     on the
> > >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
> > >     really what
> > >      >> sold me on the this design.
> > > 
> > > 
> > > You're not the only one to suggest supporting out-of-order completion.
> > > However, it's tricky and breaks a lot of internal assumptions of the
> > > scheduler. It also reduces functionality a bit because it can no longer
> > > automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
> > > yes, it probably could but it becomes a substantially harder problem.)
> > > 
> > > It also seems like a worse mapping to me.  The goal here is to turn
> > > submissions on a userspace-facing engine/queue into submissions to a FW
> > > queue, sorting out any dma_fence dependencies.  Matt's
> > > description of saying this is a 1:1 mapping between sched/entity doesn't
> > > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
> > > Why is that better?
> > 
> > As I have stated before, what I think what would fit well for Xe is one
> > drm_scheduler per engine class. In specific terms on our current hardware,
> > one drm scheduler instance for render, compute, blitter, video and video
> > enhance. Userspace contexts remain scheduler entities.
> >
> 
> I disagree.
>  
> > That way you avoid the whole kthread/kworker story and you have it actually
> > use the entity picking code in the scheduler, which may be useful when the
> > backend is congested.
> >
> 
> In practice the backend shouldn't be congested, but if it is, a mutex
> provides fairness, probably better than using a shared scheduler. Also
> what you are suggesting doesn't make sense at all as the congestion is
> per-GT, so if anything we should use 1 scheduler per-GT not per engine
> class.
>  
> > Yes you have to solve the out of order problem so in my mind that is
> > something to discuss. What the problem actually is (just TDR?), how tricky
> > and why etc.
> >
> 
> Cleanup of jobs, TDR, replaying jobs, etc... It has a decent amount of
> impact.
>  
> > And yes you lose the handy LRCA ring buffer size management so you'd have to
> > make those entities not runnable in some other way.
> >
> 
> Also we lose our preempt fence implementation too. Again I don't see how
> the design you are suggesting is a win.
>  
> > Regarding the argument you raise below - would any of that make the frontend
> > / backend separation worse and why? Do you think it is less natural? If
> > neither is true then all remains is that it appears extra work to support
> > out of order completion of entities has been discounted in favour of an easy
> > but IMO inelegant option.
> > 
> > > There are two places where this 1:1:1 mapping is causing problems:
> > > 
> > >   1. It creates lots of kthreads. This is what this patch is trying to
> > > solve. IDK if it's solving it the best way but that's the goal.
> > > 
> > >   2. There are a far more limited number of communication queues between
> > > the kernel and GuC for more meta things like pausing and resuming
> > > queues, getting events back from GuC, etc. Unless we're in a weird
> > > pressure scenario, the amount of traffic on this queue should be low so
> > > we can probably just have one per physical device.  The vast majority of
> > > kernel -> GuC communication should be on the individual FW queue rings
> > > and maybe smashing in-memory doorbells.
> > 
> > I don't follow your terminology here. I suppose you are talking about global
> > GuC CT and context ringbuffers. If so then isn't "far more limited" actually
> > one?
> > 
> 
> We have 1 GuC CT per GT.
> 
> Matt
> 
> > Regards,
> > 
> > Tvrtko
> > 
> > > Doing out-of-order completion sort-of solves the 1 but does nothing for
> > > 2 and actually makes managing FW queues harder because we no longer have
> > > built-in rate limiting.  Seems like a net loss to me.
> > > 
> > >      >>> Second option perhaps to split out the drm_sched code into
> > >     parts which would
> > >      >>> lend themselves more to "pick and choose" of its functionalities.
> > >      >>> Specifically, Xe wants frontend dependency tracking, but not
> > >     any scheduling
> > >      >>> really (neither least busy drm_sched, neither FIFO/RQ entity
> > >     picking), so
> > >      >>> even having all these data structures in memory is a waste.
> > >      >>>
> > >      >>
> > >      >> I don't think that we are wasting memory is a very good argument for
> > >      >> making intrusive changes to the DRM scheduler.
> > > 
> > > 
> > > Worse than that, I think the "we could split it up" kind-of misses the
> > > point of the way Xe is using drm/scheduler.  It's not just about
> > > re-using a tiny bit of dependency tracking code.  Using the scheduler in
> > > this way provides a clean separation between front-end and back-end.
> > > The job of the userspace-facing ioctl code is to shove things on the
> > > scheduler.  The job of the run_job callback is to encode the job into
> > > the FW queue format, stick it in the FW queue ring, and maybe smash a
> > > doorbell.  Everything else happens in terms of managing those queues
> > > side-band.  The gpu_scheduler code manages the front-end queues and Xe
> > > manages the FW queues via the Kernel <-> GuC communication rings.  From
> > > a high level, this is a really clean design.  There are potentially some
> > > sticky bits around the dual-use of dma_fence for scheduling and memory
> > > management but none of those are solved by breaking the DRM scheduler
> > > into chunks or getting rid of the 1:1:1 mapping.
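
As a sketch, the backend half of that split is just (made-up names,
simplified; the run_job signature is the real drm_sched_backend_ops one):

   static struct dma_fence *run_job_sketch(struct drm_sched_job *job)
   {
           struct fw_job_sketch *fjob = to_fw_job_sketch(job); /* made-up */

           emit_job_to_fw_ring(fjob);     /* encode into FW queue format */
           ring_doorbell(fjob->queue);    /* or an H2G message           */

           return dma_fence_get(fjob->hw_fence);
   }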
> > > 
> > > If we split it out, we're basically asking the driver to implement a
> > > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.
> > > It may not be all that much code but also, why?  To save a few bytes of
> > > memory per engine?  Each engine already has 32K(ish) worth of context
> > > state and a similar size ring to communicate with the FW.  No one is
> > > going to notice an extra CPU data structure.
> > > 
> > > I'm not seeing a solid argument against the 1:1:1 design here other than
> > > that it doesn't seem like the way DRM scheduler was intended to be
> > > used.  I won't argue that.  It's not.  But it is a fairly natural way to
> > > take advantage of the benefits the DRM scheduler does provide while also
> > > mapping it to hardware that was designed for userspace direct-to-FW
> > > submit.
> > > 
> > > --Jason
> > > 
> > >      >>> With the first option then the end result could be drm_sched
> > >     per engine
> > >      >>> class (hardware view), which I think fits with the GuC model.
> > >     Give all
> > >      >>> schedulable contexts (entities) to the GuC and then mostly
> > >     forget about
> > >      >>> them. Timeslicing and re-ordering and all happens transparently
> > >     to the
> > >      >>> kernel from that point until completion.
> > >      >>>
> > >      >>
> > >      >> Out-of-order problem still exists here.
> > >      >>
> > >      >>> Or with the second option you would build on some smaller
> > >     refactored
> > >      >>> sub-components of drm_sched, by maybe splitting the dependency
> > >     tracking from
> > >      >>> scheduling (RR/FIFO entity picking code).
> > >      >>>
> > >      >>> Second option is especially a bit vague and I haven't thought
> > >     about the
> > >      >>> required mechanics, but it just appeared too obvious the
> > >     proposed design has
> > >      >>> a bit too much impedance mismatch.
> > >      >>>
> > >      >>
> > >      >> IMO ROI on this is low and again let's see what Boris comes up with.
> > >      >>
> > >      >> Matt
> > >      >>
> > >      >>> Oh and as a side note, when I went into the drm_sched code base
> > >     to remind
> > >      >>> myself how things worked, it is quite easy to find some FIXME
> > >     comments which
> > >      >>> suggest people working on it are unsure of locking design there
> > >     and such. So
> > >      >>> perhaps that all needs cleanup too, I mean would benefit from
> > >      >>> refactoring/improving work as brainstormed above anyway.
> > >      >>>
> > >      >>> Regards,
> > >      >>>
> > >      >>> Tvrtko
> > > 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-11  1:13                         ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11  1:13 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> > >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> > >     contexts (per
> > >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > >     invasive changes
> > >      >>> to the shared code is in the spirit of the overall idea and instead
> > >      >>> opportunity should be used to look at way to refactor/improve
> > >     drm_sched.
> > > 
> > > 
> > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > really needs to drive a re-factor.  (More on that later.)  There's only
> > > one real issue which is that it fires off potentially a lot of kthreads.
> > > Even that's not that bad given that kthreads are pretty light and you're
> > > not likely to have more kthreads than userspace threads which are much
> > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > something we can/should optimize but if we went through with Xe without
> > > this patch, it would probably be mostly ok.
> > > 
> > >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >      >>
> > >      >> I'm not really prepared to make large changes to DRM scheduler
> > >     at the
> > >      >> moment for Xe as they are not really required nor does Boris
> > >     seem they
> > >      >> will be required for his work either. I am interested to see
> > >     what Boris
> > >      >> comes up with.
> > >      >>
> > >      >>> Even on the low level, the idea to replace drm_sched threads
> > >     with workers
> > >      >>> has a few problems.
> > >      >>>
> > >      >>> To start with, the pattern of:
> > >      >>>
> > >      >>>    while (not_stopped) {
> > >      >>>     keep picking jobs
> > >      >>>    }
> > >      >>>
> > >      >>> Feels fundamentally in disagreement with workers (while
> > >     obviously fits
> > >      >>> perfectly with the current kthread design).
> > >      >>
> > >      >> The while loop breaks and worker exists if no jobs are ready.
> > > 
> > > 
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work item
> > > which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > because it is a worst case / theoretical, and I think due more fundamental
> > concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many cycles.
> > That means maintaining just enough concurrency to prevent work processing
> > from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all queued
> > work items (more interesting when mixing up with the system_unbound_wq), in
> > the proposed design the number of instantiated worker threads does not
> > correspond to the number of user threads (as you have elsewhere stated), but
> > pessimistically to the number of active user contexts. That is the number
> > which drives the maximum number of not-runnable jobs that can become
> > runnable at once, and hence spawn that many work items, and in turn unbound
> > worker threads.
> > 
> > Several problems there.
> > 
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
> > 
> 
> We can use a different work queue if this is an issue, have a FIXME
> which indicates we should allow the user to pass in the work queue.
> 
> > Unbound workers have no CPU / cache locality either and no connection with
> > the CPU scheduler to optimize scheduling patterns. This may matter either on
> > large systems or on small ones. Whereas the current design allows for
> > scheduler to notice userspace CPU thread keeps waking up the same drm
> > scheduler kernel thread, and so it can keep them on the same CPU, the
> > unbound workers lose that ability and so 2nd CPU might be getting woken up
> > from low sleep for every submission.
> >
> 
> I guess I don't understand kthread vs. workqueue scheduling internals.
>  

Looked into this and we are not using unbound workers rather we are just
using the system_wq which is indeed bound. Again we can change this so a
user can just pass in worker too. After doing a of research bound
workers allows the scheduler to use locality too avoid that exact
problem your reading.

TL;DR I'm not buying any of these arguments although it is possible I am
missing something.

Matt 

> > Hence, apart from being a bit of a impedance mismatch, the proposal has the
> > potential to change performance and power patterns and both large and small
> > machines.
> >
> 
> We are going to have to test this out I suppose and play around to see
> if this design has any real world impacts. As Jason said, yea probably
> will need a bit of help here from others. Will CC relavent parties on
> next rev. 
>  
> > >      >>> Secondly, it probably demands separate workers (not optional),
> > >     otherwise
> > >      >>> behaviour of shared workqueues has either the potential to
> > >     explode number
> > >      >>> kernel threads anyway, or add latency.
> > >      >>>
> > >      >>
> > >      >> Right now the system_unbound_wq is used which does have a limit
> > >     on the
> > >      >> number of threads, right? I do have a FIXME to allow a worker to be
> > >      >> passed in similar to TDR.
> > >      >>
> > >      >> WRT to latency, the 1:1 ratio could actually have lower latency
> > >     as 2 GPU
> > >      >> schedulers can be pushing jobs into the backend / cleaning up
> > >     jobs in
> > >      >> parallel.
> > >      >>
> > >      >
> > >      > Thought of one more point here where why in Xe we absolutely want
> > >     a 1 to
> > >      > 1 ratio between entity and scheduler - the way we implement
> > >     timeslicing
> > >      > for preempt fences.
> > >      >
> > >      > Let me try to explain.
> > >      >
> > >      > Preempt fences are implemented via the generic messaging
> > >     interface [1]
> > >      > with suspend / resume messages. If a suspend message is received too
> > >      > soon after calling resume (this is per entity) we simply sleep in the
> > >      > suspend call thus giving the entity a timeslice. This completely
> > >     falls
> > >      > apart with a many to 1 relationship as now an entity waiting for a
> > >      > timeslice blocks the other entities. Could we work around this,
> > >     sure but
> > >      > just another bunch of code we'd have to add in Xe. Being able to
> > >     freely sleep
> > >      > in the backend without affecting other entities is really, really
> > >     nice IMO
> > >      > and I bet Xe isn't the only driver that is going to feel this way.
> > >      >
> > >      > Last thing I'll say regardless of how anyone feels about Xe using
> > >     a 1 to
> > >      > 1 relationship this patch IMO makes sense as I hope we can all
> > >     agree a
> > >      > workqueue scales better than kthreads.
> > > 
> > >     I don't know for sure what will scale better and for what use case,
> > >     combination of CPU cores vs number of GPU engines to keep busy vs other
> > >     system activity. But I wager someone is bound to ask for some
> > >     numbers to
> > >     make sure proposal is not negatively affecting any other drivers.
> > > 
> > > 
> > > Then let them ask.  Waving your hands vaguely in the direction of the
> > > rest of DRM and saying "Uh, someone (not me) might object" is profoundly
> > > unhelpful.  Sure, someone might.  That's why it's on dri-devel.  If you
> > > think there's someone in particular who might have a useful opinion on
> > > this, throw them in the CC so they don't miss the e-mail thread.
> > > 
> > > Or are you asking for numbers?  If so, what numbers are you asking for?
> > 
> > It was a heads up to the Xe team in case people weren't appreciating how the
> > proposed change has the potential to influence power and performance across the
> > board. And nothing in the follow up discussion made me think it was
> > considered so I don't think it was redundant to raise it.
> > 
> > In my experience it is typical that such core changes come with some
> > numbers. Which in the case of the drm scheduler is tricky and probably requires
> > explicitly asking everyone to test (rather than count on "don't miss the
> > email thread"). Real products can fail to ship due to ten mW here or there.
> > Like suddenly an extra core prevented from getting into deep sleep.
> > 
> > If that was "profoundly unhelpful" so be it.
> > 
> > > Also, if we're talking about a design that might paint us into an
> > > Intel-HW-specific hole, that would be one thing.  But we're not.  We're
> > > talking about switching which kernel threading/task mechanism to use for
> > > what's really a very generic problem.  The core Xe design works without
> > > this patch (just with more kthreads).  If we land this patch or
> > > something like it and get it wrong and it causes a performance problem
> > > for someone down the line, we can revisit it.
> > 
> > For some definition of "it works" - I really wouldn't suggest shipping a
> > kthread per user context at any point.
> >
> 
> Yea, this is why using a workqueue rather than a kthread was suggested
> to me by AMD. I should've put a Suggested-by on the commit message; need
> to dig through my emails and figure out who exactly suggested this.
>  
> > >     In any case that's a low level question caused by the high level design
> > >     decision. So I'd think first focus on the high level - which is the 1:1
> > >     mapping of entity to scheduler instance proposal.
> > > 
> > >     Fundamentally it will be up to the DRM maintainers and the community to
> > >     bless your approach. And it is important to stress 1:1 is about
> > >     userspace contexts, so I believe unlike any other current scheduler
> > >     user. And also important to stress this effectively does not make Xe
> > >     _really_ use the scheduler that much.
> > > 
> > > 
> > > I don't think this makes Xe nearly as much of a one-off as you think it
> > > does.  I've already told the Asahi team working on Apple M1/2 hardware
> > > to do it this way and it seems to be a pretty good mapping for them. I
> > > believe this is roughly the plan for nouveau as well.  It's not the way
> > > it currently works for anyone because most other groups aren't doing FW
> > > scheduling yet.  In the world of FW scheduling and hardware designed to
> > > support userspace direct-to-FW submit, I think the design makes perfect
> > > sense (see below) and I expect we'll see more drivers move in this
> > > direction as those drivers evolve.  (AMD is doing some customish thing
> > > for now with gpu_scheduler on the front-end somehow. I've not dug into
> > > those details.)
> > > 
> > >     I can only offer my opinion, which is that the two options mentioned in
> > >     this thread (either improve drm scheduler to cope with what is
> > >     required,
> > >     or split up the code so you can use just the parts of drm_sched which
> > >     you want - which is frontend dependency tracking) shouldn't be so
> > >     readily dismissed, given how I think the idea was for the new driver to
> > >     work less in a silo and more in the community (not do kludges to
> > >     workaround stuff because it is thought to be too hard to improve common
> > >     code), but fundamentally, "goto previous paragraph" as far as I am
> > >     concerned.
> > > 
> > > 
> > > Meta comment:  It appears as if you're falling into the standard i915
> > > team trap of having an internal discussion about what the community
> > > discussion might look like instead of actually having the community
> > > discussion.  If you are seriously concerned about interactions with
> > > other drivers or with setting common direction, the right way to
> > > do that is to break a patch or two out into a separate RFC series and
> > > tag a handful of driver maintainers.  Trying to predict the questions
> > > other people might ask is pointless. Cc them and ask for their input
> > > instead.
> > 
> > I don't follow you here. It's not an internal discussion - I am raising my
> > concerns on the design publicly. Am I supposed to write a patch to show
> > something before being allowed to comment on an RFC series?
> > 
> > It is "drm/sched: Convert drm scheduler to use a work queue rather than
> > kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> >
> 
> Yea, will do on next rev.
>  
> > > 
> > >     Regards,
> > > 
> > >     Tvrtko
> > > 
> > >     P.S. And as a related side note, there are more areas where drm_sched
> > >     could be improved, like for instance priority handling.
> > >     Take a look at msm_submitqueue_create / msm_gpu_convert_priority /
> > >     get_sched_entity to see how msm works around the drm_sched hardcoded
> > >     limit of available priority levels, in order to avoid having to leave a
> > >     hw capability unused. I suspect msm would be happier if they could have
> > >     all priority levels equal in terms of whether they apply only at the
> > >     frontend level or completely throughout the pipeline.
> > > 
> > >      > [1]
> > >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > >      >
> > >      >>> What would be interesting to learn is whether the option of
> > >     refactoring
> > >      >>> drm_sched to deal with out of order completion was considered
> > >     and what were
> > >      >>> the conclusions.
> > >      >>>
> > >      >>
> > >      >> I coded this up a while back when trying to convert the i915 to
> > >     the DRM
> > >      >> scheduler; it isn't all that hard either. The free flow control
> > >     on the
> > >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
> > >     really what
> > >      >> sold me on this design.
> > > 
> > > 
> > > You're not the only one to suggest supporting out-of-order completion.
> > > However, it's tricky and breaks a lot of internal assumptions of the
> > > scheduler. It also reduces functionality a bit because it can no longer
> > > automatically rate-limit HW/FW queues which are often fixed-size.  (Ok,
> > > yes, it probably could but it becomes a substantially harder problem.)
> > > 
> > > It also seems like a worse mapping to me.  The goal here is to turn
> > > submissions on a userspace-facing engine/queue into submissions to a FW
> > > queue, sorting out any dma_fence dependencies.  Matt's
> > > description of saying this is a 1:1 mapping between sched/entity doesn't
> > > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1 mapping?
> > > Why is that better?
> > 
> > As I have stated before, what I think would fit well for Xe is one
> > drm_scheduler per engine class. In specific terms on our current hardware,
> > one drm scheduler instance for render, compute, blitter, video and video
> > enhance. Userspace contexts remain scheduler entities.
> >
> 
> I disagree.
>  
> > That way you avoid the whole kthread/kworker story and you have it actually
> > use the entity picking code in the scheduler, which may be useful when the
> > backend is congested.
> >
> 
> In practice the backend shouldn't be congested, but if it is, a mutex
> provides fairness, probably better than using a shared scheduler. Also
> what you are suggesting doesn't make sense at all as the congestion is
> per-GT, so if anything we should use 1 scheduler per GT, not per engine
> class.
>  
> > Yes you have to solve the out of order problem so in my mind that is
> > something to discuss. What the problem actually is (just TDR?), how tricky
> > and why etc.
> >
> 
> Cleanup of jobs, TDR, replaying jobs, etc... It has a decent amount of
> impact.
>  
> > And yes you lose the handy LRCA ring buffer size management so you'd have to
> > make those entities not runnable in some other way.
> >
> 
> Also we lose our preempt fence implementation too. Again, I don't see how
> the design you are suggesting is a win.
>  
> > Regarding the argument you raise below - would any of that make the frontend
> > / backend separation worse and why? Do you think it is less natural? If
> > neither is true then all that remains is that it appears extra work to support
> > out of order completion of entities has been discounted in favour of an easy
> > but IMO inelegant option.
> > 
> > > There are two places where this 1:1:1 mapping is causing problems:
> > > 
> > >   1. It creates lots of kthreads. This is what this patch is trying to
> > > solve. IDK if it's solving it the best way but that's the goal.
> > > 
> > >   2. There are a far more limited number of communication queues between
> > > the kernel and GuC for more meta things like pausing and resuming
> > > queues, getting events back from GuC, etc. Unless we're in a weird
> > > pressure scenario, the amount of traffic on this queue should be low so
> > > we can probably just have one per physical device.  The vast majority of
> > > kernel -> GuC communication should be on the individual FW queue rings
> > > and maybe smashing in-memory doorbells.
> > 
> > I don't follow your terminology here. I suppose you are talking about global
> > GuC CT and context ringbuffers. If so then isn't "far more limited" actually
> > one?
> > 
> 
> We have 1 GuC CT per GT.
> 
> Matt
> 
> > Regards,
> > 
> > Tvrtko
> > 
> > > Doing out-of-order completion sort-of solves the 1 but does nothing for
> > > 2 and actually makes managing FW queues harder because we no longer have
> > > built-in rate limiting.  Seems like a net loss to me.
> > > 
> > >      >>> Second option perhaps to split out the drm_sched code into
> > >     parts which would
> > >      >>> lend themselves more to "pick and choose" of its functionalities.
> > >      >>> Specifically, Xe wants frontend dependency tracking, but not
> > >     any scheduling
> > >      >>> really (neither least busy drm_sched, nor FIFO/RQ entity
> > >     picking), so
> > >      >>> even having all these data structures in memory is a waste.
> > >      >>>
> > >      >>
> > >      >> I don't think that we are wasting memory is a very good argument for
> > >      >> making intrusive changes to the DRM scheduler.
> > > 
> > > 
> > > Worse than that, I think the "we could split it up" kind-of misses the
> > > point of the way Xe is using drm/scheduler.  It's not just about
> > > re-using a tiny bit of dependency tracking code.  Using the scheduler in
> > > this way provides a clean separation between front-end and back-end.
> > > The job of the userspace-facing ioctl code is to shove things on the
> > > scheduler.  The job of the run_job callback is to encode the job into
> > > the FW queue format, stick it in the FW queue ring, and maybe smash a
> > > doorbell.  Everything else happens in terms of managing those queues
> > > side-band.  The gpu_scheduler code manages the front-end queues and Xe
> > > manages the FW queues via the Kernel <-> GuC communication rings.  From
> > > a high level, this is a really clean design.  There are potentially some
> > > sticky bits around the dual-use of dma_fence for scheduling and memory
> > > management but none of those are solved by breaking the DRM scheduler
> > > into chunks or getting rid of the 1:1:1 mapping.
> > > 
> > > If we split it out, we're basically asking the driver to implement a
> > > bunch of kthread or workqueue stuff, all the ring rate-limiting, etc.
> > > It may not be all that much code but also, why?  To save a few bytes of
> > > memory per engine?  Each engine already has 32K(ish) worth of context
> > > state and a similar size ring to communicate with the FW.  No one is
> > > going to notice an extra CPU data structure.
> > > 
> > > I'm not seeing a solid argument against the 1:1:1 design here other than
> > > that it doesn't seem like the way DRM scheduler was intended to be
> > > used.  I won't argue that.  It's not.  But it is a fairly natural way to
> > > take advantage of the benefits the DRM scheduler does provide while also
> > > mapping it to hardware that was designed for userspace direct-to-FW
> > > submit.
> > > 
> > > --Jason
> > > 
> > >      >>> With the first option then the end result could be drm_sched
> > >     per engine
> > >      >>> class (hardware view), which I think fits with the GuC model.
> > >     Give all
> > >      >>> schedulable contexts (entities) to the GuC and then mostly
> > >     forget about
> > >      >>> them. Timeslicing and re-ordering and all happens transparently
> > >     to the
> > >      >>> kernel from that point until completion.
> > >      >>>
> > >      >>
> > >      >> Out-of-order problem still exists here.
> > >      >>
> > >      >>> Or with the second option you would build on some smaller
> > >     refactored
> > >      >>> sub-components of drm_sched, by maybe splitting the dependency
> > >     tracking from
> > >      >>> scheduling (RR/FIFO entity picking code).
> > >      >>>
> > >      >>> Second option is especially a bit vague and I haven't thought
> > >     about the
> > >      >>> required mechanics, but it just appeared too obvious the
> > >     proposed design has
> > >      >>> a bit too much impedance mismatch.
> > >      >>>
> > >      >>
> > >      >> IMO ROI on this is low and again let's see what Boris comes up with.
> > >      >>
> > >      >> Matt
> > >      >>
> > >      >>> Oh and as a side note, when I went into the drm_sched code base
> > >     to remind
> > >      >>> myself how things worked, it is quite easy to find some FIXME
> > >     comments which
> > >      >>> suggest people working on it are unsure of the locking design there
> > >     and such. So
> > >      >>> perhaps that all needs cleanup too, I mean it would benefit from
> > >      >>> refactoring/improving work as brainstormed above anyway.
> > >      >>>
> > >      >>> Regards,
> > >      >>>
> > >      >>> Tvrtko
> > > 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 14:08                       ` Jason Ekstrand
@ 2023-01-11  8:50                         ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-11  8:50 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Matthew Brost, intel-gfx, dri-devel


On 10/01/2023 14:08, Jason Ekstrand wrote:
> On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin 
> <tvrtko.ursulin@linux.intel.com> 
> wrote:
> 
> 
> 
>     On 09/01/2023 17:27, Jason Ekstrand wrote:
> 
>     [snip]
> 
>      >      >>> AFAICT it proposes to have 1:1 between *userspace* created
>      >     contexts (per
>      >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
>      >     invasive changes
>      >      >>> to the shared code is in the spirit of the overall idea
>     and instead
>      >      >>> opportunity should be used to look at ways to
>     refactor/improve
>      >     drm_sched.
>      >
>      >
>      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
>     all or
>      > really needs to drive a re-factor.  (More on that later.) 
>     There's only
>      > one real issue which is that it fires off potentially a lot of
>     kthreads.
>      > Even that's not that bad given that kthreads are pretty light and
>     you're
>      > not likely to have more kthreads than userspace threads which are
>     much
>      > heavier.  Not ideal, but not the end of the world either. 
>     Definitely
>      > something we can/should optimize but if we went through with Xe
>     without
>      > this patch, it would probably be mostly ok.
>      >
>      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
>      >      >>
>      >      >> I'm not really prepared to make large changes to DRM
>     scheduler
>      >     at the
>      >      >> moment for Xe as they are not really required nor does Boris
>      >     seem they
>      >      >> will be required for his work either. I am interested to see
>      >     what Boris
>      >      >> comes up with.
>      >      >>
>      >      >>> Even on the low level, the idea to replace drm_sched threads
>      >     with workers
>      >      >>> has a few problems.
>      >      >>>
>      >      >>> To start with, the pattern of:
>      >      >>>
>      >      >>>    while (not_stopped) {
>      >      >>>     keep picking jobs
>      >      >>>    }
>      >      >>>
>      >      >>> Feels fundamentally in disagreement with workers (while
>      >     obviously fits
>      >      >>> perfectly with the current kthread design).
>      >      >>
>      >      >> The while loop breaks and the worker exits if no jobs are ready.
>      >
>      >
>      > I'm not very familiar with workqueues. What are you saying would fit
>      > better? One scheduling job per work item rather than one big work
>     item
>      > which handles all available jobs?
> 
>     Yes and no, it indeed IMO does not fit to have a work item which is
>     potentially unbound in runtime. But it is a bit of a moot conceptual
>     mismatch,
>     because it is a worst case / theoretical one, and I think due to more
>     fundamental concerns.
> 
>     If we have to go back to the low level side of things, I've picked this
>     random spot to consolidate what I have already mentioned and perhaps
>     expand.
> 
>     To start with, let me pull out some thoughts from workqueue.rst:
> 
>     """
>     Generally, work items are not expected to hog a CPU and consume many
>     cycles. That means maintaining just enough concurrency to prevent work
>     processing from stalling should be optimal.
>     """
> 
>     For unbound queues:
>     """
>     The responsibility of regulating concurrency level is on the users.
>     """
> 
>     Given the unbound queues will be spawned on demand to service all
>     queued
>     work items (more interesting when mixing up with the
>     system_unbound_wq),
>     in the proposed design the number of instantiated worker threads does
>     not correspond to the number of user threads (as you have elsewhere
>     stated), but pessimistically to the number of active user contexts.
> 
> 
> Those are pretty much the same in practice.  Rather, user threads are 
> typically an upper bound on the number of contexts.  Yes, a single user 
> thread could have a bunch of contexts but basically nothing does that 
> except IGT.  In real-world usage, it's at most one context per user thread.

"Typically" is the key word here, but I am not sure it is good enough. Consider 
this example - Intel Flex 170:

  * Delivers up to 36 streams 1080p60 transcode throughput per card.
  * When scaled to 10 cards in a 4U server configuration, it can support 
up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.

In my experience one transcode stream typically uses 3-4 GPU contexts 
(the buffer travels from vcs -> rcs -> vcs, maybe vecs) from a single 
CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by 
60fps and that is 8640 jobs submitted and completed per second.

144 active contexts in the proposed scheme possibly means 144 
kernel worker threads spawned (driven by 36 transcode CPU threads). (I 
don't think the pools would scale down given all are constantly pinged 
at 60fps.)

And then each of the 144 threads goes to grab the single GuC CT mutex. First 
threads are made schedulable, then put to sleep as mutex 
contention is hit, then woken again as the mutex is released, 
rinse, repeat.

(And yes, this backend contention is there regardless of 1:1:1; it would 
require a different re-design to solve. But it is just a question of 
whether there are 144 contending threads, or just 6 with the thread per 
engine class scheme.)

Then multiply all by 10 for a 4U server use case and you get 1440 worker 
kthreads, yes 10 more CT locks, but contending on how many CPU cores? 
Just so they can grab a timeslice and maybe contend on a mutex as the 
next step.

This example is where it would hurt on large systems. Imagine only an 
even wider media transcode card...
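
To make the pattern concrete, every per-context work item would boil
down to roughly this (illustrative sketch only, all names are made up;
guc_ct_send() and to_guc_ct() stand in for whatever the real CT send
path looks like):

  /* One CT lock per GT, shared by all contexts on that GT. */
  struct guc_ct {
          struct mutex lock;
  };

  static void context_submit_work(struct work_struct *w)
  {
          struct guc_ct *ct = to_guc_ct(w);       /* made-up helper */

          /* 144 of these can become runnable at once... */
          mutex_lock(&ct->lock);
          /* ...but only one at a time can write the single CT channel. */
          guc_ct_send(ct);
          mutex_unlock(&ct->lock);
  }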

The second example is only a single engine class used (3d desktop?) but with 
a bunch of not-runnable jobs queued and waiting on a fence to signal. 
Implicit or explicit dependencies, it doesn't matter. Then the fence signals 
and callbacks run. N work items get scheduled, but they all submit to 
the same HW engine. So we end up with:

         /-- wi1 --\
        / ..     .. \
  cb --+---  wi.. ---+-- rq1 -- .. -- rqN
        \ ..    ..  /
         \-- wiN --/


All that we have achieved is waking up N CPUs to contend on the same 
lock and effectively insert the job into the same single HW queue. I 
don't see any positives there.

This example I think can particularly hurt small / low power devices 
because of needless waking up of many cores for no benefit. Granted, I 
don't have a good feel for how common this pattern is in practice.
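
In code the fan-out above is simply the dependency callback queueing one
work item per waiting context, along the lines of (sketch only; the
waiters list and the xe_sched fields are made up for illustration):

  #include <linux/dma-fence.h>
  #include <linux/workqueue.h>

  static LIST_HEAD(waiters);      /* the N contexts waiting on the fence */

  static void dep_fence_cb(struct dma_fence *f, struct dma_fence_cb *cb)
  {
          struct xe_sched *sched;

          /* Wake one work item per waiting context, all of which will
           * funnel into the same single HW engine queue. */
          list_for_each_entry(sched, &waiters, link)
                  queue_work(system_unbound_wq, &sched->submit_work);
  }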

> 
>     That
>     is the number which drives the maximum number of not-runnable jobs that
>     can become runnable at once, and hence spawn that many work items, and
>     in turn unbound worker threads.
> 
>     Several problems there.
> 
>     It is fundamentally pointless to have potentially that many more
>     threads
>     than the number of CPU cores - it simply creates a scheduling storm.
> 
>     Unbound workers have no CPU / cache locality either and no connection
>     with the CPU scheduler to optimize scheduling patterns. This may matter
>     either on large systems or on small ones. Whereas the current design
>     allows the scheduler to notice a userspace CPU thread keeps waking up the
>     same drm scheduler kernel thread, and so it can keep them on the same
>     CPU, the unbound workers lose that ability and so a 2nd CPU might be
>     getting woken up from low-power sleep for every submission.
> 
>     Hence, apart from being a bit of an impedance mismatch, the proposal has
>     the potential to change performance and power patterns on both large
>     and small machines.
> 
> 
> Ok, thanks for explaining the issue you're seeing in more detail.  Yes, 
> deferred kwork does appear to mismatch somewhat with what the scheduler 
> needs or at least how it's worked in the past.  How much impact will 
> that mismatch have?  Unclear.
> 
>      >      >>> Secondly, it probably demands separate workers (not
>     optional),
>      >     otherwise
>      >      >>> behaviour of shared workqueues has either the potential to
>      >     explode the number of
>      >      >>> kernel threads anyway, or add latency.
>      >      >>>
>      >      >>
>      >      >> Right now the system_unbound_wq is used which does have a
>     limit
>      >     on the
>      >      >> number of threads, right? I do have a FIXME to allow a
>     worker to be
>      >      >> passed in similar to TDR.
>      >      >>
>      >      >> WRT to latency, the 1:1 ratio could actually have lower
>     latency
>      >     as 2 GPU
>      >      >> schedulers can be pushing jobs into the backend / cleaning up
>      >     jobs in
>      >      >> parallel.
>      >      >>
>      >      >
>      >      > Thought of one more point here on why in Xe we
>     absolutely want
>      >     a 1 to
>      >      > 1 ratio between entity and scheduler - the way we implement
>      >     timeslicing
>      >      > for preempt fences.
>      >      >
>      >      > Let me try to explain.
>      >      >
>      >      > Preempt fences are implemented via the generic messaging
>      >     interface [1]
>      >      > with suspend / resume messages. If a suspend message is
>     received too
>      >      > soon after calling resume (this is per entity) we simply
>     sleep in the
>      >      > suspend call thus giving the entity a timeslice. This
>     completely
>      >     falls
>      >      > apart with a many to 1 relationship as now an entity
>     waiting for a
>      >      > timeslice blocks the other entities. Could we work around
>     this,
>      >     sure but
>      >      > just another bunch of code we'd have to add in Xe. Being able to
>      >     freely sleep
>      >      > in the backend without affecting other entities is really, really
>      >     nice IMO
>      >      > and I bet Xe isn't the only driver that is going to feel
>     this way.
>      >      >
>      >      > Last thing I'll say regardless of how anyone feels about
>     Xe using
>      >     a 1 to
>      >      > 1 relationship this patch IMO makes sense as I hope we can all
>      >     agree a
>      >      > workqueue scales better than kthreads.
>      >
>      >     I don't know for sure what will scale better and for what use
>     case,
>      >     combination of CPU cores vs number of GPU engines to keep
>     busy vs other
>      >     system activity. But I wager someone is bound to ask for some
>      >     numbers to
>      >     make sure proposal is not negatively affecting any other drivers.
>      >
>      >
>      > Then let them ask.  Waving your hands vaguely in the direction of
>     the
>      > rest of DRM and saying "Uh, someone (not me) might object" is
>     profoundly
>      > unhelpful.  Sure, someone might.  That's why it's on dri-devel. 
>     If you
>      > think there's someone in particular who might have a useful
>     opinion on
>      > this, throw them in the CC so they don't miss the e-mail thread.
>      >
>      > Or are you asking for numbers?  If so, what numbers are you
>     asking for?
> 
>     It was a heads up to the Xe team in case people weren't appreciating
>     how
>     the proposed change has the potential to influence power and performance
>     across the board. And nothing in the follow up discussion made me think
>     it was considered so I don't think it was redundant to raise it.
> 
>     In my experience it is typical that such core changes come with some
>     numbers. Which in the case of the drm scheduler is tricky and probably
>     requires explicitly asking everyone to test (rather than count on
>     "don't
>     miss the email thread"). Real products can fail to ship due to ten mW here
>     or there. Like suddenly an extra core prevented from getting into deep
>     sleep.
> 
>     If that was "profoundly unhelpful" so be it.
> 
> 
> With your above explanation, it makes more sense what you're asking.  
> It's still not something Matt is likely to be able to provide on his 
> own.  We need to tag some other folks and ask them to test it out.  We 
> could play around a bit with it on Xe but it's not exactly production 
> grade yet and is going to hit this differently from most.  Likely 
> candidates are probably AMD and Freedreno.

Whoever is set up to check out power and performance would be good to 
give it a spin, yes.

PS. I don't think I was asking Matt to test with other devices. To start 
with, I think Xe is a team effort. I was asking for more background on 
the design decision, since patch 4/20 does not say anything on that 
angle, nor was it IMO sufficiently addressed later in the thread.

>      > Also, if we're talking about a design that might paint us into an
>      > Intel-HW-specific hole, that would be one thing.  But we're not. 
>     We're
>      > talking about switching which kernel threading/task mechanism to
>     use for
>      > what's really a very generic problem.  The core Xe design works
>     without
>      > this patch (just with more kthreads).  If we land this patch or
>      > something like it and get it wrong and it causes a performance
>     problem
>      > for someone down the line, we can revisit it.
> 
>     For some definition of "it works" - I really wouldn't suggest
>     shipping a
>     kthread per user context at any point.
> 
> 
> You have yet to elaborate on why. What resources is it consuming that's 
> going to be a problem? Are you anticipating CPU affinity problems? Or 
> does it just seem wasteful?

Well I don't know, the commit message says the approach does not scale. :)

> I think I largely agree that it's probably unnecessary/wasteful but 
> reducing the number of kthreads seems like a tractable problem to solve 
> regardless of where we put the gpu_scheduler object.  Is this the right 
> solution?  Maybe not.  It was also proposed at one point that we could 
> split the scheduler into two pieces: A scheduler which owns the kthread, 
> and a back-end which targets some HW ring thing where you can have 
> multiple back-ends per scheduler.  That's certainly more invasive from a 
> DRM scheduler internal API PoV but would solve the kthread problem in a 
> way that's more similar to what we have now.
> 
>      >     In any case that's a low level question caused by the high
>     level design
>      >     decision. So I'd think first focus on the high level - which
>     is the 1:1
>      >     mapping of entity to scheduler instance proposal.
>      >
>      >     Fundamentally it will be up to the DRM maintainers and the
>     community to
>      >     bless your approach. And it is important to stress 1:1 is about
>      >     userspace contexts, so I believe unlike any other current
>     scheduler
>      >     user. And also important to stress this effectively does not
>     make Xe
>      >     _really_ use the scheduler that much.
>      >
>      >
>      > I don't think this makes Xe nearly as much of a one-off as you
>     think it
>      > does.  I've already told the Asahi team working on Apple M1/2
>     hardware
>      > to do it this way and it seems to be a pretty good mapping for
>     them. I
>      > believe this is roughly the plan for nouveau as well.  It's not
>     the way
>      > it currently works for anyone because most other groups aren't
>     doing FW
>      > scheduling yet.  In the world of FW scheduling and hardware
>     designed to
>      > support userspace direct-to-FW submit, I think the design makes
>     perfect
>      > sense (see below) and I expect we'll see more drivers move in this
>      > direction as those drivers evolve.  (AMD is doing some customish
>     thing
>      > for now with gpu_scheduler on the front-end somehow. I've not dug
>     into
>      > those details.)
>      >
>      >     I can only offer my opinion, which is that the two options
>     mentioned in
>      >     this thread (either improve drm scheduler to cope with what is
>      >     required,
>      >     or split up the code so you can use just the parts of
>     drm_sched which
>      >     you want - which is frontend dependency tracking) shouldn't be so
>      >     readily dismissed, given how I think the idea was for the new
>     driver to
>      >     work less in a silo and more in the community (not do kludges to
>      >     workaround stuff because it is thought to be too hard to
>     improve common
>      >     code), but fundamentally, "goto previous paragraph" as far as I am
>      >     concerned.
>      >
>      >
>      > Meta comment:  It appears as if you're falling into the standard
>     i915
>      > team trap of having an internal discussion about what the community
>      > discussion might look like instead of actually having the community
>      > discussion.  If you are seriously concerned about interactions with
>      > other drivers or with setting common direction, the right
>     way to
>      > do that is to break a patch or two out into a separate RFC series
>     and
>      > tag a handful of driver maintainers.  Trying to predict the
>     questions
>      > other people might ask is pointless. Cc them and ask for their
>     input
>      > instead.
> 
>     I don't follow you here. It's not an internal discussion - I am raising
>     my concerns on the design publicly. Am I supposed to write a patch to
>     show something before being allowed to comment on an RFC series?
> 
> 
> I may have misread your tone a bit.  It felt a bit like too many 
> discussions I've had in the past where people are trying to predict what 
> others will say instead of just asking them.  Reading it again, I was 
> probably jumping to conclusions a bit.  Sorry about that.

Okay, no problem, thanks. In any case we don't have to keep discussing 
it; as I wrote one or two emails ago, it is fundamentally on the 
maintainers and community to ack the approach. I only felt the RFC did 
not explain the potential downsides sufficiently, so I wanted to probe 
that area a bit.

>     It is "drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> 
> 
> Yeah, it probably should have.  I think that's mostly what I've been 
> trying to say.
> 
>      >
>      >     Regards,
>      >
>      >     Tvrtko
>      >
>      >     P.S. And as a related side note, there are more areas where
>     drm_sched
>      >     could be improved, like for instance priority handling.
>      >     Take a look at msm_submitqueue_create /
>     msm_gpu_convert_priority /
>      >     get_sched_entity to see how msm works around the drm_sched
>     hardcoded
>      >     limit of available priority levels, in order to avoid having
>     to leave a
>      >     hw capability unused. I suspect msm would be happier if they
>     could have
>      >     all priority levels equal in terms of whether they apply only
>     at the
>      >     frontend level or completely throughout the pipeline.
>      >
>      >      > [1]
>      >
>     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
>      >      >
>      >      >>> What would be interesting to learn is whether the option of
>      >     refactoring
>      >      >>> drm_sched to deal with out of order completion was
>     considered
>      >     and what were
>      >      >>> the conclusions.
>      >      >>>
>      >      >>
>      >      >> I coded this up a while back when trying to convert the
>     i915 to
>      >     the DRM
>      >      >> scheduler; it isn't all that hard either. The free flow
>     control
>      >     on the
>      >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
>      >     really what
>      >      >> sold me on this design.
>      >
>      >
>      > You're not the only one to suggest supporting out-of-order
>     completion.
>      > However, it's tricky and breaks a lot of internal assumptions of the
>      > scheduler. It also reduces functionality a bit because it can no
>     longer
>      > automatically rate-limit HW/FW queues which are often
>     fixed-size.  (Ok,
>      > yes, it probably could but it becomes a substantially harder
>     problem.)
>      >
>      > It also seems like a worse mapping to me.  The goal here is to turn
>      > submissions on a userspace-facing engine/queue into submissions
>     to a FW
>      > queue, sorting out any dma_fence dependencies.  Matt's
>      > description of saying this is a 1:1 mapping between sched/entity
>     doesn't
>      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
>      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
>     mapping?
>      > Why is that better?
> 
>     As I have stated before, what I think would fit well for Xe is one
>     drm_scheduler per engine class. In specific terms on our current
>     hardware, one drm scheduler instance for render, compute, blitter,
>     video
>     and video enhance. Userspace contexts remain scheduler entities.
> 
> 
> And this is where we fairly strongly disagree.  More in a bit.
> 
>     That way you avoid the whole kthread/kworker story and you have it
>     actually use the entity picking code in the scheduler, which may be
>     useful when the backend is congested.
> 
> 
> What back-end congestion are you referring to here?  Running out of FW 
> queue IDs?  Something else?

CT channel, number of context ids.

> 
>     Yes you have to solve the out of order problem so in my mind that is
>     something to discuss. What the problem actually is (just TDR?), how
>     tricky and why etc.
> 
>     And yes you lose the handy LRCA ring buffer size management so you'd
>     have to make those entities not runnable in some other way.
> 
>     Regarding the argument you raise below - would any of that make the
>     frontend / backend separation worse and why? Do you think it is less
>     natural? If neither is true then all remains is that it appears extra
>     work to support out of order completion of entities has been discounted
>     in favour of an easy but IMO inelegant option.
> 
> 
> Broadly speaking, the kernel needs to stop thinking about GPU scheduling 
> in terms of scheduling jobs and start thinking in terms of scheduling 
> contexts/engines.  There is still some need for scheduling individual 
> jobs but that is only for the purpose of delaying them as needed to 
> resolve dma_fence dependencies.  Once dependencies are resolved, they 
> get shoved onto the context/engine queue and from there the kernel only 
> really manages whole contexts/engines.  This is a major architectural 
> shift, entirely different from the way i915 scheduling works.  It's also 
> different from the historical usage of DRM scheduler which I think is 
> why this all looks a bit funny.
> 
> To justify this architectural shift, let's look at where we're headed.  
> In the glorious future...
> 
>   1. Userspace submits directly to firmware queues.  The kernel has no 
> visibility whatsoever into individual jobs.  At most it can pause/resume 
> FW contexts as needed to handle eviction and memory management.
> 
>   2. Because of 1, apart from handing out the FW queue IDs at the 
> beginning, the kernel can't really juggle them that much.  Depending on 
> FW design, it may be able to pause a client, give its IDs to another, 
> and then resume it later when IDs free up.  What it's not doing is 
> juggling IDs on a job-by-job basis like i915 currently is.
> 
>   3. Long-running compute jobs may not complete for days.  This means 
> that memory management needs to happen in terms of pause/resume of 
> entire contexts/engines using the memory rather than based on waiting 
> for individual jobs to complete or pausing individual jobs until the 
> memory is available.
> 
>   4. Synchronization happens via userspace memory fences (UMF) and the 
> kernel is mostly unaware of most dependencies and when a context/engine 
> is or is not runnable.  Instead, it keeps as many of them minimally 
> active (memory is available, even if it's in system RAM) as possible and 
> lets the FW sort out dependencies.  (There may need to be some facility 
> for sleeping a context until a memory change similar to futex() or 
> poll() for userspace threads.  There are some details TBD.)
> 
> Are there potential problems that will need to be solved here?  Yes.  Is 
> it a good design?  Well, Microsoft has been living in this future for 
> half a decade or better and it's working quite well for them.  It's also 
> the way all modern game consoles work.  It really is just Linux that's 
> stuck with the same old job model we've had since the monumental shift 
> to DRI2.
> 
> To that end, one of the core goals of the Xe project was to make the 
> driver internally behave as close to the above model as possible while 
> keeping the old-school job model as a very thin layer on top.  As the 
> broader ecosystem problems (window-system support for UMF, for instance) 
> are solved, that layer can be peeled back.  The core driver will already 
> be ready for it.
> 
> To that end, the point of the DRM scheduler in Xe isn't to schedule 
> jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies 
> and stuff jobs into their respective context/engine queue once they're 
> ready.  All the actual scheduling happens in firmware and any scheduling 
> the kernel does to deal with contention, oversubscriptions, too many 
> contexts, etc. is between contexts/engines, not individual jobs.  Sure, 
> the individual job visibility is nice, but if we design around it, we'll 
> never get to the glorious future.
> 
> I really need to turn the above (with a bit more detail) into a blog 
> post.... Maybe I'll do that this week.
> 
> In any case, I hope that provides more insight into why Xe is designed 
> the way it is and why I'm pushing back so hard on trying to make it more 
> of a "classic" driver as far as scheduling is concerned.  Are there 
> potential problems here?  Yes, that's why Xe has been labeled a 
> prototype.  Are such radical changes necessary to get to said glorious 
> future?  Yes, I think they are.  Will it be worth it?  I believe so.

Right, that's all solid I think. My takeaway is that frontend priority 
sorting and that stuff isn't needed and that is okay. And that there are 
multiple options to maybe improve the drm scheduler, like the aforementioned 
making it deal with out of order completion, or splitting it into functional 
components, or splitting frontend/backend as you suggested. For most of them 
the cost vs benefit is more or less not completely clear, nor is it clear how 
much effort was invested to look into them.

One thing I missed from this explanation is how drm_scheduler per engine 
class interferes with the high level concepts. And I did not manage to 
pick up on what exactly the TDR problem is in that case. Maybe the two 
are one and the same.

Bottom line is I still have the concern that the conversion to kworkers 
has the potential to regress. Possibly more so for some Xe use 
cases than for other vendors, since they would still be using per 
physical engine / queue scheduler instances.

And to put my money where my mouth is, I will try to add testing Xe 
inside the full-blown ChromeOS environment to my team's plans. It would 
probably also be beneficial if the Xe team could take a look at the real 
world behaviour of the extreme transcode use cases too, if the stack is 
ready for that. It would be better to know earlier rather than later 
if there is a fundamental issue.

For the patch at hand, and the cover letter, it certainly feels they would 
benefit from recording the past design discussion with the AMD folks, from 
explicitly copying (Cc) other drivers, and from recording the theoretical 
pros and cons of threads vs unbound workers as I have tried to highlight them.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-11  8:50                         ` Tvrtko Ursulin
  0 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-11  8:50 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel


On 10/01/2023 14:08, Jason Ekstrand wrote:
> On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin 
> <tvrtko.ursulin@linux.intel.com <mailto:tvrtko.ursulin@linux.intel.com>> 
> wrote:
> 
> 
> 
>     On 09/01/2023 17:27, Jason Ekstrand wrote:
> 
>     [snip]
> 
>      >      >>> AFAICT it proposes to have 1:1 between *userspace* created
>      >     contexts (per
>      >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
>      >     invasive changes
>      >      >>> to the shared code is in the spirit of the overall idea
>     and instead
>      >      >>> opportunity should be used to look at way to
>     refactor/improve
>      >     drm_sched.
>      >
>      >
>      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
>     all or
>      > really needs to drive a re-factor.  (More on that later.) 
>     There's only
>      > one real issue which is that it fires off potentially a lot of
>     kthreads.
>      > Even that's not that bad given that kthreads are pretty light and
>     you're
>      > not likely to have more kthreads than userspace threads which are
>     much
>      > heavier.  Not ideal, but not the end of the world either. 
>     Definitely
>      > something we can/should optimize but if we went through with Xe
>     without
>      > this patch, it would probably be mostly ok.
>      >
>      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
>      >      >>
>      >      >> I'm not really prepared to make large changes to DRM
>     scheduler
>      >     at the
>      >      >> moment for Xe as they are not really required nor does Boris
>      >     seem they
>      >      >> will be required for his work either. I am interested to see
>      >     what Boris
>      >      >> comes up with.
>      >      >>
>      >      >>> Even on the low level, the idea to replace drm_sched threads
>      >     with workers
>      >      >>> has a few problems.
>      >      >>>
>      >      >>> To start with, the pattern of:
>      >      >>>
>      >      >>>    while (not_stopped) {
>      >      >>>     keep picking jobs
>      >      >>>    }
>      >      >>>
>      >      >>> Feels fundamentally in disagreement with workers (while
>      >     obviously fits
>      >      >>> perfectly with the current kthread design).
>      >      >>
>      >      >> The while loop breaks and worker exists if no jobs are ready.
>      >
>      >
>      > I'm not very familiar with workqueues. What are you saying would fit
>      > better? One scheduling job per work item rather than one big work
>     item
>      > which handles all available jobs?
> 
>     Yes and no, it indeed IMO does not fit to have a work item which is
>     potentially unbound in runtime. But it is a bit moot conceptual
>     mismatch
>     because it is a worst case / theoretical, and I think due more
>     fundamental concerns.
> 
>     If we have to go back to the low level side of things, I've picked this
>     random spot to consolidate what I have already mentioned and perhaps
>     expand.
> 
>     To start with, let me pull out some thoughts from workqueue.rst:
> 
>     """
>     Generally, work items are not expected to hog a CPU and consume many
>     cycles. That means maintaining just enough concurrency to prevent work
>     processing from stalling should be optimal.
>     """
> 
>     For unbound queues:
>     """
>     The responsibility of regulating concurrency level is on the users.
>     """
> 
>     Given the unbound queues will be spawned on demand to service all
>     queued
>     work items (more interesting when mixing up with the
>     system_unbound_wq),
>     in the proposed design the number of instantiated worker threads does
>     not correspond to the number of user threads (as you have elsewhere
>     stated), but pessimistically to the number of active user contexts.
> 
> 
> Those are pretty much the same in practice.  Rather, user threads is 
> typically an upper bound on the number of contexts.  Yes, a single user 
> thread could have a bunch of contexts but basically nothing does that 
> except IGT.  In real-world usage, it's at most one context per user thread.

Typically is the key here. But I am not sure it is good enough. Consider 
this example - Intel Flex 170:

  * Delivers up to 36 streams 1080p60 transcode throughput per card.
  * When scaled to 10 cards in a 4U server configuration, it can support 
up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.

One transcode stream from my experience typically is 3-4 GPU contexts 
(buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single 
CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by 
60fps = 8640 jobs submitted and completed per second.

144 active contexts in the proposed scheme means possibly means 144 
kernel worker threads spawned (driven by 36 transcode CPU threads). (I 
don't think the pools would scale down given all are constantly pinged 
at 60fps.)

And then each of 144 threads goes to grab the single GuC CT mutex. First 
threads are being made schedulable, then put to sleep as mutex 
contention is hit, then woken again as mutexes are getting released, 
rinse, repeat.

(And yes this backend contention is there regardless of 1:1:1, it would 
require a different re-design to solve that. But it is just a question 
whether there are 144 contending threads, or just 6 with the thread per 
engine class scheme.)

Then multiply all by 10 for a 4U server use case and you get 1440 worker 
kthreads, yes 10 more CT locks, but contending on how many CPU cores? 
Just so they can grab a timeslice and maybe content on a mutex as the 
next step.

This example is where it would hurt on large systems. Imagine only an 
even wider media transcode card...

Second example is only a single engine class used (3d desktop?) but with 
a bunch of not-runnable jobs queued and waiting on a fence to signal. 
Implicit or explicit dependencies doesn't matter. Then the fence signals 
and call backs run. N work items get scheduled, but they all submit to 
the same HW engine. So we end up with:

         /-- wi1 --\
        / ..     .. \
  cb --+---  wi.. ---+-- rq1 -- .. -- rqN
        \ ..    ..  /
         \-- wiN --/


All that we have achieved is waking up N CPUs to contend on the same 
lock and effectively insert the job into the same single HW queue. I 
don't see any positives there.

This example I think can particularly hurt small / low power devices 
because of needless waking up of many cores for no benefit. Granted, I 
don't have a good feel on how common this pattern is in practice.

> 
>     That
>     is the number which drives the maximum number of not-runnable jobs that
>     can become runnable at once, and hence spawn that many work items, and
>     in turn unbound worker threads.
> 
>     Several problems there.
> 
>     It is fundamentally pointless to have potentially that many more
>     threads
>     than the number of CPU cores - it simply creates a scheduling storm.
> 
>     Unbound workers have no CPU / cache locality either and no connection
>     with the CPU scheduler to optimize scheduling patterns. This may matter
>     either on large systems or on small ones. Whereas the current design
>     allows for scheduler to notice userspace CPU thread keeps waking up the
>     same drm scheduler kernel thread, and so it can keep them on the same
>     CPU, the unbound workers lose that ability and so 2nd CPU might be
>     getting woken up from low sleep for every submission.
> 
>     Hence, apart from being a bit of a impedance mismatch, the proposal has
>     the potential to change performance and power patterns and both large
>     and small machines.
> 
> 
> Ok, thanks for explaining the issue you're seeing in more detail.  Yes, 
> deferred kwork does appear to mismatch somewhat with what the scheduler 
> needs or at least how it's worked in the past.  How much impact will 
> that mismatch have?  Unclear.
> 
>      >      >>> Secondly, it probably demands separate workers (not
>     optional),
>      >     otherwise
>      >      >>> behaviour of shared workqueues has either the potential to
>      >     explode number
>      >      >>> kernel threads anyway, or add latency.
>      >      >>>
>      >      >>
>      >      >> Right now the system_unbound_wq is used which does have a
>     limit
>      >     on the
>      >      >> number of threads, right? I do have a FIXME to allow a
>     worker to be
>      >      >> passed in similar to TDR.
>      >      >>
>      >      >> WRT to latency, the 1:1 ratio could actually have lower
>     latency
>      >     as 2 GPU
>      >      >> schedulers can be pushing jobs into the backend / cleaning up
>      >     jobs in
>      >      >> parallel.
>      >      >>
>      >      >
>      >      > Thought of one more point here where why in Xe we
>     absolutely want
>      >     a 1 to
>      >      > 1 ratio between entity and scheduler - the way we implement
>      >     timeslicing
>      >      > for preempt fences.
>      >      >
>      >      > Let me try to explain.
>      >      >
>      >      > Preempt fences are implemented via the generic messaging
>      >     interface [1]
>      >      > with suspend / resume messages. If a suspend messages is
>     received to
>      >      > soon after calling resume (this is per entity) we simply
>     sleep in the
>      >      > suspend call thus giving the entity a timeslice. This
>     completely
>      >     falls
>      >      > apart with a many to 1 relationship as now a entity
>     waiting for a
>      >      > timeslice blocks the other entities. Could we work aroudn
>     this,
>      >     sure but
>      >      > just another bunch of code we'd have to add in Xe. Being to
>      >     freely sleep
>      >      > in backend without affecting other entities is really, really
>      >     nice IMO
>      >      > and I bet Xe isn't the only driver that is going to feel
>     this way.
>      >      >
>      >      > Last thing I'll say regardless of how anyone feels about
>     Xe using
>      >     a 1 to
>      >      > 1 relationship this patch IMO makes sense as I hope we can all
>      >     agree a
>      >      > workqueue scales better than kthreads.
>      >
>      >     I don't know for sure what will scale better and for what use
>     case,
>      >     combination of CPU cores vs number of GPU engines to keep
>     busy vs other
>      >     system activity. But I wager someone is bound to ask for some
>      >     numbers to
>      >     make sure proposal is not negatively affecting any other drivers.
>      >
>      >
>      > Then let them ask.  Waving your hands vaguely in the direction of
>     the
>      > rest of DRM and saying "Uh, someone (not me) might object" is
>     profoundly
>      > unhelpful.  Sure, someone might.  That's why it's on dri-devel. 
>     If you
>      > think there's someone in particular who might have a useful
>     opinion on
>      > this, throw them in the CC so they don't miss the e-mail thread.
>      >
>      > Or are you asking for numbers?  If so, what numbers are you
>     asking for?
> 
>     It was a heads up to the Xe team in case people weren't appreciating
>     how the proposed change has the potential to influence power and
>     performance across the board. And nothing in the follow up discussion
>     made me think it was considered, so I don't think it was redundant to
>     raise it.
> 
>     In my experience it is typical that such core changes come with some
>     numbers. Which in the case of the drm scheduler is tricky and probably
>     requires explicitly asking everyone to test (rather than counting on
>     "don't miss the email thread"). Real products can fail to ship due to
>     ten mW here or there. Like an extra core suddenly being prevented from
>     getting into deep sleep.
> 
>     If that was "profoundly unhelpful" so be it.
> 
> 
> With your above explanation, it makes more sense what you're asking.  
> It's still not something Matt is likely to be able to provide on his 
> own.  We need to tag some other folks and ask them to test it out.  We 
> could play around a bit with it on Xe but it's not exactly production 
> grade yet and is going to hit this differently from most.  Likely 
> candidates are probably AMD and Freedreno.

Whoever is setup to check out power and performance would be good to 
give it a spin, yes.

PS. I don't think I was asking Matt to test with other devices. To start 
with I think Xe is a team effort. I was asking for more background on 
the design decision since patch 4/20 does not say anything on that 
angle, nor was it IMO sufficiently addressed later in the thread.

>      > Also, If we're talking about a design that might paint us into an
>      > Intel-HW-specific hole, that would be one thing.  But we're not. 
>     We're
>      > talking about switching which kernel threading/task mechanism to
>     use for
>      > what's really a very generic problem.  The core Xe design works
>     without
>      > this patch (just with more kthreads).  If we land this patch or
>      > something like it and get it wrong and it causes a performance
>     problem
>      > for someone down the line, we can revisit it.
> 
>     For some definition of "it works" - I really wouldn't suggest
>     shipping a
>     kthread per user context at any point.
> 
> 
> You have yet to elaborate on why. What resources is it consuming that's 
> going to be a problem? Are you anticipating CPU affinity problems? Or 
> does it just seem wasteful?

Well I don't know, the commit message says the approach does not scale. :)

> I think I largely agree that it's probably unnecessary/wasteful but 
> reducing the number of kthreads seems like a tractable problem to solve 
> regardless of where we put the gpu_scheduler object.  Is this the right 
> solution?  Maybe not.  It was also proposed at one point that we could 
> split the scheduler into two pieces: A scheduler which owns the kthread, 
> and a back-end which targets some HW ring thing where you can have 
> multiple back-ends per scheduler.  That's certainly more invasive from a 
> DRM scheduler internal API PoV but would solve the kthread problem in a 
> way that's more similar to what we have now.
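For reference, that split could look roughly like the purely 
hypothetical sketch below; none of these types exist in today's 
drm_sched, every name is invented here:

#include <linux/list.h>
#include <linux/sched.h>
#include <drm/gpu_scheduler.h>

struct example_sched_backend {
	/* one per HW/FW ring; pushes a ready job to the hardware */
	void (*run_job)(struct example_sched_backend *be,
			struct drm_sched_job *job);
	struct list_head link;		/* on the front-end's list */
};

struct example_sched_frontend {
	struct task_struct *thread;	/* the single scheduling kthread */
	struct list_head backends;	/* multiple rings per scheduler */
	/* entity picking and dependency tracking stay up here */
};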
> 
>      >     In any case that's a low level question caused by the high
>     level design
>      >     decision. So I'd think first focus on the high level - which
>     is the 1:1
>      >     mapping of entity to scheduler instance proposal.
>      >
>      >     Fundamentally it will be up to the DRM maintainers and the
>     community to
>      >     bless your approach. And it is important to stress 1:1 is about
>      >     userspace contexts, so I believe unlike any other current
>     scheduler
>      >     user. And also important to stress this effectively does not
>     make Xe
>      >     _really_ use the scheduler that much.
>      >
>      >
>      > I don't think this makes Xe nearly as much of a one-off as you
>     think it
>      > does.  I've already told the Asahi team working on Apple M1/2
>     hardware
>      > to do it this way and it seems to be a pretty good mapping for
>     them. I
>      > believe this is roughly the plan for nouveau as well.  It's not
>     the way
>      > it currently works for anyone because most other groups aren't
>     doing FW
>      > scheduling yet.  In the world of FW scheduling and hardware
>     designed to
>      > support userspace direct-to-FW submit, I think the design makes
>     perfect
>      > sense (see below) and I expect we'll see more drivers move in this
>      > direction as those drivers evolve.  (AMD is doing some customish
>     thing
>      > for now with gpu_scheduler on the front-end somehow. I've not dug
>      > into those details.)
>      >
>      >     I can only offer my opinion, which is that the two options
>     mentioned in
>      >     this thread (either improve drm scheduler to cope with what is
>      >     required,
>      >     or split up the code so you can use just the parts of
>     drm_sched which
>      >     you want - which is frontend dependency tracking) shouldn't be so
>      >     readily dismissed, given how I think the idea was for the new
>     driver to
>      >     work less in a silo and more in the community (not do kludges to
>      >     workaround stuff because it is thought to be too hard to
>     improve common
>      >     code), but fundamentally, "goto previous paragraph" for what I am
>      >     concerned.
>      >
>      >
>      > Meta comment:  It appears as if you're falling into the standard
>     i915
>      > team trap of having an internal discussion about what the community
>      > discussion might look like instead of actually having the community
>      > discussion.  If you are seriously concerned about interactions with
>      >      > other drivers or about setting common direction, the right
>     way to
>      > do that is to break a patch or two out into a separate RFC series
>     and
>      > tag a handful of driver maintainers.  Trying to predict the
>     questions
>      >      > other people might ask is pointless. Cc them and ask for
>      >      > their input instead.
> 
>     I don't follow you here. It's not an internal discussion - I am raising
>     my concerns on the design publicly. I am supposed to write a patch to
>     show something, but am I not allowed to comment on a RFC series?
> 
> 
> I may have misread your tone a bit.  It felt a bit like too many 
> discussions I've had in the past where people are trying to predict what 
> others will say instead of just asking them.  Reading it again, I was 
> probably jumping to conclusions a bit.  Sorry about that.

Okay no problem, thanks. In any case we don't have to keep discussing 
it, since as I wrote one or two emails ago it is fundamentally on the 
maintainers and community to ack the approach. I only felt like the RFC 
did not explain the potential downsides sufficiently so I wanted to 
probe that area a bit.

>     It is "drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> 
> 
> Yeah, it probably should have.  I think that's mostly what I've been 
> trying to say.
> 
>      >
>      >     Regards,
>      >
>      >     Tvrtko
>      >
>      >     P.S. And as a related side note, there are more areas where
>     drm_sched
>      >     could be improved, like for instance priority handling.
>      >     Take a look at msm_submitqueue_create /
>     msm_gpu_convert_priority /
>      >     get_sched_entity to see how msm works around the drm_sched
>     hardcoded
>      >     limit of available priority levels, in order to avoid having
>     to leave a
>      >     hw capability unused. I suspect msm would be happier if they
>     could have
>      >     all priority levels equal in terms of whether they apply only
>     at the
>      >     frontend level or completely throughout the pipeline.
>      >
>      >      > [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
>      >      >
>      >      >>> What would be interesting to learn is whether the option of
>      >     refactoring
>      >      >>> drm_sched to deal with out of order completion was
>     considered
>      >     and what were
>      >      >>> the conclusions.
>      >      >>>
>      >      >>
>      >      >> I coded this up a while back when trying to convert the
>      >      >> i915 to the DRM scheduler; it isn't all that hard either.
>      >      >> The free flow control on the ring (e.g. set job limit ==
>      >      >> SIZE OF RING / MAX JOB SIZE) is really what sold me on
>      >      >> this design.
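As an illustration, that flow control is just a ring-derived job limit 
handed to the scheduler at init time, along the lines of this hedged 
sketch (the function name is invented; MAX_JOB_SIZE_BYTES mirrors the Xe 
snippet quoted later in this thread and is assumed to be defined):

#include <linux/types.h>

static u32 example_job_limit(u32 ring_size_bytes)
{
	/*
	 * A full scheduler queue then implies a full ring, so no
	 * separate ring-space accounting is needed.
	 */
	return ring_size_bytes / MAX_JOB_SIZE_BYTES;
}

That value is what gets passed as the submission limit argument of 
drm_sched_init().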
>      >
>      >
>      > You're not the only one to suggest supporting out-of-order
>     completion.
>      > However, it's tricky and breaks a lot of internal assumptions of the
>      > scheduler. It also reduces functionality a bit because it can no
>     longer
>      > automatically rate-limit HW/FW queues which are often
>     fixed-size.  (Ok,
>      > yes, it probably could but it becomes a substantially harder
>     problem.)
>      >
>      > It also seems like a worse mapping to me.  The goal here is to turn
>      > submissions on a userspace-facing engine/queue into submissions
>      > to a FW queue, sorting out any dma_fence dependencies.  Matt's
>      > description of saying this is a 1:1 mapping between sched/entity
>     doesn't
>      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
>      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
>     mapping?
>      > Why is that better?
> 
>     As I have stated before, what I think would fit well for Xe is one
>     drm_scheduler per engine class. In specific terms on our current
>     hardware, one drm scheduler instance for render, compute, blitter,
>     video and video enhance. Userspace contexts remain scheduler entities.
> 
> 
> And this is where we fairly strongly disagree.  More in a bit.
> 
>     That way you avoid the whole kthread/kworker story and you have it
>     actually use the entity picking code in the scheduler, which may be
>     useful when the backend is congested.
> 
> 
> What back-end congestion are you referring to here?  Running out of FW 
> queue IDs?  Something else?

CT channel, number of context ids.

> 
>     Yes you have to solve the out of order problem so in my mind that is
>     something to discuss. What the problem actually is (just TDR?), how
>     tricky and why etc.
> 
>     And yes you lose the handy LRCA ring buffer size management so you'd
>     have to make those entities not runnable in some other way.
> 
>     Regarding the argument you raise below - would any of that make the
>     frontend / backend separation worse and why? Do you think it is less
>     natural? If neither is true then all that remains is that the extra
>     work to support out of order completion of entities appears to have
>     been discounted in favour of an easy but IMO inelegant option.
> 
> 
> Broadly speaking, the kernel needs to stop thinking about GPU scheduling 
> in terms of scheduling jobs and start thinking in terms of scheduling 
> contexts/engines.  There is still some need for scheduling individual 
> jobs but that is only for the purpose of delaying them as needed to 
> resolve dma_fence dependencies.  Once dependencies are resolved, they 
> get shoved onto the context/engine queue and from there the kernel only 
> really manages whole contexts/engines.  This is a major architectural 
> shift, entirely different from the way i915 scheduling works.  It's also 
> different from the historical usage of DRM scheduler which I think is 
> why this all looks a bit funny.
> 
> To justify this architectural shift, let's look at where we're headed.  
> In the glorious future...
> 
>   1. Userspace submits directly to firmware queues.  The kernel has no 
> visibility whatsoever into individual jobs.  At most it can pause/resume 
> FW contexts as needed to handle eviction and memory management.
> 
>   2. Because of 1, apart from handing out the FW queue IDs at the 
> beginning, the kernel can't really juggle them that much.  Depending on 
> FW design, it may be able to pause a client, give its IDs to another, 
> and then resume it later when IDs free up.  What it's not doing is 
> juggling IDs on a job-by-job basis like i915 currently is.
> 
>   3. Long-running compute jobs may not complete for days.  This means 
> that memory management needs to happen in terms of pause/resume of 
> entire contexts/engines using the memory rather than based on waiting 
> for individual jobs to complete or pausing individual jobs until the 
> memory is available.
> 
>   4. Synchronization happens via userspace memory fences (UMF) and the 
> kernel is mostly unaware of most dependencies and when a context/engine 
> is or is not runnable.  Instead, it keeps as many of them minimally 
> active (memory is available, even if it's in system RAM) as possible and 
> lets the FW sort out dependencies.  (There may need to be some facility 
> for sleeping a context until a memory change similar to futex() or 
> poll() for userspace threads.  There are some details TBD.)
> 
> Are there potential problems that will need to be solved here?  Yes.  Is 
> it a good design?  Well, Microsoft has been living in this future for 
> half a decade or better and it's working quite well for them.  It's also 
> the way all modern game consoles work.  It really is just Linux that's 
> stuck with the same old job model we've had since the monumental shift 
> to DRI2.
> 
> To that end, one of the core goals of the Xe project was to make the 
> driver internally behave as close to the above model as possible while 
> keeping the old-school job model as a very thin layer on top.  As the 
> broader ecosystem problems (window-system support for UMF, for instance) 
> are solved, that layer can be peeled back.  The core driver will already 
> be ready for it.
> 
> To that end, the point of the DRM scheduler in Xe isn't to schedule 
> jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies 
> and stuff jobs into their respective context/engine queue once they're 
> ready.  All the actual scheduling happens in firmware and any scheduling 
> the kernel does to deal with contention, oversubscriptions, too many 
> contexts, etc. is between contexts/engines, not individual jobs.  Sure, 
> the individual job visibility is nice, but if we design around it, we'll 
> never get to the glorious future.
> 
> I really need to turn the above (with a bit more detail) into a blog 
> post.... Maybe I'll do that this week.
> 
> In any case, I hope that provides more insight into why Xe is designed 
> the way it is and why I'm pushing back so hard on trying to make it more 
> of a "classic" driver as far as scheduling is concerned.  Are there 
> potential problems here?  Yes, that's why Xe has been labeled a 
> prototype.  Are such radical changes necessary to get to said glorious 
> future?  Yes, I think they are.  Will it be worth it?  I believe so.

Right, that's all solid I think. My takeaway is that frontend priority 
sorting and that stuff isn't needed and that is okay. And that there are 
multiple options to maybe improve the drm scheduler, like the 
aforementioned making it deal with out of order completion, or splitting 
it into functional components, or the frontend/backend split you 
suggested. For most of them the cost vs benefit is not completely clear, 
nor is it clear how much effort was invested to look into them.

One thing I missed from this explanation is how drm_scheduler per engine 
class interferes with the high level concepts. And I did not manage to 
pick up on what exactly the TDR problem is in that case. Maybe the two 
are one and the same.
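For concreteness, the per engine class arrangement I keep referring to 
is roughly the following, sketched with invented names (only 
drm_gpu_scheduler and drm_sched_entity_init() are real drm_sched API):

#include <drm/gpu_scheduler.h>

enum example_engine_class {
	EX_CLASS_RENDER,
	EX_CLASS_COMPUTE,
	EX_CLASS_BLITTER,
	EX_CLASS_VIDEO,
	EX_CLASS_VIDEO_ENHANCE,
	EX_NUM_CLASSES,
};

struct example_gt {
	/* one shared instance per class, not one per user context */
	struct drm_gpu_scheduler sched[EX_NUM_CLASSES];
};

/*
 * Each userspace context then holds one entity per class it uses,
 * initialised against the shared scheduler, e.g.:
 *
 *	struct drm_gpu_scheduler *list[] = { &gt->sched[class] };
 *
 *	drm_sched_entity_init(&ctx->entity[class], prio, list, 1, NULL);
 */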

Bottom line is I still have the concern that conversion to kworkers has 
an opportunity to regress. Possibly more opportunity for some Xe use 
cases than to affect other vendors, since they would still be using per 
physical engine / queue scheduler instances.

And to put my money where my mouth is I will try to add testing Xe 
inside the full blown ChromeOS environment to my team's plans. It would 
probably also be beneficial if the Xe team could take a look at the real 
world behaviour of the extreme transcode use cases too, if the stack is 
ready for that and all. It would be better to know earlier rather than 
later if there is a fundamental issue.

For the patch at hand, and the cover letter, it certainly feels they 
would benefit from recording the past design discussion with the AMD 
folks, from explicitly copying other driver maintainers, and from 
recording the theoretical pros and cons of threads vs unbound workers 
as I have tried to highlight them.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11  1:13                         ` Matthew Brost
@ 2023-01-11  9:09                           ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-11  9:09 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, Jason Ekstrand, dri-devel


On 11/01/2023 01:13, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
>> On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
>>>
>>>
>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>
>>> [snip]
>>>
>>>>       >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>      contexts (per
>>>>       >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>>>      invasive changes
>>>>       >>> to the shared code is in the spirit of the overall idea and instead
>>>>       >>> opportunity should be used to look at way to refactor/improve
>>>>      drm_sched.
>>>>
>>>>
>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
>>>> really needs to drive a re-factor.  (More on that later.)  There's only
>>>> one real issue which is that it fires off potentially a lot of kthreads.
>>>> Even that's not that bad given that kthreads are pretty light and you're
>>>> not likely to have more kthreads than userspace threads which are much
>>>> heavier.  Not ideal, but not the end of the world either.  Definitely
>>>> something we can/should optimize but if we went through with Xe without
>>>> this patch, it would probably be mostly ok.
>>>>
>>>>       >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>       >>
>>>>       >> I'm not really prepared to make large changes to DRM scheduler
>>>>      at the
>>>>       >> moment for Xe as they are not really required nor does Boris
>>>>      seem they
>>>>       >> will be required for his work either. I am interested to see
>>>>      what Boris
>>>>       >> comes up with.
>>>>       >>
>>>>       >>> Even on the low level, the idea to replace drm_sched threads
>>>>      with workers
>>>>       >>> has a few problems.
>>>>       >>>
>>>>       >>> To start with, the pattern of:
>>>>       >>>
>>>>       >>>    while (not_stopped) {
>>>>       >>>     keep picking jobs
>>>>       >>>    }
>>>>       >>>
>>>>       >>> Feels fundamentally in disagreement with workers (while
>>>>      obviously fits
>>>>       >>> perfectly with the current kthread design).
>>>>       >>
>>>>       >> The while loop breaks and worker exits if no jobs are ready.
>>>>
>>>>
>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>> better? One scheduling job per work item rather than one big work item
>>>> which handles all available jobs?
>>>
>>> Yes and no, it indeed IMO does not fit to have a work item which is
>>> potentially unbound in runtime. But it is a bit of a moot conceptual
>>> mismatch because it is a worst case / theoretical one, and I think due
>>> to more fundamental concerns.
>>>
>>> If we have to go back to the low level side of things, I've picked this
>>> random spot to consolidate what I have already mentioned and perhaps expand.
>>>
>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>
>>> """
>>> Generally, work items are not expected to hog a CPU and consume many cycles.
>>> That means maintaining just enough concurrency to prevent work processing
>>> from stalling should be optimal.
>>> """
>>>
>>> For unbound queues:
>>> """
>>> The responsibility of regulating concurrency level is on the users.
>>> """
>>>
>>> Given the unbound queues will be spawned on demand to service all queued
>>> work items (more interesting when mixing up with the system_unbound_wq), in
>>> the proposed design the number of instantiated worker threads does not
>>> correspond to the number of user threads (as you have elsewhere stated), but
>>> pessimistically to the number of active user contexts. That is the number
>>> which drives the maximum number of not-runnable jobs that can become
>>> runnable at once, and hence spawn that many work items, and in turn unbound
>>> worker threads.
>>>
>>> Several problems there.
>>>
>>> It is fundamentally pointless to have potentially that many more threads
>>> than the number of CPU cores - it simply creates a scheduling storm.
>>>
>>
>> We can use a different work queue if this is an issue, have a FIXME
>> which indicates we should allow the user to pass in the work queue.
>>
>>> Unbound workers have no CPU / cache locality either and no connection with
>>> the CPU scheduler to optimize scheduling patterns. This may matter either on
>>> large systems or on small ones. Whereas the current design allows for
>>> scheduler to notice userspace CPU thread keeps waking up the same drm
>>> scheduler kernel thread, and so it can keep them on the same CPU, the
>>> unbound workers lose that ability and so 2nd CPU might be getting woken up
>>> from low sleep for every submission.
>>>
>>
>> I guess I don't understand kthread vs. workqueue scheduling internals.
>>   
> 
> Looked into this and we are not using unbound workers, rather we are just
> using the system_wq which is indeed bound. Again we can change this so a
> user can just pass in a worker too. After doing a bit of research, bound
> workers allow the scheduler to use locality to avoid that exact
> problem you're describing.
> 
> TL;DR I'm not buying any of these arguments although it is possible I am
> missing something.

Well you told me it's using unbound.. message id 
Y7dEjcuc1arHBTGu@DUT025-TGLU.fm.intel.com:

"""
Right now the system_unbound_wq is used which does have a limit on the
number of threads, right? I do have a FIXME to allow a worker to be
passed in similar to TDR.
"""

With bound workers you will indeed get CPU locality. I am not sure what 
it will do in terms of concurrency. If it will serialize work items to 
fewer spawned workers that will be good for the CT contention issue, but 
may negatively affect latency. And possibly preemption / time slicing 
decisions since the order of submitting to the backend will not be in 
the order of context priority, hence high prio may be submitted right 
after low and immediately trigger preemption.
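If the workqueue does end up being passed in by the driver, as the FIXME 
mentioned earlier in the thread suggests, the knob would look roughly 
like this hypothetical sketch (the queue name and flags are purely 
illustrative, not a recommendation):

#include <linux/workqueue.h>

static struct workqueue_struct *example_create_run_wq(void)
{
	/*
	 * A dedicated queue instead of system_wq, so concurrency and
	 * priority are under the driver's control rather than shared
	 * with the rest of the system.
	 */
	return alloc_workqueue("example-sched-run", WQ_HIGHPRI, 0);
}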

Anyway, since you are not buying any arguments on paper perhaps you are 
more open towards testing. If you would adapt gem_wsim for Xe you would 
be able to spawn N simulated transcode sessions on any Gen11+ machine 
and try it out.

For example:

gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600

That will run you 36 parallel transcoding sessions for 600 
frames each. No client setup needed whatsoever apart from compiling IGT.

In the past that was quite a handy tool to identify scheduling issues, 
or validate changes against. All workloads with the media prefix have 
actually been hand crafted by looking at what real media pipelines do 
with real data. A few years back at least.

It could show you real world behaviour of the kworkers approach and it 
could also enable you to cross reference any power and performance 
changes relative to i915. Background story there is that media servers 
like to fit N streams to a server and if a change comes along which 
suddenly makes only N-1 stream fit before dropping out of realtime, 
that's a big problem.

If you believe me that there is value in that kind of testing I am happy 
to help you add Xe support to the tool - time permitting, so possibly 
guidance only at the moment.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10 19:01                             ` Matthew Brost
@ 2023-01-11  9:17                               ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-11  9:17 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, Jason Ekstrand


On 10/01/2023 19:01, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
>>
>> On 10/01/2023 15:55, Matthew Brost wrote:
>>> On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
>>>>>
>>>>>
>>>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>        >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>>>       contexts (per
>>>>>>        >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>>>>>       invasive changes
>>>>>>        >>> to the shared code is in the spirit of the overall idea and
>>>>>> instead
>>>>>>        >>> opportunity should be used to look at way to refactor/improve
>>>>>>       drm_sched.
>>>>>>
>>>>>>
>>>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all
>>>>>> or really needs to drive a re-factor.  (More on that later.)
>>>>>> There's only one real issue which is that it fires off potentially a
>>>>>> lot of kthreads. Even that's not that bad given that kthreads are
>>>>>> pretty light and you're not likely to have more kthreads than
>>>>>> userspace threads which are much heavier.  Not ideal, but not the
>>>>>> end of the world either.  Definitely something we can/should
>>>>>> optimize but if we went through with Xe without this patch, it would
>>>>>> probably be mostly ok.
>>>>>>
>>>>>>        >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>>>        >>
>>>>>>        >> I'm not really prepared to make large changes to DRM scheduler
>>>>>>       at the
>>>>>>        >> moment for Xe as they are not really required nor does Boris
>>>>>>       seem they
>>>>>>        >> will be required for his work either. I am interested to see
>>>>>>       what Boris
>>>>>>        >> comes up with.
>>>>>>        >>
>>>>>>        >>> Even on the low level, the idea to replace drm_sched threads
>>>>>>       with workers
>>>>>>        >>> has a few problems.
>>>>>>        >>>
>>>>>>        >>> To start with, the pattern of:
>>>>>>        >>>
>>>>>>        >>>    while (not_stopped) {
>>>>>>        >>>     keep picking jobs
>>>>>>        >>>    }
>>>>>>        >>>
>>>>>>        >>> Feels fundamentally in disagreement with workers (while
>>>>>>       obviously fits
>>>>>>        >>> perfectly with the current kthread design).
>>>>>>        >>
>>>>>>        >> The while loop breaks and worker exits if no jobs are ready.
>>>>>>
>>>>>>
>>>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>>>> better? One scheduling job per work item rather than one big work
>>>>>> item which handles all available jobs?
>>>>>
>>>>> Yes and no, it indeed IMO does not fit to have a work item which is
>>>>> potentially unbound in runtime. But it is a bit of a moot conceptual
>>>>> mismatch because it is a worst case / theoretical one, and I think
>>>>> due to more fundamental concerns.
>>>>>
>>>>> If we have to go back to the low level side of things, I've picked this
>>>>> random spot to consolidate what I have already mentioned and perhaps
>>>>> expand.
>>>>>
>>>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>>>
>>>>> """
>>>>> Generally, work items are not expected to hog a CPU and consume many
>>>>> cycles. That means maintaining just enough concurrency to prevent work
>>>>> processing from stalling should be optimal.
>>>>> """
>>>>>
>>>>> For unbound queues:
>>>>> """
>>>>> The responsibility of regulating concurrency level is on the users.
>>>>> """
>>>>>
>>>>> Given the unbound queues will be spawned on demand to service all queued
>>>>> work items (more interesting when mixing up with the system_unbound_wq),
>>>>> in the proposed design the number of instantiated worker threads does
>>>>> not correspond to the number of user threads (as you have elsewhere
>>>>> stated), but pessimistically to the number of active user contexts. That
>>>>> is the number which drives the maximum number of not-runnable jobs that
>>>>> can become runnable at once, and hence spawn that many work items, and
>>>>> in turn unbound worker threads.
>>>>>
>>>>> Several problems there.
>>>>>
>>>>> It is fundamentally pointless to have potentially that many more threads
>>>>> than the number of CPU cores - it simply creates a scheduling storm.
>>>>
>>>> To make matters worse, if I follow the code correctly, all these per user
>>>> context worker thread / work items end up contending on the same lock or
>>>> circular buffer, both are one instance per GPU:
>>>>
>>>> guc_engine_run_job
>>>>    -> submit_engine
>>>>       a) wq_item_append
>>>>           -> wq_wait_for_space
>>>>             -> msleep
>>>
>>> a) is dedicated per xe_engine
>>
>> Hah true, what is it for then? I thought throttling the LRCA ring is done via:
>>
> 
> This is a per guc_id 'work queue' which is used for parallel submission
> (e.g. multiple LRC tail values need to be written atomically by the GuC).
> Again in practice there should always be space.

Speaking of guc ids, where does blocking happen in the non-parallel case 
when none are available?

>>    drm_sched_init(&ge->sched, &drm_sched_ops,
>> 		 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,
>>
>> Is there something more to throttle other than the ring? It is throttling
>> something using msleeps..
>>
>>> Also you missed the step of programming the ring which is dedicated per xe_engine
>>
>> I was trying to quickly find places which serialize on something in the
>> backend, ringbuffer emission did not seem to do that but maybe I missed
>> something.
>>
> 
> xe_ring_ops vfunc emit_job is called to write the ring.

Right, but does it serialize between different contexts? I didn't spot 
that it does, in which case it wasn't relevant to the sub story.

>>>
>>>>       b) xe_guc_ct_send
>>>>           -> guc_ct_send
>>>>             -> mutex_lock(&ct->lock);
>>>>             -> later a potential msleep in h2g_has_room
>>>
>>> Technically there is 1 instance per GT not GPU, yes this is shared but
>>> in practice there will always be space in the CT channel so contention
>>> on the lock should be rare.
>>
>> Yeah I used the term GPU to be more understandable to outside audience.
>>
>> I am somewhat disappointed that the Xe opportunity hasn't been used to
>> improve upon the CT communication bottlenecks. I mean those backoff sleeps
>> and lock contention. I wish there would be a single thread in charge of the
>> CT channel and internal users (other parts of the driver) would be able to
>> send their requests to it in a more efficient manner, with less lock
>> contention and centralized backoff.
>>
> 
> Well the CT backend was more or less a complete rewrite. Mutexes
> actually work rather well to ensure fairness compared to the spin locks
> used in the i915. This code was pretty heavily reviewed by Daniel and
> both of us landed a big mutex for all of the CT code compared to the 3
> or 4 spin locks used in the i915.

Are the "nb" sends gone? But that aside, I wasn't meaning just the 
locking but the high level approach. Never mind.
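For the record, the kind of single-owner pattern I was gesturing at is 
roughly the following hypothetical sketch - all names are invented, the 
point is only that backoff and fairness live in one place instead of 
per-caller mutex plus msleep:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct example_ct_channel {
	spinlock_t lock;
	struct list_head pending;
	struct workqueue_struct *wq;	/* ordered: one flusher at a time */
	struct work_struct flush_work;	/* drains pending into the ring */
};

struct example_ct_request {
	struct list_head link;
	/* message payload, completion for the reply, etc. */
};

static void example_ct_send(struct example_ct_channel *ct,
			    struct example_ct_request *req)
{
	spin_lock(&ct->lock);
	list_add_tail(&req->link, &ct->pending);
	spin_unlock(&ct->lock);

	/* a single writer drains the list on behalf of all callers */
	queue_work(ct->wq, &ct->flush_work);
}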

>>> I haven't read your rather long reply yet, but also FWIW using a
>>> workqueue was suggested by AMD (original authors of the DRM scheduler)
>>> when we ran this design by them.
>>
>> Commit message says nothing about that. ;)
>>
> 
> Yea I missed that, will fix in the next rev. Just dug through my emails
> and Christian suggested a work queue and Andrey also gave some input on
> the DRM scheduler design.
> 
> Also in the next rev I will likely update the run_wq to be passed in by
> the user.

Yes, and IMO that may need to be non-optional.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11  9:09                           ` Tvrtko Ursulin
@ 2023-01-11 17:52                             ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 17:52 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, Jason Ekstrand, dri-devel

On Wed, Jan 11, 2023 at 09:09:45AM +0000, Tvrtko Ursulin wrote:
> 
> On 11/01/2023 01:13, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 04:39:00PM +0000, Matthew Brost wrote:
> > > On Tue, Jan 10, 2023 at 11:28:08AM +0000, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > >       >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > >      contexts (per
> > > > >       >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > >      invasive changes
> > > > >       >>> to the shared code is in the spirit of the overall idea and instead
> > > > >       >>> opportunity should be used to look at way to refactor/improve
> > > > >      drm_sched.
> > > > > 
> > > > > 
> > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > > > really needs to drive a re-factor.  (More on that later.)  There's only
> > > > > one real issue which is that it fires off potentially a lot of kthreads.
> > > > > Even that's not that bad given that kthreads are pretty light and you're
> > > > > not likely to have more kthreads than userspace threads which are much
> > > > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > > > something we can/should optimize but if we went through with Xe without
> > > > > this patch, it would probably be mostly ok.
> > > > > 
> > > > >       >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > >       >>
> > > > >       >> I'm not really prepared to make large changes to DRM scheduler
> > > > >      at the
> > > > >       >> moment for Xe as they are not really required nor does Boris
> > > > >      seem they
> > > > >       >> will be required for his work either. I am interested to see
> > > > >      what Boris
> > > > >       >> comes up with.
> > > > >       >>
> > > > >       >>> Even on the low level, the idea to replace drm_sched threads
> > > > >      with workers
> > > > >       >>> has a few problems.
> > > > >       >>>
> > > > >       >>> To start with, the pattern of:
> > > > >       >>>
> > > > >       >>>    while (not_stopped) {
> > > > >       >>>     keep picking jobs
> > > > >       >>>    }
> > > > >       >>>
> > > > >       >>> Feels fundamentally in disagreement with workers (while
> > > > >      obviously fits
> > > > >       >>> perfectly with the current kthread design).
> > > > >       >>
> > > > >       >> The while loop breaks and worker exits if no jobs are ready.
> > > > > 
> > > > > 
> > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > better? One scheduling job per work item rather than one big work item
> > > > > which handles all available jobs?
> > > > 
> > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > potentially unbound in runtime. But it is a bit of a moot conceptual
> > > > mismatch because it is a worst case / theoretical one, and I think
> > > > due to more fundamental concerns.
> > > > 
> > > > If we have to go back to the low level side of things, I've picked this
> > > > random spot to consolidate what I have already mentioned and perhaps expand.
> > > > 
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > 
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume many cycles.
> > > > That means maintaining just enough concurrency to prevent work processing
> > > > from stalling should be optimal.
> > > > """
> > > > 
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the users.
> > > > """
> > > > 
> > > > Given the unbound queues will be spawned on demand to service all queued
> > > > work items (more interesting when mixing up with the system_unbound_wq), in
> > > > the proposed design the number of instantiated worker threads does not
> > > > correspond to the number of user threads (as you have elsewhere stated), but
> > > > pessimistically to the number of active user contexts. That is the number
> > > > which drives the maximum number of not-runnable jobs that can become
> > > > runnable at once, and hence spawn that many work items, and in turn unbound
> > > > worker threads.
> > > > 
> > > > Several problems there.
> > > > 
> > > > It is fundamentally pointless to have potentially that many more threads
> > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > > 
> > > 
> > > We can use a different work queue if this is an issue, have a FIXME
> > > which indicates we should allow the user to pass in the work queue.
> > > 
> > > > Unbound workers have no CPU / cache locality either and no connection with
> > > > the CPU scheduler to optimize scheduling patterns. This may matter either on
> > > > large systems or on small ones. Whereas the current design allows for
> > > > scheduler to notice userspace CPU thread keeps waking up the same drm
> > > > scheduler kernel thread, and so it can keep them on the same CPU, the
> > > > unbound workers lose that ability and so 2nd CPU might be getting woken up
> > > > from low sleep for every submission.
> > > > 
> > > 
> > > I guess I don't understand kthread vs. workqueue scheduling internals.
> > 
> > Looked into this and we are not using unbound workers, rather we are just
> > using the system_wq which is indeed bound. Again we can change this so a
> > user can just pass in a worker too. After doing a bit of research, bound
> > workers allow the scheduler to use locality to avoid that exact
> > problem you're describing.
> > 
> > TL;DR I'm not buying any of these arguments although it is possible I am
> > missing something.
> 
> Well you told me it's using unbound.. message id
> Y7dEjcuc1arHBTGu@DUT025-TGLU.fm.intel.com:
> 
> """
> Right now the system_unbound_wq is used which does have a limit on the
> number of threads, right? I do have a FIXME to allow a worker to be
> passed in similar to TDR.
> """
> 

Yea, my mistake. A quick look at the code shows we are using system_wq
(same as TDR).

> With bound workers you will indeed get CPU locality. I am not sure what it
> will do in terms of concurrency. If it will serialize work items to fewer
> spawned workers that will be good for the CT contention issue, but may
> negatively affect latency. And possibly preemption / time slicing decisions
> since the order of submitting to the backend will not be in the order of
> context priority, hence high prio may be submitted right after low and
> immediately trigger preemption.
>

We should probably use system_highpri_wq for high priority contexts
(xe_engine).
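Roughly like this hypothetical sketch - the wrapper names are invented, 
while system_wq and system_highpri_wq are the real kernel queues:

#include <linux/workqueue.h>

struct example_sched {
	struct work_struct run_work;	/* the scheduler's run-job work */
};

static void example_queue_run(struct example_sched *sched, bool high_prio)
{
	struct workqueue_struct *wq = high_prio ? system_highpri_wq
						: system_wq;

	queue_work(wq, &sched->run_work);
}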
 
> Anyway, since you are not buying any arguments on paper perhaps you are more
> open towards testing. If you would adapt gem_wsim for Xe you would be able
> to spawn N simulated transcode sessions on any Gen11+ machine and try it
> out.
> 
> For example:
> 
> gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600
> 
> That will run you 36 parallel transcoding sessions for 600 frames
> each. No client setup needed whatsoever apart from compiling IGT.
> 
> In the past that was quite a handy tool to identify scheduling issues, or
> validate changes against. All workloads with the media prefix have actually
> been hand crafted by looking at what real media pipelines do with real data.
> A few years back at least.
> 

Porting this is non-trivial as this is 2.5k lines of code. Also in Xe we
are trending towards using UMD benchmarks to determine if there are
performance problems, as in the i915 we had tons of microbenchmarks /
IGT benchmarks that we found meant absolutely nothing. Can't say if this
benchmark falls into that category.

We have VK and compute benchmarks running and haven't found any major
issues yet. The media UMD hasn't been ported because of the VM bind
dependency, so I can't say if there are any issues with the media UMD + Xe.

What I can do is hack up xe_exec_threads to really hammer Xe - change it to
128x xe_engines + 8k execs per thread. Each exec is super simple, it
just stores a dword. It creates a thread per hardware engine, so on TGL
this is 5x threads.

Results below:
root@DUT025-TGLU:mbrost# xe_exec_threads --r threads-basic
IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
Starting subtest: threads-basic
Subtest threads-basic: SUCCESS (1.215s)
root@DUT025-TGLU:mbrost# dumptrace | grep job | wc
  40960  491520 7401728
root@DUT025-TGLU:mbrost# dumptrace | grep engine | wc
    645    7095   82457

So with 640 xe_engines (the trace shows 645 engines as 5x of them are VM
engines) it takes 1.215 seconds of test time to run 40960 execs (5
threads * 8192 execs each). That seems to indicate we do not have a
scheduling problem.

This is 8 core (or at least 8 threads) TGL:

root@DUT025-TGLU:mbrost# cat /proc/cpuinfo
...
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 140
model name      : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
stepping        : 1
microcode       : 0x3a
cpu MHz         : 2344.098
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
...

Enough data to be convinced there is no issue with this design? I can
also hack up Xe to use fewer GPU schedulers with kthreads but again that
isn't trivial and doesn't seem necessary based on these results.

> It could show you real world behaviour of the kworkers approach and it could
> also enable you to cross reference any power and performance changes
> relative to i915. Background story there is that media servers like to fit N
> streams to a server and if a change comes along which suddenly makes only
> N-1 stream fit before dropping out of realtime, that's a big problem.
> 
> If you will believe me there is value in that kind of testing I am happy to
> help you add Xe support to the tool, time permitting so possibly guidance
> only at the moment.

If we want to port the tool I won't stop you and will provide support if
you struggle with the uAPI, but based on my results above I don't think
this is necessary.

Matt

> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11  9:17                               ` Tvrtko Ursulin
@ 2023-01-11 18:07                                 ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 18:07 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, Jason Ekstrand

On Wed, Jan 11, 2023 at 09:17:01AM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 19:01, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 10/01/2023 15:55, Matthew Brost wrote:
> > > > On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > 
> > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > > > 
> > > > > > [snip]
> > > > > > 
> > > > > > >        >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > > > >       contexts (per
> > > > > > >        >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > > > >       invasive changes
> > > > > > >        >>> to the shared code is in the spirit of the overall idea and
> > > > > > > instead
> > > > > > >        >>> opportunity should be used to look at way to refactor/improve
> > > > > > >       drm_sched.
> > > > > > > 
> > > > > > > 
> > > > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > > > There's only one real issue which is that it fires off potentially a
> > > > > > > lot of kthreads. Even that's not that bad given that kthreads are
> > > > > > > pretty light and you're not likely to have more kthreads than
> > > > > > > userspace threads which are much heavier.  Not ideal, but not the
> > > > > > > end of the world either.  Definitely something we can/should
> > > > > > > optimize but if we went through with Xe without this patch, it would
> > > > > > > probably be mostly ok.
> > > > > > > 
> > > > > > >        >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > > > >        >>
> > > > > > >        >> I'm not really prepared to make large changes to DRM scheduler
> > > > > > >       at the
> > > > > > >        >> moment for Xe as they are not really required nor does Boris
> > > > > > >       seem they
> > > > > > >        >> will be required for his work either. I am interested to see
> > > > > > >       what Boris
> > > > > > >        >> comes up with.
> > > > > > >        >>
> > > > > > >        >>> Even on the low level, the idea to replace drm_sched threads
> > > > > > >       with workers
> > > > > > >        >>> has a few problems.
> > > > > > >        >>>
> > > > > > >        >>> To start with, the pattern of:
> > > > > > >        >>>
> > > > > > >        >>>    while (not_stopped) {
> > > > > > >        >>>     keep picking jobs
> > > > > > >        >>>    }
> > > > > > >        >>>
> > > > > > >        >>> Feels fundamentally in disagreement with workers (while
> > > > > > >       obviously fits
> > > > > > >        >>> perfectly with the current kthread design).
> > > > > > >        >>
> > > > > > >        >> The while loop breaks and worker exits if no jobs are ready.
> > > > > > > 
> > > > > > > 
> > > > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > > > better? One scheduling job per work item rather than one big work
> > > > > > > item which handles all available jobs?
> > > > > > 
> > > > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > > > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > > > > > because it is a worst case / theoretical, and I think due more
> > > > > > fundamental concerns.
> > > > > > 
> > > > > > If we have to go back to the low level side of things, I've picked this
> > > > > > random spot to consolidate what I have already mentioned and perhaps
> > > > > > expand.
> > > > > > 
> > > > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > > > 
> > > > > > """
> > > > > > Generally, work items are not expected to hog a CPU and consume many
> > > > > > cycles. That means maintaining just enough concurrency to prevent work
> > > > > > processing from stalling should be optimal.
> > > > > > """
> > > > > > 
> > > > > > For unbound queues:
> > > > > > """
> > > > > > The responsibility of regulating concurrency level is on the users.
> > > > > > """
> > > > > > 
> > > > > > Given the unbound queues will be spawned on demand to service all queued
> > > > > > work items (more interesting when mixing up with the system_unbound_wq),
> > > > > > in the proposed design the number of instantiated worker threads does
> > > > > > not correspond to the number of user threads (as you have elsewhere
> > > > > > stated), but pessimistically to the number of active user contexts. That
> > > > > > is the number which drives the maximum number of not-runnable jobs that
> > > > > > can become runnable at once, and hence spawn that many work items, and
> > > > > > in turn unbound worker threads.
> > > > > > 
> > > > > > Several problems there.
> > > > > > 
> > > > > > It is fundamentally pointless to have potentially that many more threads
> > > > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > > > 
> > > > > To make matters worse, if I follow the code correctly, all these per user
> > > > > context worker thread / work items end up contending on the same lock or
> > > > > circular buffer, both are one instance per GPU:
> > > > > 
> > > > > guc_engine_run_job
> > > > >    -> submit_engine
> > > > >       a) wq_item_append
> > > > >           -> wq_wait_for_space
> > > > >             -> msleep
> > > > 
> > > > a) is dedicated per xe_engine
> > > 
> > > Hah true, what's it for then? I thought throttling the LRCA ring is done via:
> > > 
> > 
> > This is a per guc_id 'work queue' which is used for parallel submission
> > (e.g. multiple LRC tail values need to be written atomically by the GuC).
> > Again in practice there should always be space.
> 
> Speaking of guc id, where does blocking when none are available happen in
> the non parallel case?
> 

We have 64k guc_ids on native, 1k guc_ids with 64k VFs. Either way we
think that is more than enough and can just reject xe_engine creation if
we run out of guc_ids. If this proves to be false, we can fix this, but
the guc_id stealing in the i915 is rather complicated and hopefully not
needed.

We will limit the number of guc_ids allowed per user pid to a reasonable
number to prevent a DoS. Elevated pids (e.g. IGTs) will be able to do
whatever they want.
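
The per-pid limit could be as simple as this (a sketch of the idea
only; the names and the exact capability check are assumptions, not
existing Xe code):

static int xe_guc_id_quota_get(struct xe_file *xef)
{
        /* Elevated processes (e.g. IGTs) bypass the limit */
        if (capable(CAP_SYS_ADMIN))
                return 0;

        /* Reject xe_engine creation rather than steal guc_ids */
        if (atomic_inc_return(&xef->guc_id_count) > XE_GUC_ID_PER_PID_MAX) {
                atomic_dec(&xef->guc_id_count);
                return -EBUSY;
        }

        return 0;
}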

> > >    drm_sched_init(&ge->sched, &drm_sched_ops,
> > > 		 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,
> > > 
> > > Is there something more to throttle other than the ring? It is throttling
> > > something using msleeps..
> > > 
> > > > Also you missed the step of programming the ring which is dedicated per xe_engine
> > > 
> > > I was trying to quickly find places which serialize on something in the
> > > backend, ringbuffer emission did not seem to do that but maybe I missed
> > > something.
> > > 
> > 
> > xe_ring_ops vfunc emit_job is called to write the ring.
> 
> Right but does it serialize between different contexts, I didn't spot that
> it does in which case it wasn't relevant to the sub story.
>

Right, just saying this is an additional step that is done in parallel
between xe_engines.
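
To summarize my reading of the submission path pieced together in this
thread, with the serialization of each step noted:

guc_engine_run_job
 -> emit_job            (per xe_engine ring write, parallel between engines)
 -> submit_engine
      a) wq_item_append (per guc_id work queue, parallel submission only)
      b) xe_guc_ct_send (per-GT CT channel, shared ct->lock)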
 
> > > > 
> > > > >       b) xe_guc_ct_send
> > > > >           -> guc_ct_send
> > > > >             -> mutex_lock(&ct->lock);
> > > > >             -> later a potential msleep in h2g_has_room
> > > > 
> > > > Technically there is 1 instance per GT not GPU, yes this is shared but
> > > > in practice there will always be space in the CT channel so contention
> > > > on the lock should be rare.
> > > 
> > > Yeah I used the term GPU to be more understandable to outside audience.
> > > 
> > > I am somewhat disappointed that the Xe opportunity hasn't been used to
> > > improve upon the CT communication bottlenecks. I mean those backoff sleeps
> > > and lock contention. I wish there would be a single thread in charge of the
> > > CT channel and internal users (other parts of the driver) would be able to
> > > send their requests to it in a more efficient manner, with less lock
> > > contention and centralized backoff.
> > > 
> > 
> > Well the CT backend was more or less a complete rewrite. Mutexes
> > actually work rather well to ensure fairness compared to the spin locks
> > used in the i915. This code was pretty heavily reviewed by Daniel and
> > both of us landed a big mutex for all of the CT code compared to the 3
> > or 4 spin locks used in the i915.
> 
> Are the "nb" sends gone? But that aside, I wasn't meaning just the locking
> but the high level approach. Never mind.
>

xe_guc_ct_send is non-blocking, xe_guc_ct_send_block is blocking. I
don't think the latter is used yet.
 
> > > > I haven't read your rather long reply yet, but also FWIW using a
> > > > workqueue was suggested by AMD (original authors of the DRM scheduler)
> > > > when we ran this design by them.
> > > 
> > > Commit message says nothing about that. ;)
> > > 
> > 
> > Yea I missed that, will fix in the next rev. Just dug through my emails
> > and Christian suggested a work queue and Andrey also gave some input on
> > the DRM scheduler design.
> > 
> > Also in the next rev I will likely update the run_wq to be passed in
> > by the user.
> 
> Yes, and IMO that may need to be non-optional.
>

Yea, will fix.
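
Roughly this shape (a hypothetical signature change for illustration;
drm_sched_init() does not take a workqueue argument today):

        struct workqueue_struct *run_wq =
                alloc_ordered_workqueue("xe-sched", 0);

        drm_sched_init(&ge->sched, &drm_sched_ops, run_wq,
                       e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES, ...);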

Matt
 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 18:07                                 ` Matthew Brost
  (?)
@ 2023-01-11 18:52                                 ` John Harrison
  2023-01-11 18:55                                     ` Matthew Brost
  -1 siblings, 1 reply; 161+ messages in thread
From: John Harrison @ 2023-01-11 18:52 UTC (permalink / raw)
  To: Matthew Brost, Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On 1/11/2023 10:07, Matthew Brost wrote:
> On Wed, Jan 11, 2023 at 09:17:01AM +0000, Tvrtko Ursulin wrote:
>> On 10/01/2023 19:01, Matthew Brost wrote:
>>> On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
>>>> On 10/01/2023 15:55, Matthew Brost wrote:
>>>>> On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
>>>>>> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
>>>>>>> On 09/01/2023 17:27, Jason Ekstrand wrote:
>>>>>>>
>>>>>>> [snip]
>>>>>>>
>>>>>>>>         >>> AFAICT it proposes to have 1:1 between *userspace* created
>>>>>>>>        contexts (per
>>>>>>>>         >>> context _and_ engine) and drm_sched. I am not sure avoiding
>>>>>>>>        invasive changes
>>>>>>>>         >>> to the shared code is in the spirit of the overall idea and
>>>>>>>> instead
>>>>>>>>         >>> opportunity should be used to look at way to refactor/improve
>>>>>>>>        drm_sched.
>>>>>>>>
>>>>>>>>
>>>>>>>> Maybe?  I'm not convinced that what Xe is doing is an abuse at all
>>>>>>>> or really needs to drive a re-factor.  (More on that later.)
>>>>>>>> There's only one real issue which is that it fires off potentially a
>>>>>>>> lot of kthreads. Even that's not that bad given that kthreads are
>>>>>>>> pretty light and you're not likely to have more kthreads than
>>>>>>>> userspace threads which are much heavier.  Not ideal, but not the
>>>>>>>> end of the world either.  Definitely something we can/should
>>>>>>>> optimize but if we went through with Xe without this patch, it would
>>>>>>>> probably be mostly ok.
>>>>>>>>
>>>>>>>>         >> Yes, it is 1:1 *userspace* engines and drm_sched.
>>>>>>>>         >>
>>>>>>>>         >> I'm not really prepared to make large changes to DRM scheduler
>>>>>>>>        at the
>>>>>>>>         >> moment for Xe as they are not really required nor does Boris
>>>>>>>>        seem they
>>>>>>>>         >> will be required for his work either. I am interested to see
>>>>>>>>        what Boris
>>>>>>>>         >> comes up with.
>>>>>>>>         >>
>>>>>>>>         >>> Even on the low level, the idea to replace drm_sched threads
>>>>>>>>        with workers
>>>>>>>>         >>> has a few problems.
>>>>>>>>         >>>
>>>>>>>>         >>> To start with, the pattern of:
>>>>>>>>         >>>
>>>>>>>>         >>>    while (not_stopped) {
>>>>>>>>         >>>     keep picking jobs
>>>>>>>>         >>>    }
>>>>>>>>         >>>
>>>>>>>>         >>> Feels fundamentally in disagreement with workers (while
>>>>>>>>        obviously fits
>>>>>>>>         >>> perfectly with the current kthread design).
>>>>>>>>         >>
>>>>>>>>         >> The while loop breaks and worker exits if no jobs are ready.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not very familiar with workqueues. What are you saying would fit
>>>>>>>> better? One scheduling job per work item rather than one big work
>>>>>>>> item which handles all available jobs?
>>>>>>> Yes and no, it indeed IMO does not fit to have a work item which is
>>>>>>> potentially unbound in runtime. But it is a bit moot conceptual mismatch
>>>>>>> because it is a worst case / theoretical, and I think due more
>>>>>>> fundamental concerns.
>>>>>>>
>>>>>>> If we have to go back to the low level side of things, I've picked this
>>>>>>> random spot to consolidate what I have already mentioned and perhaps
>>>>>>> expand.
>>>>>>>
>>>>>>> To start with, let me pull out some thoughts from workqueue.rst:
>>>>>>>
>>>>>>> """
>>>>>>> Generally, work items are not expected to hog a CPU and consume many
>>>>>>> cycles. That means maintaining just enough concurrency to prevent work
>>>>>>> processing from stalling should be optimal.
>>>>>>> """
>>>>>>>
>>>>>>> For unbound queues:
>>>>>>> """
>>>>>>> The responsibility of regulating concurrency level is on the users.
>>>>>>> """
>>>>>>>
>>>>>>> Given the unbound queues will be spawned on demand to service all queued
>>>>>>> work items (more interesting when mixing up with the system_unbound_wq),
>>>>>>> in the proposed design the number of instantiated worker threads does
>>>>>>> not correspond to the number of user threads (as you have elsewhere
>>>>>>> stated), but pessimistically to the number of active user contexts. That
>>>>>>> is the number which drives the maximum number of not-runnable jobs that
>>>>>>> can become runnable at once, and hence spawn that many work items, and
>>>>>>> in turn unbound worker threads.
>>>>>>>
>>>>>>> Several problems there.
>>>>>>>
>>>>>>> It is fundamentally pointless to have potentially that many more threads
>>>>>>> than the number of CPU cores - it simply creates a scheduling storm.
>>>>>> To make matters worse, if I follow the code correctly, all these per user
>>>>>> context worker thread / work items end up contending on the same lock or
>>>>>> circular buffer, both are one instance per GPU:
>>>>>>
>>>>>> guc_engine_run_job
>>>>>>     -> submit_engine
>>>>>>        a) wq_item_append
>>>>>>            -> wq_wait_for_space
>>>>>>              -> msleep
>>>>> a) is dedicated per xe_engine
>>>> Hah true, what's it for then? I thought throttling the LRCA ring is done via:
>>>>
>>> This is a per guc_id 'work queue' which is used for parallel submission
>>> (e.g. multiple LRC tail values need to be written atomically by the GuC).
>>> Again in practice there should always be space.
>> Speaking of guc id, where does blocking when none are available happen in
>> the non parallel case?
>>
> We have 64k guc_ids on native, 1k guc_ids with 64k VFs. Either way we
> think that is more than enough and can just reject xe_engine creation if
> we run out of guc_ids. If this proves to be false, we can fix this, but
> the guc_id stealing in the i915 is rather complicated and hopefully not
> needed.
>
> We will limit the number of guc_ids allowed per user pid to a reasonable
> number to prevent a DoS. Elevated pids (e.g. IGTs) will be able to do
> whatever they want.
What about doorbells? At some point, we will have to start using those
and they are a much more limited resource - 256 total and way less with VFs.

John.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 18:52                                 ` John Harrison
@ 2023-01-11 18:55                                     ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 18:55 UTC (permalink / raw)
  To: John Harrison; +Cc: Tvrtko Ursulin, intel-gfx, dri-devel

On Wed, Jan 11, 2023 at 10:52:54AM -0800, John Harrison wrote:
> On 1/11/2023 10:07, Matthew Brost wrote:
> > On Wed, Jan 11, 2023 at 09:17:01AM +0000, Tvrtko Ursulin wrote:
> > > On 10/01/2023 19:01, Matthew Brost wrote:
> > > > On Tue, Jan 10, 2023 at 04:50:55PM +0000, Tvrtko Ursulin wrote:
> > > > > On 10/01/2023 15:55, Matthew Brost wrote:
> > > > > > On Tue, Jan 10, 2023 at 12:19:35PM +0000, Tvrtko Ursulin wrote:
> > > > > > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > > > > > 
> > > > > > > > [snip]
> > > > > > > > 
> > > > > > > > >         >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > > > > > >        contexts (per
> > > > > > > > >         >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > > > > > >        invasive changes
> > > > > > > > >         >>> to the shared code is in the spirit of the overall idea and
> > > > > > > > > instead
> > > > > > > > >         >>> opportunity should be used to look at way to refactor/improve
> > > > > > > > >        drm_sched.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > > > > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > > > > > There's only one real issue which is that it fires off potentially a
> > > > > > > > > lot of kthreads. Even that's not that bad given that kthreads are
> > > > > > > > > pretty light and you're not likely to have more kthreads than
> > > > > > > > > userspace threads which are much heavier.  Not ideal, but not the
> > > > > > > > > end of the world either.  Definitely something we can/should
> > > > > > > > > optimize but if we went through with Xe without this patch, it would
> > > > > > > > > probably be mostly ok.
> > > > > > > > > 
> > > > > > > > >         >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > > > > > >         >>
> > > > > > > > >         >> I'm not really prepared to make large changes to DRM scheduler
> > > > > > > > >        at the
> > > > > > > > >         >> moment for Xe as they are not really required nor does Boris
> > > > > > > > >        seem they
> > > > > > > > >         >> will be required for his work either. I am interested to see
> > > > > > > > >        what Boris
> > > > > > > > >         >> comes up with.
> > > > > > > > >         >>
> > > > > > > > >         >>> Even on the low level, the idea to replace drm_sched threads
> > > > > > > > >        with workers
> > > > > > > > >         >>> has a few problems.
> > > > > > > > >         >>>
> > > > > > > > >         >>> To start with, the pattern of:
> > > > > > > > >         >>>
> > > > > > > > >         >>>    while (not_stopped) {
> > > > > > > > >         >>>     keep picking jobs
> > > > > > > > >         >>>    }
> > > > > > > > >         >>>
> > > > > > > > >         >>> Feels fundamentally in disagreement with workers (while
> > > > > > > > >        obviously fits
> > > > > > > > >         >>> perfectly with the current kthread design).
> > > > > > > > >         >>
> > > > > > > > >         >> The while loop breaks and worker exits if no jobs are ready.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > > > > > better? One scheduling job per work item rather than one big work
> > > > > > > > > item which handles all available jobs?
> > > > > > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > > > > > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > > > > > > > because it is a worst case / theoretical, and I think due more
> > > > > > > > fundamental concerns.
> > > > > > > > 
> > > > > > > > If we have to go back to the low level side of things, I've picked this
> > > > > > > > random spot to consolidate what I have already mentioned and perhaps
> > > > > > > > expand.
> > > > > > > > 
> > > > > > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > > > > > 
> > > > > > > > """
> > > > > > > > Generally, work items are not expected to hog a CPU and consume many
> > > > > > > > cycles. That means maintaining just enough concurrency to prevent work
> > > > > > > > processing from stalling should be optimal.
> > > > > > > > """
> > > > > > > > 
> > > > > > > > For unbound queues:
> > > > > > > > """
> > > > > > > > The responsibility of regulating concurrency level is on the users.
> > > > > > > > """
> > > > > > > > 
> > > > > > > > Given the unbound queues will be spawned on demand to service all queued
> > > > > > > > work items (more interesting when mixing up with the system_unbound_wq),
> > > > > > > > in the proposed design the number of instantiated worker threads does
> > > > > > > > not correspond to the number of user threads (as you have elsewhere
> > > > > > > > stated), but pessimistically to the number of active user contexts. That
> > > > > > > > is the number which drives the maximum number of not-runnable jobs that
> > > > > > > > can become runnable at once, and hence spawn that many work items, and
> > > > > > > > in turn unbound worker threads.
> > > > > > > > 
> > > > > > > > Several problems there.
> > > > > > > > 
> > > > > > > > It is fundamentally pointless to have potentially that many more threads
> > > > > > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > > > > > To make matters worse, if I follow the code correctly, all these per user
> > > > > > > context worker thread / work items end up contending on the same lock or
> > > > > > > circular buffer, both are one instance per GPU:
> > > > > > > 
> > > > > > > guc_engine_run_job
> > > > > > >     -> submit_engine
> > > > > > >        a) wq_item_append
> > > > > > >            -> wq_wait_for_space
> > > > > > >              -> msleep
> > > > > > a) is dedicated per xe_engine
> > > > > Hah true, what's it for then? I thought throttling the LRCA ring is done via:
> > > > > 
> > > > This is a per guc_id 'work queue' which is used for parallel submission
> > > > (e.g. multiple LRC tail values need to be written atomically by the GuC).
> > > > Again in practice there should always be space.
> > > Speaking of guc id, where does blocking when none are available happen in
> > > the non parallel case?
> > > 
> > We have 64k guc_ids on native, 1k guc_ids with 64k VFs. Either way we
> > think that is more than enough and can just reject xe_engine creation if
> > we run out of guc_ids. If this proves to be false, we can fix this, but
> > the guc_id stealing in the i915 is rather complicated and hopefully not
> > needed.
> > 
> > We will limit the number of guc_ids allowed per user pid to a reasonable
> > number to prevent a DoS. Elevated pids (e.g. IGTs) will be able to do
> > whatever they want.
> What about doorbells? At some point, we will have to start using those and
> they are a much more limited resource - 256 total and way less with VFs.
> 

We haven't thought about that one yet; we will figure it out when we
implement doorbell support.
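
When we do, an ida over the doorbell space with a fallback to the
current submission path would likely be enough (a sketch with assumed
names; nothing like this exists in Xe yet):

#define GUC_NUM_DOORBELLS       256     /* fewer with VFs */

static int xe_guc_doorbell_alloc(struct xe_guc *guc)
{
        int db = ida_alloc_max(&guc->db_ida, GUC_NUM_DOORBELLS - 1,
                               GFP_KERNEL);

        /* Out of doorbells - fall back to doorbell-less submission */
        if (db < 0)
                return -ENOSPC;

        return db;
}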

Matt

> John.
> 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11  8:50                         ` Tvrtko Ursulin
@ 2023-01-11 19:40                           ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 19:40 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel, Jason Ekstrand

On Wed, Jan 11, 2023 at 08:50:37AM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com <mailto:tvrtko.ursulin@linux.intel.com>>
> > wrote:
> > 
> > 
> > 
> >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> >     [snip]
> > 
> >      >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> >      >     contexts (per
> >      >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> >      >     invasive changes
> >      >      >>> to the shared code is in the spirit of the overall idea
> >     and instead
> >      >      >>> opportunity should be used to look at way to
> >     refactor/improve
> >      >     drm_sched.
> >      >
> >      >
> >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> >     all or
> >      > really needs to drive a re-factor.  (More on that later.)
> > There's only
> >      > one real issue which is that it fires off potentially a lot of
> >     kthreads.
> >      > Even that's not that bad given that kthreads are pretty light and
> >     you're
> >      > not likely to have more kthreads than userspace threads which are
> >     much
> >      > heavier.  Not ideal, but not the end of the world either.
> > Definitely
> >      > something we can/should optimize but if we went through with Xe
> >     without
> >      > this patch, it would probably be mostly ok.
> >      >
> >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >      >>
> >      >      >> I'm not really prepared to make large changes to DRM
> >     scheduler
> >      >     at the
> >      >      >> moment for Xe as they are not really required nor does Boris
> >      >     seem they
> >      >      >> will be required for his work either. I am interested to see
> >      >     what Boris
> >      >      >> comes up with.
> >      >      >>
> >      >      >>> Even on the low level, the idea to replace drm_sched threads
> >      >     with workers
> >      >      >>> has a few problems.
> >      >      >>>
> >      >      >>> To start with, the pattern of:
> >      >      >>>
> >      >      >>>    while (not_stopped) {
> >      >      >>>     keep picking jobs
> >      >      >>>    }
> >      >      >>>
> >      >      >>> Feels fundamentally in disagreement with workers (while
> >      >     obviously fits
> >      >      >>> perfectly with the current kthread design).
> >      >      >>
> >      >      >> The while loop breaks and worker exits if no jobs are ready.
> >      >
> >      >
> >      > I'm not very familiar with workqueues. What are you saying would fit
> >      > better? One scheduling job per work item rather than one big work
> >     item
> >      > which handles all available jobs?
> > 
> >     Yes and no, it indeed IMO does not fit to have a work item which is
> >     potentially unbound in runtime. But it is a bit moot conceptual
> >     mismatch
> >     because it is a worst case / theoretical, and I think due more
> >     fundamental concerns.
> > 
> >     If we have to go back to the low level side of things, I've picked this
> >     random spot to consolidate what I have already mentioned and perhaps
> >     expand.
> > 
> >     To start with, let me pull out some thoughts from workqueue.rst:
> > 
> >     """
> >     Generally, work items are not expected to hog a CPU and consume many
> >     cycles. That means maintaining just enough concurrency to prevent work
> >     processing from stalling should be optimal.
> >     """
> > 
> >     For unbound queues:
> >     """
> >     The responsibility of regulating concurrency level is on the users.
> >     """
> > 
> >     Given the unbound queues will be spawned on demand to service all
> >     queued
> >     work items (more interesting when mixing up with the
> >     system_unbound_wq),
> >     in the proposed design the number of instantiated worker threads does
> >     not correspond to the number of user threads (as you have elsewhere
> >     stated), but pessimistically to the number of active user contexts.
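For completeness, an unbound queue's concurrency can be capped by its
creator via the max_active argument to alloc_workqueue(), e.g.:

        /* At most 4 items of this queue run concurrently ("xe-sched"
         * is an illustrative name, not an actual queue in the series).
         */
        struct workqueue_struct *wq =
                alloc_workqueue("xe-sched", WQ_UNBOUND, 4);

which is one way a driver could regulate this itself instead of
inheriting system_unbound_wq behaviour.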
> > 
> > 
> > Those are pretty much the same in practice.  Rather, user threads is
> > typically an upper bound on the number of contexts.  Yes, a single user
> > thread could have a bunch of contexts but basically nothing does that
> > except IGT.  In real-world usage, it's at most one context per user
> > thread.
> 
> Typically is the key here. But I am not sure it is good enough. Consider
> this example - Intel Flex 170:
> 
>  * Delivers up to 36 streams 1080p60 transcode throughput per card.
>  * When scaled to 10 cards in a 4U server configuration, it can support up
> to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> 
> One transcode stream from my experience typically is 3-4 GPU contexts
> (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single CPU
> thread. 4 contexts * 36 streams = 144 active contexts. Multiply by 60fps =
> 8640 jobs submitted and completed per second.
> 

See my reply with my numbers based on running xe_exec_threads: on a TGL
we are getting 33711 jobs per sec with 640 xe_engines. This seems to
scale just fine.

> 144 active contexts in the proposed scheme possibly means 144 kernel
> worker threads spawned (driven by 36 transcode CPU threads). (I don't think
> the pools would scale down given all are constantly pinged at 60fps.)
> 
> And then each of 144 threads goes to grab the single GuC CT mutex. First
> threads are being made schedulable, then put to sleep as mutex contention is
> hit, then woken again as mutexes are getting released, rinse, repeat.
> 
> (And yes this backend contention is there regardless of 1:1:1, it would
> require a different re-design to solve that. But it is just a question
> whether there are 144 contending threads, or just 6 with the thread per
> engine class scheme.)
> 
> Then multiply all by 10 for a 4U server use case and you get 1440 worker
> kthreads, yes 10 more CT locks, but contending on how many CPU cores? Just
> so they can grab a timeslice and maybe contend on a mutex as the next step.
>

Same as above, this seems to scale just fine as I bet the above example
of 33711 jobs per sec is limited by a single GuC context switching rather
than by Xe being able to feed the GuC. Also, a server in this
configuration is certainly going to have a CPU much faster than the TGL
I was using.

Also did another quick change to use 1280 xe_engines in xe_exec_threads:
root@DUT025-TGLU:igt-gpu-tools# xe_exec_threads --r threads-basic
IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
Starting subtest: threads-basic
Subtest threads-basic: SUCCESS (1.198s)

More or less same results as 640 xe_engines.
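For scale against the Flex 170 example: 4 contexts * 36 streams * 60 fps
= 8640 jobs per second required, so 33711 jobs per second even on a TGL
leaves roughly 4x headroom.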
 
> This example is where it would hurt on large systems. Imagine only an even
> wider media transcode card...
> 
> Second example is only a single engine class used (3d desktop?) but with a
> bunch of not-runnable jobs queued and waiting on a fence to signal. Implicit
> or explicit dependencies, it doesn't matter. Then the fence signals and
> callbacks run. N work items get scheduled, but they all submit to the same HW
> engine. So we end up with:
> 
>         /-- wi1 --\
>        / ..     .. \
>  cb --+---  wi.. ---+-- rq1 -- .. -- rqN
>        \ ..    ..  /
>         \-- wiN --/
> 
> 
> All that we have achieved is waking up N CPUs to contend on the same lock
> and effectively insert the job into the same single HW queue. I don't see
> any positives there.
>

I've said this before: the CT channel in practice isn't going to be full,
so the section of code protected by the mutex is really, really small.
The mutex really shouldn't ever have contention. Also, does a mutex spin
for a small period of time before going to sleep? I seem to recall some
type of core lock doing this; if we can use a lock that spins for a short
period of time, this argument falls apart.
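(For reference, kernel mutexes already do optimistic spinning while the
lock owner is running on another CPU, via CONFIG_MUTEX_SPIN_ON_OWNER,
so a briefly-held mutex rarely puts waiters to sleep.)

A minimal sketch of the kind of critical section being described, with
illustrative names rather than the actual Xe functions:

        static int guc_ct_send_locked(struct xe_guc_ct *ct,
                                      const u32 *action, u32 len)
        {
                int ret;

                mutex_lock(&ct->lock);
                ret = ct_ring_reserve(ct, len);         /* assumed helper */
                if (!ret)
                        ct_ring_write(ct, action, len); /* copy a few dwords */
                mutex_unlock(&ct->lock);

                return ret;
        }

The hold time is a bounds check plus a small copy, which is why
contention is expected to be negligible when the channel isn't full.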
 
> This example I think can particularly hurt small / low power devices because
> of needless waking up of many cores for no benefit. Granted, I don't have a
> good feel on how common this pattern is in practice.
> 
> > 
> >     That
> >     is the number which drives the maximum number of not-runnable jobs that
> >     can become runnable at once, and hence spawn that many work items, and
> >     in turn unbound worker threads.
> > 
> >     Several problems there.
> > 
> >     It is fundamentally pointless to have potentially that many more
> >     threads
> >     than the number of CPU cores - it simply creates a scheduling storm.
> > 
> >     Unbound workers have no CPU / cache locality either and no connection
> >     with the CPU scheduler to optimize scheduling patterns. This may matter
> >     either on large systems or on small ones. Whereas the current design
> >     allows for scheduler to notice userspace CPU thread keeps waking up the
> >     same drm scheduler kernel thread, and so it can keep them on the same
> >     CPU, the unbound workers lose that ability and so 2nd CPU might be
> >     getting woken up from low sleep for every submission.
> > 
> >     Hence, apart from being a bit of an impedance mismatch, the proposal has
> >     the potential to change performance and power patterns on both large
> >     and small machines.
> > 
> > 
> > Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
> > deferred kwork does appear to mismatch somewhat with what the scheduler
> > needs or at least how it's worked in the past.  How much impact will
> > that mismatch have?  Unclear.
> > 
> >      >      >>> Secondly, it probably demands separate workers (not
> >     optional),
> >      >     otherwise
> >      >      >>> behaviour of shared workqueues has either the potential to
> >      >     explode number
> >      >      >>> kernel threads anyway, or add latency.
> >      >      >>>
> >      >      >>
> >      >      >> Right now the system_unbound_wq is used which does have a
> >     limit
> >      >     on the
> >      >      >> number of threads, right? I do have a FIXME to allow a
> >     worker to be
> >      >      >> passed in similar to TDR.
> >      >      >>
> >      >      >> WRT to latency, the 1:1 ratio could actually have lower
> >     latency
> >      >     as 2 GPU
> >      >      >> schedulers can be pushing jobs into the backend / cleaning up
> >      >     jobs in
> >      >      >> parallel.
> >      >      >>
> >      >      >
> >      >      > Thought of one more point here where why in Xe we
> >     absolutely want
> >      >     a 1 to
> >      >      > 1 ratio between entity and scheduler - the way we implement
> >      >     timeslicing
> >      >      > for preempt fences.
> >      >      >
> >      >      > Let me try to explain.
> >      >      >
> >      >      > Preempt fences are implemented via the generic messaging
> >      >      > interface [1] with suspend / resume messages. If a suspend
> >      >      > message is received too soon after calling resume (this is
> >      >      > per entity) we simply sleep in the suspend call, thus giving
> >      >      > the entity a timeslice. This completely falls apart with a
> >      >      > many to 1 relationship as now an entity waiting for a
> >      >      > timeslice blocks the other entities. Could we work around
> >      >      > this, sure, but it's just another bunch of code we'd have to
> >      >      > add in Xe. Being able to freely sleep in the backend without
> >      >      > affecting other entities is really, really nice IMO and I bet
> >      >      > Xe isn't the only driver that is going to feel this way.
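A minimal sketch of that timeslicing trick, with assumed names (not the
actual Xe code):

        static void guc_engine_suspend(struct xe_engine *e)
        {
                s64 ms = ktime_ms_delta(ktime_get(), e->resume_time);

                /*
                 * Entity resumed very recently: sleep in place so it
                 * gets a minimum timeslice before being suspended.
                 * Only safe because this backend serves one entity.
                 */
                if (ms < XE_MIN_TIMESLICE_MS)   /* assumed constant */
                        msleep(XE_MIN_TIMESLICE_MS - ms);

                /* ... then send the suspend message to the GuC ... */
        }

With a many-to-1 mapping the msleep() would stall every other entity
sharing that scheduler, which is the problem being described.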
> >      >      >
> >      >      > Last thing I'll say regardless of how anyone feels about
> >     Xe using
> >      >     a 1 to
> >      >      > 1 relationship this patch IMO makes sense as I hope we can all
> >      >     agree a
> >      >      > workqueue scales better than kthreads.
> >      >
> >      >     I don't know for sure what will scale better and for what use
> >     case,
> >      >     combination of CPU cores vs number of GPU engines to keep
> >     busy vs other
> >      >     system activity. But I wager someone is bound to ask for some
> >      >     numbers to
> >      >     make sure proposal is not negatively affecting any other drivers.
> >      >
> >      >
> >      > Then let them ask.  Waving your hands vaguely in the direction of
> >     the
> >      > rest of DRM and saying "Uh, someone (not me) might object" is
> >     profoundly
> >      > unhelpful.  Sure, someone might.  That's why it's on dri-devel.
> > If you
> >      > think there's someone in particular who might have a useful
> >     opinion on
> >      > this, throw them in the CC so they don't miss the e-mail thread.
> >      >
> >      > Or are you asking for numbers?  If so, what numbers are you
> >     asking for?
> > 
> >     It was a heads up to the Xe team in case people weren't appreciating
> >     how
> >     the proposed change has the potential to influence power and performance
> >     across the board. And nothing in the follow up discussion made me think
> >     it was considered so I don't think it was redundant to raise it.
> > 
> >     In my experience it is typical that such core changes come with some
> >     numbers. Which is in case of drm scheduler is tricky and probably
> >     requires explicitly asking everyone to test (rather than count on
> >     "don't
> >     miss the email thread"). Real products can fail to ship due ten mW here
> >     or there. Like suddenly an extra core prevented from getting into deep
> >     sleep.
> > 
> >     If that was "profoundly unhelpful" so be it.
> > 
> > 
> > With your above explanation, it makes more sense what you're asking.
> > It's still not something Matt is likely to be able to provide on his
> > own.  We need to tag some other folks and ask them to test it out.  We
> > could play around a bit with it on Xe but it's not exactly production
> > grade yet and is going to hit this differently from most.  Likely
> > candidates are probably AMD and Freedreno.
> 
> Whoever is setup to check out power and performance would be good to give it
> a spin, yes.
> 
> PS. I don't think I was asking Matt to test with other devices. To start
> with I think Xe is a team effort. I was asking for more background on the
> design decision since patch 4/20 does not say anything on that angle, nor
> was it IMO sufficiently addressed later in the thread.
> 
> >      > Also, If we're talking about a design that might paint us into an
> >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> > We're
> >      > talking about switching which kernel threading/task mechanism to
> >     use for
> >      > what's really a very generic problem.  The core Xe design works
> >     without
> >      > this patch (just with more kthreads).  If we land this patch or
> >      > something like it and get it wrong and it causes a performance
> >     problem
> >      > for someone down the line, we can revisit it.
> > 
> >     For some definition of "it works" - I really wouldn't suggest
> >     shipping a
> >     kthread per user context at any point.
> > 
> > 
> > You have yet to elaborate on why. What resources is it consuming that's
> > going to be a problem? Are you anticipating CPU affinity problems? Or
> > does it just seem wasteful?
> 
> Well I don't know, commit message says the approach does not scale. :)
>

I don't think we want a user interface to be able to directly create a
kthread; that seems like a bad idea, which Christian pointed out to us
off the list last March.
 
> > I think I largely agree that it's probably unnecessary/wasteful but
> > reducing the number of kthreads seems like a tractable problem to solve
> > regardless of where we put the gpu_scheduler object.  Is this the right
> > solution?  Maybe not.  It was also proposed at one point that we could
> > split the scheduler into two pieces: A scheduler which owns the kthread,
> > and a back-end which targets some HW ring thing where you can have
> > multiple back-ends per scheduler.  That's certainly more invasive from a
> > DRM scheduler internal API PoV but would solve the kthread problem in a
> > way that's more similar to what we have now.
> > 
> >      >     In any case that's a low level question caused by the high
> >     level design
> >      >     decision. So I'd think first focus on the high level - which
> >     is the 1:1
> >      >     mapping of entity to scheduler instance proposal.
> >      >
> >      >     Fundamentally it will be up to the DRM maintainers and the
> >     community to
> >      >     bless your approach. And it is important to stress 1:1 is about
> >      >     userspace contexts, so I believe unlike any other current
> >     scheduler
> >      >     user. And also important to stress this effectively does not
> >     make Xe
> >      >     _really_ use the scheduler that much.
> >      >
> >      >
> >      > I don't think this makes Xe nearly as much of a one-off as you
> >     think it
> >      > does.  I've already told the Asahi team working on Apple M1/2
> >     hardware
> >      > to do it this way and it seems to be a pretty good mapping for
> >     them. I
> >      > believe this is roughly the plan for nouveau as well.  It's not
> >     the way
> >      > it currently works for anyone because most other groups aren't
> >     doing FW
> >      > scheduling yet.  In the world of FW scheduling and hardware
> >     designed to
> >      > support userspace direct-to-FW submit, I think the design makes
> >     perfect
> >      > sense (see below) and I expect we'll see more drivers move in this
> >      > direction as those drivers evolve.  (AMD is doing some customish
> >     thing
> >      > with gpu_scheduler on the front-end somehow. I've not dug
> >     into
> >      > those details.)
> >      >
> >      >     I can only offer my opinion, which is that the two options
> >     mentioned in
> >      >     this thread (either improve drm scheduler to cope with what is
> >      >     required,
> >      >     or split up the code so you can use just the parts of
> >     drm_sched which
> >      >     you want - which is frontend dependency tracking) shouldn't be so
> >      >     readily dismissed, given how I think the idea was for the new
> >     driver to
> >      >     work less in a silo and more in the community (not do kludges to
> >      >     workaround stuff because it is thought to be too hard to
> >     improve common
> >      >     code), but fundamentally, "goto previous paragraph" for what I am
> >      >     concerned.
> >      >
> >      >
> >      > Meta comment:  It appears as if you're falling into the standard
> >     i915
> >      > team trap of having an internal discussion about what the community
> >      > discussion might look like instead of actually having the community
> >      > discussion.  If you are seriously concerned about interactions with
> >      > other drivers or with setting common direction, the right
> >     way to
> >      > do that is to break a patch or two out into a separate RFC series
> >     and
> >      > tag a handful of driver maintainers.  Trying to predict the
> >     questions
> >      > other people might ask is pointless. Cc them and ask for their
> >     input
> >      > instead.
> > 
> >     I don't follow you here. It's not an internal discussion - I am raising
> >     my concerns on the design publicly. I am supposed to write a patch to
> >     show something, but not allowed to comment on an RFC series?
> > 
> > 
> > I may have misread your tone a bit.  It felt a bit like too many
> > discussions I've had in the past where people are trying to predict what
> > others will say instead of just asking them.  Reading it again, I was
> > probably jumping to conclusions a bit.  Sorry about that.
> 
> Okay no problem, thanks. In any case we don't have to keep discussing it;
> as I wrote one or two emails ago, it is fundamentally on the maintainers
> and community to ack the approach. I only felt the RFC did not explain the
> potential downsides sufficiently so I wanted to probe that area a bit.
> 
> >     It is "drm/sched: Convert drm scheduler to use a work queue rather than
> >     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> > 
> > 
> > Yeah, it probably should have.  I think that's mostly what I've been
> > trying to say.
> > 
> >      >
> >      >     Regards,
> >      >
> >      >     Tvrtko
> >      >
> >      >     P.S. And as a related side note, there are more areas where
> >     drm_sched
> >      >     could be improved, like for instance priority handling.
> >      >     Take a look at msm_submitqueue_create /
> >     msm_gpu_convert_priority /
> >      >     get_sched_entity to see how msm works around the drm_sched
> >     hardcoded
> >      >     limit of available priority levels, in order to avoid having
> >     to leave a
> >      >     hw capability unused. I suspect msm would be happier if they
> >     could have
> >      >     all priority levels equal in terms of whether they apply only
> >     at the
> >      >     frontend level or completely throughout the pipeline.
> >      >
> >      >      > [1]
> >      >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >      >      >
> >      >      >>> What would be interesting to learn is whether the option of
> >      >     refactoring
> >      >      >>> drm_sched to deal with out of order completion was
> >     considered
> >      >     and what were
> >      >      >>> the conclusions.
> >      >      >>>
> >      >      >>
> >      >      >> I coded this up a while back when trying to convert the
> >      >      >> i915 to the DRM scheduler; it isn't all that hard either.
> >      >      >> The free flow control on the ring (e.g. set job limit ==
> >      >      >> SIZE OF RING / MAX JOB SIZE) is really what sold me on
> >      >      >> this design.
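A sketch of that flow control with assumed sizes (the real limit depends
on the LRC ring size and the worst-case job size):

        #define RING_SIZE       SZ_16K  /* assumed LRC ring size */
        #define MAX_JOB_BYTES   SZ_1K   /* assumed worst-case job */

        /*
         * Job limit == SIZE OF RING / MAX JOB SIZE: with at most this
         * many jobs in flight, the ring can never be over-committed.
         */
        #define HW_SUBMISSION_LIMIT     (RING_SIZE / MAX_JOB_BYTES)

Handing that to the scheduler as its submission limit means the
front-end rate-limits the ring for free.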
> >      >
> >      >
> >      > You're not the only one to suggest supporting out-of-order
> >     completion.
> >      > However, it's tricky and breaks a lot of internal assumptions of the
> >      > scheduler. It also reduces functionality a bit because it can no
> >     longer
> >      > automatically rate-limit HW/FW queues which are often
> >     fixed-size.  (Ok,
> >      > yes, it probably could but it becomes a substantially harder
> >     problem.)
> >      >
> >      > It also seems like a worse mapping to me.  The goal here is to turn
> >      > submissions on a userspace-facing engine/queue into submissions
> >     to a FW
> >      > queue, sorting out any dma_fence dependencies.  Matt's
> >      > description of saying this is a 1:1 mapping between sched/entity
> >     doesn't
> >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> >     mapping?
> >      > Why is that better?
> > 
> >     As I have stated before, what I think would fit well for Xe is one
> >     drm_scheduler per engine class. In specific terms on our current
> >     hardware, one drm scheduler instance for render, compute, blitter,
> >     video
> >     and video enhance. Userspace contexts remain scheduler entities.
> > 
> > 
> > And this is where we fairly strongly disagree.  More in a bit.
> > 
> >     That way you avoid the whole kthread/kworker story and you have it
> >     actually use the entity picking code in the scheduler, which may be
> >     useful when the backend is congested.
> > 
> > 
> > What back-end congestion are you referring to here?  Running out of FW
> > queue IDs?  Something else?
> 
> CT channel, number of context ids.
> 
> > 
> >     Yes you have to solve the out of order problem so in my mind that is
> >     something to discuss. What the problem actually is (just TDR?), how
> >     tricky and why etc.
> > 
> >     And yes you lose the handy LRCA ring buffer size management so you'd
> >     have to make those entities not runnable in some other way.
> > 
> >     Regarding the argument you raise below - would any of that make the
> >     frontend / backend separation worse and why? Do you think it is less
> >     natural? If neither is true then all that remains is that it appears the
> >     extra work to support out of order completion of entities has been discounted
> >     in favour of an easy but IMO inelegant option.
> > 
> > 
> > Broadly speaking, the kernel needs to stop thinking about GPU scheduling
> > in terms of scheduling jobs and start thinking in terms of scheduling
> > contexts/engines.  There is still some need for scheduling individual
> > jobs but that is only for the purpose of delaying them as needed to
> > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > get shoved onto the context/engine queue and from there the kernel only
> > really manages whole contexts/engines.  This is a major architectural
> > shift, entirely different from the way i915 scheduling works.  It's also
> > different from the historical usage of DRM scheduler which I think is
> > why this all looks a bit funny.
> > 
> > To justify this architectural shift, let's look at where we're headed.
> > In the glorious future...
> > 
> >   1. Userspace submits directly to firmware queues.  The kernel has no
> > visibility whatsoever into individual jobs.  At most it can pause/resume
> > FW contexts as needed to handle eviction and memory management.
> > 
> >   2. Because of 1, apart from handing out the FW queue IDs at the
> > beginning, the kernel can't really juggle them that much.  Depending on
> > FW design, it may be able to pause a client, give its IDs to another,
> > and then resume it later when IDs free up.  What it's not doing is
> > juggling IDs on a job-by-job basis like i915 currently is.
> > 
> >   3. Long-running compute jobs may not complete for days.  This means
> > that memory management needs to happen in terms of pause/resume of
> > entire contexts/engines using the memory rather than based on waiting
> > for individual jobs to complete or pausing individual jobs until the
> > memory is available.
> > 
> >   4. Synchronization happens via userspace memory fences (UMF) and the
> > kernel is mostly unaware of most dependencies and when a context/engine
> > is or is not runnable.  Instead, it keeps as many of them minimally
> > active (memory is available, even if it's in system RAM) as possible and
> > lets the FW sort out dependencies.  (There may need to be some facility
> > for sleeping a context until a memory change similar to futex() or
> > poll() for userspace threads.  There are some details TBD.)
> > 
> > Are there potential problems that will need to be solved here?  Yes.  Is
> > it a good design?  Well, Microsoft has been living in this future for
> > half a decade or better and it's working quite well for them.  It's also
> > the way all modern game consoles work.  It really is just Linux that's
> > stuck with the same old job model we've had since the monumental shift
> > to DRI2.
> > 
> > To that end, one of the core goals of the Xe project was to make the
> > driver internally behave as close to the above model as possible while
> > keeping the old-school job model as a very thin layer on top.  As the
> > broader ecosystem problems (window-system support for UMF, for instance)
> > are solved, that layer can be peeled back.  The core driver will already
> > be ready for it.
> > 
> > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > and stuff jobs into their respective context/engine queue once they're
> > ready.  All the actual scheduling happens in firmware and any scheduling
> > the kernel does to deal with contention, oversubscriptions, too many
> > contexts, etc. is between contexts/engines, not individual jobs.  Sure,
> > the individual job visibility is nice, but if we design around it, we'll
> > never get to the glorious future.
> > 
> > I really need to turn the above (with a bit more detail) into a blog
> > post.... Maybe I'll do that this week.
> > 
> > In any case, I hope that provides more insight into why Xe is designed
> > the way it is and why I'm pushing back so hard on trying to make it more
> > of a "classic" driver as far as scheduling is concerned.  Are there
> > potential problems here?  Yes, that's why Xe has been labeled a
> > prototype.  Are such radical changes necessary to get to said glorious
> > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> 
> Right, that's all solid I think. My takeaway is that frontend priority
> sorting and that stuff isn't needed and that is okay. And that there are
> multiple options to maybe improve drm scheduler, like the aforementioned
> making it deal with out of order completion, or splitting it into functional
> components, or splitting frontend/backend as you suggested. For most of them
> the cost vs benefit is more or less unclear, as is how much effort was
> invested to look into them.
> 
> One thing I missed from this explanation is how drm_scheduler per engine
> class interferes with the high level concepts. And I did not manage to pick
> up on what exactly the TDR problem is in that case. Maybe the two are one
> and the same.
> 
> Bottom line is I still have the concern that conversion to kworkers has an
> opportunity to regress. Possibly more opportunity for some Xe use cases than
> to affect other vendors, since they would still be using per physical engine
> / queue scheduler instances.
> 

We certainly don't want to affect other vendors, but I haven't yet heard
any push back from them. I don't think speculating about potential
problems is helpful.

> And to put my money where my mouth is I will try to put testing Xe inside
> the full blown ChromeOS environment in my team plans. It would probably also
> be beneficial if Xe team could take a look at real world behaviour of the
> extreme transcode use cases too. If the stack is ready for that and all. It
> would be better to know earlier rather than later if there is a fundamental
> issue.
>

We don't have a media UMD yet, so it will be tough to test at this point
in time. I'm also not sure when Xe is going to be POR for a Chrome
product, so porting Xe into ChromeOS likely isn't a top priority for
your team. I know from experience that porting things into ChromeOS
isn't trivial, as I've supported several of these efforts. Not saying
don't do this, just mentioning the realities of what you are suggesting.

Matt

> For the patch at hand, and the cover letter, it certainly feels it would
> benefit to record the past design discussion had with AMD folks, to
> explicitly copy other drivers, and to record the theoretical pros and cons
> of threads vs unbound workers as I have tried to highlight them.
> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-11 19:40                           ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 19:40 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Wed, Jan 11, 2023 at 08:50:37AM +0000, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com <mailto:tvrtko.ursulin@linux.intel.com>>
> > wrote:
> > 
> > 
> > 
> >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> >     [snip]
> > 
> >      >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> >      >     contexts (per
> >      >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> >      >     invasive changes
> >      >      >>> to the shared code is in the spirit of the overall idea
> >     and instead
> >      >      >>> opportunity should be used to look at way to
> >     refactor/improve
> >      >     drm_sched.
> >      >
> >      >
> >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> >     all or
> >      > really needs to drive a re-factor.  (More on that later.)
> > There's only
> >      > one real issue which is that it fires off potentially a lot of
> >     kthreads.
> >      > Even that's not that bad given that kthreads are pretty light and
> >     you're
> >      > not likely to have more kthreads than userspace threads which are
> >     much
> >      > heavier.  Not ideal, but not the end of the world either.
> > Definitely
> >      > something we can/should optimize but if we went through with Xe
> >     without
> >      > this patch, it would probably be mostly ok.
> >      >
> >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >      >>
> >      >      >> I'm not really prepared to make large changes to DRM
> >     scheduler
> >      >     at the
> >      >      >> moment for Xe as they are not really required nor does Boris
> >      >     seem they
> >      >      >> will be required for his work either. I am interested to see
> >      >     what Boris
> >      >      >> comes up with.
> >      >      >>
> >      >      >>> Even on the low level, the idea to replace drm_sched threads
> >      >     with workers
> >      >      >>> has a few problems.
> >      >      >>>
> >      >      >>> To start with, the pattern of:
> >      >      >>>
> >      >      >>>    while (not_stopped) {
> >      >      >>>     keep picking jobs
> >      >      >>>    }
> >      >      >>>
> >      >      >>> Feels fundamentally in disagreement with workers (while
> >      >     obviously fits
> >      >      >>> perfectly with the current kthread design).
> >      >      >>
> >      >      >> The while loop breaks and worker exists if no jobs are ready.
> >      >
> >      >
> >      > I'm not very familiar with workqueues. What are you saying would fit
> >      > better? One scheduling job per work item rather than one big work
> >     item
> >      > which handles all available jobs?
> > 
> >     Yes and no, it indeed IMO does not fit to have a work item which is
> >     potentially unbound in runtime. But it is a bit moot conceptual
> >     mismatch
> >     because it is a worst case / theoretical, and I think due more
> >     fundamental concerns.
> > 
> >     If we have to go back to the low level side of things, I've picked this
> >     random spot to consolidate what I have already mentioned and perhaps
> >     expand.
> > 
> >     To start with, let me pull out some thoughts from workqueue.rst:
> > 
> >     """
> >     Generally, work items are not expected to hog a CPU and consume many
> >     cycles. That means maintaining just enough concurrency to prevent work
> >     processing from stalling should be optimal.
> >     """
> > 
> >     For unbound queues:
> >     """
> >     The responsibility of regulating concurrency level is on the users.
> >     """
> > 
> >     Given the unbound queues will be spawned on demand to service all
> >     queued
> >     work items (more interesting when mixing up with the
> >     system_unbound_wq),
> >     in the proposed design the number of instantiated worker threads does
> >     not correspond to the number of user threads (as you have elsewhere
> >     stated), but pessimistically to the number of active user contexts.
> > 
> > 
> > Those are pretty much the same in practice.  Rather, user threads is
> > typically an upper bound on the number of contexts.  Yes, a single user
> > thread could have a bunch of contexts but basically nothing does that
> > except IGT.  In real-world usage, it's at most one context per user
> > thread.
> 
> Typically is the key here. But I am not sure it is good enough. Consider
> this example - Intel Flex 170:
> 
>  * Delivers up to 36 streams 1080p60 transcode throughput per card.
>  * When scaled to 10 cards in a 4U server configuration, it can support up
> to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> 
> One transcode stream from my experience typically is 3-4 GPU contexts
> (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single CPU
> thread. 4 contexts * 36 streams = 144 active contexts. Multiply by 60fps =
> 8640 jobs submitted and completed per second.
> 

See my reply with my numbers based running xe_exec_threads, on a TGL we
are getting 33711 jobs per sec /w 640 xe_engines. This seems to scale
just fine.

> 144 active contexts in the proposed scheme means possibly means 144 kernel
> worker threads spawned (driven by 36 transcode CPU threads). (I don't think
> the pools would scale down given all are constantly pinged at 60fps.)
> 
> And then each of 144 threads goes to grab the single GuC CT mutex. First
> threads are being made schedulable, then put to sleep as mutex contention is
> hit, then woken again as mutexes are getting released, rinse, repeat.
> 
> (And yes this backend contention is there regardless of 1:1:1, it would
> require a different re-design to solve that. But it is just a question
> whether there are 144 contending threads, or just 6 with the thread per
> engine class scheme.)
> 
> Then multiply all by 10 for a 4U server use case and you get 1440 worker
> kthreads, yes 10 more CT locks, but contending on how many CPU cores? Just
> so they can grab a timeslice and maybe content on a mutex as the next step.
>

Same as above, this seems to scale just fine as I bet the above example
of 33711 job per sec is limited by a single GuC context switching rather
than Xe being about to feed the GuC. Also certainly a server in this
configuration is going to a CPU much faster than the TGL I was using.

Also did another quick change to use 1280 xe_engines in xe_exec_threads:
root@DUT025-TGLU:igt-gpu-tools# xe_exec_threads --r threads-basic
IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
Starting subtest: threads-basic
Subtest threads-basic: SUCCESS (1.198s)

More or less same results as 640 xe_engines.
 
> This example is where it would hurt on large systems. Imagine only an even
> wider media transcode card...
> 
> Second example is only a single engine class used (3d desktop?) but with a
> bunch of not-runnable jobs queued and waiting on a fence to signal. Implicit
> or explicit dependencies doesn't matter. Then the fence signals and call
> backs run. N work items get scheduled, but they all submit to the same HW
> engine. So we end up with:
> 
>         /-- wi1 --\
>        / ..     .. \
>  cb --+---  wi.. ---+-- rq1 -- .. -- rqN
>        \ ..    ..  /
>         \-- wiN --/
> 
> 
> All that we have achieved is waking up N CPUs to contend on the same lock
> and effectively insert the job into the same single HW queue. I don't see
> any positives there.
>

I've said this before, the CT channel in practice isn't going to be full
so the section of code protected by the mutex is really, really small.
The mutex really shouldn't ever have contention. Also does a mutex spin
for small period of time before going to sleep? I seem to recall some
type of core lock did this, if we can use a lock that spins for short
period of time this argument falls apart.
 
> This example I think can particularly hurt small / low power devices because
> of needless waking up of many cores for no benefit. Granted, I don't have a
> good feel on how common this pattern is in practice.
> 
> > 
> >     That
> >     is the number which drives the maximum number of not-runnable jobs that
> >     can become runnable at once, and hence spawn that many work items, and
> >     in turn unbound worker threads.
> > 
> >     Several problems there.
> > 
> >     It is fundamentally pointless to have potentially that many more
> >     threads
> >     than the number of CPU cores - it simply creates a scheduling storm.
> > 
> >     Unbound workers have no CPU / cache locality either and no connection
> >     with the CPU scheduler to optimize scheduling patterns. This may matter
> >     either on large systems or on small ones. Whereas the current design
> >     allows for scheduler to notice userspace CPU thread keeps waking up the
> >     same drm scheduler kernel thread, and so it can keep them on the same
> >     CPU, the unbound workers lose that ability and so 2nd CPU might be
> >     getting woken up from low sleep for every submission.
> > 
> >     Hence, apart from being a bit of a impedance mismatch, the proposal has
> >     the potential to change performance and power patterns and both large
> >     and small machines.
> > 
> > 
> > Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
> > deferred kwork does appear to mismatch somewhat with what the scheduler
> > needs or at least how it's worked in the past.  How much impact will
> > that mismatch have?  Unclear.
> > 
> >      >      >>> Secondly, it probably demands separate workers (not
> >     optional),
> >      >     otherwise
> >      >      >>> behaviour of shared workqueues has either the potential to
> >      >     explode number
> >      >      >>> kernel threads anyway, or add latency.
> >      >      >>>
> >      >      >>
> >      >      >> Right now the system_unbound_wq is used which does have a
> >     limit
> >      >     on the
> >      >      >> number of threads, right? I do have a FIXME to allow a
> >     worker to be
> >      >      >> passed in similar to TDR.
> >      >      >>
> >      >      >> WRT to latency, the 1:1 ratio could actually have lower
> >     latency
> >      >     as 2 GPU
> >      >      >> schedulers can be pushing jobs into the backend / cleaning up
> >      >     jobs in
> >      >      >> parallel.
> >      >      >>
> >      >      >
> >      >      > Thought of one more point here where why in Xe we
> >     absolutely want
> >      >     a 1 to
> >      >      > 1 ratio between entity and scheduler - the way we implement
> >      >     timeslicing
> >      >      > for preempt fences.
> >      >      >
> >      >      > Let me try to explain.
> >      >      >
> >      >      > Preempt fences are implemented via the generic messaging
> >      >     interface [1]
> >      >      > with suspend / resume messages. If a suspend messages is
> >     received to
> >      >      > soon after calling resume (this is per entity) we simply
> >     sleep in the
> >      >      > suspend call thus giving the entity a timeslice. This
> >     completely
> >      >     falls
> >      >      > apart with a many to 1 relationship as now a entity
> >     waiting for a
> >      >      > timeslice blocks the other entities. Could we work aroudn
> >     this,
> >      >     sure but
> >      >      > just another bunch of code we'd have to add in Xe. Being to
> >      >     freely sleep
> >      >      > in backend without affecting other entities is really, really
> >      >     nice IMO
> >      >      > and I bet Xe isn't the only driver that is going to feel
> >     this way.
> >      >      >
> >      >      > Last thing I'll say regardless of how anyone feels about
> >     Xe using
> >      >     a 1 to
> >      >      > 1 relationship this patch IMO makes sense as I hope we can all
> >      >     agree a
> >      >      > workqueue scales better than kthreads.
> >      >
> >      >     I don't know for sure what will scale better and for what use
> >     case,
> >      >     combination of CPU cores vs number of GPU engines to keep
> >     busy vs other
> >      >     system activity. But I wager someone is bound to ask for some
> >      >     numbers to
> >      >     make sure proposal is not negatively affecting any other drivers.
> >      >
> >      >
> >      > Then let them ask.  Waving your hands vaguely in the direction of
> >     the
> >      > rest of DRM and saying "Uh, someone (not me) might object" is
> >     profoundly
> >      > unhelpful.  Sure, someone might.  That's why it's on dri-devel.
> > If you
> >      > think there's someone in particular who might have a useful
> >     opinion on
> >      > this, throw them in the CC so they don't miss the e-mail thread.
> >      >
> >      > Or are you asking for numbers?  If so, what numbers are you
> >     asking for?
> > 
> >     It was a heads up to the Xe team in case people weren't appreciating
> >     how
> >     the proposed change has the potential influence power and performance
> >     across the board. And nothing in the follow up discussion made me think
> >     it was considered so I don't think it was redundant to raise it.
> > 
> >     In my experience it is typical that such core changes come with some
> >     numbers. Which is in case of drm scheduler is tricky and probably
> >     requires explicitly asking everyone to test (rather than count on
> >     "don't
> >     miss the email thread"). Real products can fail to ship due ten mW here
> >     or there. Like suddenly an extra core prevented from getting into deep
> >     sleep.
> > 
> >     If that was "profoundly unhelpful" so be it.
> > 
> > 
> > With your above explanation, it makes more sense what you're asking.
> > It's still not something Matt is likely to be able to provide on his
> > own.  We need to tag some other folks and ask them to test it out.  We
> > could play around a bit with it on Xe but it's not exactly production
> > grade yet and is going to hit this differently from most.  Likely
> > candidates are probably AMD and Freedreno.
> 
> Whoever is setup to check out power and performance would be good to give it
> a spin, yes.
> 
> PS. I don't think I was asking Matt to test with other devices. To start
> with I think Xe is a team effort. I was asking for more background on the
> design decision since patch 4/20 does not say anything on that angle, nor
> later in the thread it was IMO sufficiently addressed.
> 
> >      > Also, If we're talking about a design that might paint us into an
> >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> > We're
> >      > talking about switching which kernel threading/task mechanism to
> >     use for
> >      > what's really a very generic problem.  The core Xe design works
> >     without
> >      > this patch (just with more kthreads).  If we land this patch or
> >      > something like it and get it wrong and it causes a performance
> >     problem
> >      > for someone down the line, we can revisit it.
> > 
> >     For some definition of "it works" - I really wouldn't suggest
> >     shipping a
> >     kthread per user context at any point.
> > 
> > 
> > You have yet to elaborate on why. What resources is it consuming that's
> > going to be a problem? Are you anticipating CPU affinity problems? Or
> > does it just seem wasteful?
> 
> Well I don't know, commit message says the approach does not scale. :)
>

I don't think we want a user interface to directly be able to create a
kthread, that seems like a bad idea which Christian pointed out to us
off the list last March.
 
> > I think I largely agree that it's probably unnecessary/wasteful but
> > reducing the number of kthreads seems like a tractable problem to solve
> > regardless of where we put the gpu_scheduler object.  Is this the right
> > solution?  Maybe not.  It was also proposed at one point that we could
> > split the scheduler into two pieces: A scheduler which owns the kthread,
> > and a back-end which targets some HW ring thing where you can have
> > multiple back-ends per scheduler.  That's certainly more invasive from a
> > DRM scheduler internal API PoV but would solve the kthread problem in a
> > way that's more similar to what we have now.
> > 
> >      >     In any case that's a low level question caused by the high
> >     level design
> >      >     decision. So I'd think first focus on the high level - which
> >     is the 1:1
> >      >     mapping of entity to scheduler instance proposal.
> >      >
> >      >     Fundamentally it will be up to the DRM maintainers and the
> >     community to
> >      >     bless your approach. And it is important to stress 1:1 is about
> >      >     userspace contexts, so I believe unlike any other current
> >     scheduler
> >      >     user. And also important to stress this effectively does not
> >     make Xe
> >      >     _really_ use the scheduler that much.
> >      >
> >      >
> >      > I don't think this makes Xe nearly as much of a one-off as you
> >     think it
> >      > does.  I've already told the Asahi team working on Apple M1/2
> >     hardware
> >      > to do it this way and it seems to be a pretty good mapping for
> >     them. I
> >      > believe this is roughly the plan for nouveau as well.  It's not
> >     the way
> >      > it currently works for anyone because most other groups aren't
> >     doing FW
> >      > scheduling yet.  In the world of FW scheduling and hardware
> >     designed to
> >      > support userspace direct-to-FW submit, I think the design makes
> >     perfect
> >      > sense (see below) and I expect we'll see more drivers move in this
> >      > direction as those drivers evolve.  (AMD is doing some customish
> >     thing
> >      > for how with gpu_scheduler on the front-end somehow. I've not dug
> >     into
> >      > those details.)
> >      >
> >      >     I can only offer my opinion, which is that the two options
> >     mentioned in
> >      >     this thread (either improve drm scheduler to cope with what is
> >      >     required,
> >      >     or split up the code so you can use just the parts of
> >     drm_sched which
> >      >     you want - which is frontend dependency tracking) shouldn't be so
> >      >     readily dismissed, given how I think the idea was for the new
> >     driver to
> >      >     work less in a silo and more in the community (not do kludges to
> >      >     workaround stuff because it is thought to be too hard to
> >     improve common
> >      >     code), but fundamentally, "goto previous paragraph" for what I am
> >      >     concerned.
> >      >
> >      >
> >      > Meta comment:  It appears as if you're falling into the standard
> >     i915
> >      > team trap of having an internal discussion about what the community
> >      > discussion might look like instead of actually having the community
> >      > discussion.  If you are seriously concerned about interactions with
> >      > other drivers or whether or setting common direction, the right
> >     way to
> >      > do that is to break a patch or two out into a separate RFC series
> >     and
> >      > tag a handful of driver maintainers.  Trying to predict the
> >     questions
> >      > other people might ask is pointless. Cc them and asking for their
> >     input
> >      > instead.
> > 
> >     I don't follow you here. It's not an internal discussion - I am raising
> >     my concerns on the design publicly. I am supposed to write a patch to
> >     show something, but am allowed to comment on a RFC series?
> > 
> > 
> > I may have misread your tone a bit.  It felt a bit like too many
> > discussions I've had in the past where people are trying to predict what
> > others will say instead of just asking them.  Reading it again, I was
> > probably jumping to conclusions a bit.  Sorry about that.
> 
> Okay no problem, thanks. In any case we don't have to keep discussing it,
> since I wrote one or two emails ago it is fundamentally on the maintainers
> and community to ack the approach. I only felt like RFC did not explain the
> potential downsides sufficiently so I wanted to probe that area a bit.
> 
> >     It is "drm/sched: Convert drm scheduler to use a work queue rather than
> >     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> > 
> > 
> > Yeah, it probably should have.  I think that's mostly what I've been
> > trying to say.
> > 
> >      >
> >      >     Regards,
> >      >
> >      >     Tvrtko
> >      >
> >      >     P.S. And as a related side note, there are more areas where
> >     drm_sched
> >      >     could be improved, like for instance priority handling.
> >      >     Take a look at msm_submitqueue_create /
> >     msm_gpu_convert_priority /
> >      >     get_sched_entity to see how msm works around the drm_sched
> >     hardcoded
> >      >     limit of available priority levels, in order to avoid having
> >     to leave a
> >      >     hw capability unused. I suspect msm would be happier if they
> >     could have
> >      >     all priority levels equal in terms of whether they apply only
> >     at the
> >      >     frontend level or completely throughout the pipeline.
> >      >
> >      >      > [1]
> >      >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> >      >
> >  <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>
> >      >      >
> >      >      >>> What would be interesting to learn is whether the option of
> >      >     refactoring
> >      >      >>> drm_sched to deal with out of order completion was
> >     considered
> >      >     and what were
> >      >      >>> the conclusions.
> >      >      >>>
> >      >      >>
> >      >      >> I coded this up a while back when trying to convert the
> >     i915 to
> >      >     the DRM
> >      >      >> scheduler it isn't all that hard either. The free flow
> >     control
> >      >     on the
> >      >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE) is
> >      >     really what
> >      >      >> sold me on the this design.
> >      >
> >      >
> >      > You're not the only one to suggest supporting out-of-order
> >     completion.
> >      > However, it's tricky and breaks a lot of internal assumptions of the
> >      > scheduler. It also reduces functionality a bit because it can no
> >     longer
> >      > automatically rate-limit HW/FW queues which are often
> >     fixed-size.  (Ok,
> >      > yes, it probably could but it becomes a substantially harder
> >     problem.)
> >      >
> >      > It also seems like a worse mapping to me.  The goal here is to turn
> >      > submissions on a userspace-facing engine/queue into submissions
> >     to a FW
> >      > queue submissions, sorting out any dma_fence dependencies.  Matt's
> >      > description of saying this is a 1:1 mapping between sched/entity
> >     doesn't
> >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> >     mapping?
> >      > Why is that better?
> > 
> >     As I have stated before, what I think what would fit well for Xe is one
> >     drm_scheduler per engine class. In specific terms on our current
> >     hardware, one drm scheduler instance for render, compute, blitter,
> >     video
> >     and video enhance. Userspace contexts remain scheduler entities.
> > 
> > 
> > And this is where we fairly strongly disagree.  More in a bit.
> > 
> >     That way you avoid the whole kthread/kworker story and you have it
> >     actually use the entity picking code in the scheduler, which may be
> >     useful when the backend is congested.
> > 
> > 
> > What back-end congestion are you referring to here?  Running out of FW
> > queue IDs?  Something else?
> 
> CT channel, number of context ids.
> 
> > 
> >     Yes you have to solve the out of order problem so in my mind that is
> >     something to discuss. What the problem actually is (just TDR?), how
> >     tricky and why etc.
> > 
> >     And yes you lose the handy LRCA ring buffer size management so you'd
> >     have to make those entities not runnable in some other way.
> > 
> >     Regarding the argument you raise below - would any of that make the
> >     frontend / backend separation worse and why? Do you think it is less
> >     natural? If neither is true then all remains is that it appears extra
> >     work to support out of order completion of entities has been discounted
> >     in favour of an easy but IMO inelegant option.
> > 
> > 
> > Broadly speaking, the kernel needs to stop thinking about GPU scheduling
> > in terms of scheduling jobs and start thinking in terms of scheduling
> > contexts/engines.  There is still some need for scheduling individual
> > jobs but that is only for the purpose of delaying them as needed to
> > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > get shoved onto the context/engine queue and from there the kernel only
> > really manages whole contexts/engines.  This is a major architectural
> > shift, entirely different from the way i915 scheduling works.  It's also
> > different from the historical usage of DRM scheduler which I think is
> > why this all looks a bit funny.
> > 
> > To justify this architectural shift, let's look at where we're headed.
> > In the glorious future...
> > 
> >   1. Userspace submits directly to firmware queues.  The kernel has no
> > visibility whatsoever into individual jobs.  At most it can pause/resume
> > FW contexts as needed to handle eviction and memory management.
> > 
> >   2. Because of 1, apart from handing out the FW queue IDs at the
> > beginning, the kernel can't really juggle them that much.  Depending on
> > FW design, it may be able to pause a client, give its IDs to another,
> > and then resume it later when IDs free up.  What it's not doing is
> > juggling IDs on a job-by-job basis like i915 currently is.
> > 
> >   3. Long-running compute jobs may not complete for days.  This means
> > that memory management needs to happen in terms of pause/resume of
> > entire contexts/engines using the memory rather than based on waiting
> > for individual jobs to complete or pausing individual jobs until the
> > memory is available.
> > 
> >   4. Synchronization happens via userspace memory fences (UMF) and the
> > kernel is mostly unaware of most dependencies and when a context/engine
> > is or is not runnable.  Instead, it keeps as many of them minimally
> > active (memory is available, even if it's in system RAM) as possible and
> > lets the FW sort out dependencies.  (There may need to be some facility
> > for sleeping a context until a memory change similar to futex() or
> > poll() for userspace threads.  There are some details TBD.)
> > 
> > Are there potential problems that will need to be solved here?  Yes.  Is
> > it a good design?  Well, Microsoft has been living in this future for
> > half a decade or better and it's working quite well for them.  It's also
> > the way all modern game consoles work.  It really is just Linux that's
> > stuck with the same old job model we've had since the monumental shift
> > to DRI2.
> > 
> > To that end, one of the core goals of the Xe project was to make the
> > driver internally behave as close to the above model as possible while
> > keeping the old-school job model as a very thin layer on top.  As the
> > broader ecosystem problems (window-system support for UMF, for instance)
> > are solved, that layer can be peeled back.  The core driver will already
> > be ready for it.
> > 
> > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > and stuff jobs into their respective context/engine queue once they're
> > ready.  All the actual scheduling happens in firmware and any scheduling
> > the kernel does to deal with contention, oversubscriptions, too many
> > contexts, etc. is between contexts/engines, not individual jobs.  Sure,
> > the individual job visibility is nice, but if we design around it, we'll
> > never get to the glorious future.
> > 
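To make the above concrete, the only job-level work left for the kernel
in this model is roughly the following pseudo-C (all names invented,
deliberately not actual Xe code):

    /* Delay the job until its dma_fence dependencies have signaled,
     * then hand it to its context's FW queue. No job-level scheduling
     * decision is taken here; the FW does that between contexts. */
    static void push_job(struct engine_queue *q, struct job *job)
    {
        struct dma_fence *dep = job_next_unsignaled_dep(job);

        if (dep) {
            /* fence callback re-runs push_job() when dep signals */
            defer_until_signaled(dep, q, job);
            return;
        }

        write_to_fw_ring(q, job);   /* FW schedules from here on */
    }
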
> > I really need to turn the above (with a bit more detail) into a blog
> > post.... Maybe I'll do that this week.
> > 
> > In any case, I hope that provides more insight into why Xe is designed
> > the way it is and why I'm pushing back so hard on trying to make it more
> > of a "classic" driver as far as scheduling is concerned.  Are there
> > potential problems here?  Yes, that's why Xe has been labeled a
> > prototype.  Are such radical changes necessary to get to said glorious
> > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> 
> Right, that's all solid I think. My takeaway is that frontend priority
> sorting and that stuff isn't needed, and that is okay. And that there are
> multiple options to maybe improve the drm scheduler, like the aforementioned
> making it deal with out-of-order completion, or splitting it into functional
> components, or splitting the frontend/backend as you suggested. For most of
> them the cost vs benefit is not completely clear, nor is it clear how much
> effort was invested to look into them.
> 
> One thing I missed from this explanation is how drm_scheduler per engine
> class interferes with the high level concepts. And I did not manage to pick
> up on what exactly the TDR problem is in that case. Maybe the two are one
> and the same.
> 
> Bottom line is I still have the concern that the conversion to kworkers has
> an opportunity to regress. Possibly more so for some Xe use cases than for
> other vendors, since they would still be using per physical engine / queue
> scheduler instances.
> 

We certainly don't want to affect other vendors, but I haven't yet heard
any push back from them. I don't think speculating about potential
problems is helpful.

> And to put my money where my mouth is, I will try to put testing Xe inside
> the full blown ChromeOS environment into my team's plans. It would probably
> also be beneficial if the Xe team could take a look at real world behaviour
> of the extreme transcode use cases too, if the stack is ready for that. It
> would be better to know earlier rather than later if there is a fundamental
> issue.
>

We don't have a media UMD yet, so it will be tough to test at this point
in time. Also not sure when Xe is going to be POR for a Chrome product
either, so porting Xe into ChromeOS likely isn't a top priority for your
team. I know from experience that porting things into ChromeOS isn't
trivial, as I've supported several of these efforts. Not saying don't do
this, just mentioning the realities of what you are suggesting.

Matt

> For the patch at hand, and the cover letter, it certainly feels they would
> benefit from recording the past design discussion had with AMD folks, from
> explicitly copying other drivers' maintainers, and from recording the
> theoretical pros and cons of threads vs unbound workers as I have tried to
> highlight them.
> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-10  8:46                   ` Boris Brezillon
@ 2023-01-11 21:47                     ` Daniel Vetter
  -1 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-01-11 21:47 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hi Daniel,
>
> On Mon, 9 Jan 2023 21:40:21 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
>
> > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > Hi Jason,
> > >
> > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > wrote:
> > > >
> > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > >
> > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > >
> > > > > > > > Hello Matthew,
> > > > > > > >
> > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > >
> > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > >
> > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > guaranteed to be the same as the completion order, even if targeting the
> > > > > > > > > same hardware engine. This is because in XE we have a firmware scheduler,
> > > > > > > > > the GuC, which is allowed to reorder, timeslice, and preempt submissions.
> > > > > > > > > If using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> > > > > > > > > TDR falls apart as the TDR expects submission order == completion order.
> > > > > > > > > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > > > > > > > > problem.
> > > > > > > >
> > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > because the integration with drm_sched was painful, but also because I
> > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler would probably
> > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > > > >
> > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > >
> > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > are met, and lastly a tracking mechanism of inflight jobs that need to
> > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > aside from the most basic level of not overflowing the submission ring
> > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > scheduler fits quite well.
> > > > >
> > > >
> > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > want here and what you need for Arm thanks to the number of FW queues
> > > > available. I don't remember the exact number of GuC queues but it's at
> > > > least 1k. This puts it in an entirely different class from what you have on
> > > > Mali. Roughly, there's about three categories here:
> > > >
> > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > categorization.)
> > > >
> > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > >
> > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > windows.
> > > >
> > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > here and will require different designs. For Xe, we really are just using
> > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > >
> > > > How do we deal with class 2? That's an interesting question.  We may
> > > > eventually want to break that off into a separate discussion and not litter
> > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > pretty reasonable solutions but they're going to look a bit different.
> > > >
> > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > Panfrost, I think we want something slightly different but can borrow some
> > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > juggles those and feeds the limited FW queues.  In the case where you have few
> > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > able to juggle if you run out.
> > >
> > > That's more or less what I do right now, except I don't use the
> > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > (queues with pending jobs that are not yet resident on a FW slot)
> > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > ticks being handled with a delayed-work (tick happening every X
> > > milliseconds when queues are waiting for a slot). It all seems very
> > > HW/FW-specific though, and I think it's a bit premature to try to
> > > generalize that part, but the dep-tracking logic implemented by
> > > drm_sched looked like something I could easily re-use, hence my
> > > interest in Xe's approach.
> >
> > So another option for these few fw queue slots schedulers would be to
> > treat them as vram and enlist ttm.
> >
> > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > handle idr (or xarray or whatever you want) and then help you with all the
> > pipelining (and the drm_sched then with sorting out dependencies). If you
> > then also preferentially "evict" low-priority queues you pretty much have
> > the perfect thing.
> >
> > Note that GuC with SR-IOV splits up the id space, and together with some
> > restrictions due to multi-engine contexts, media might also need all of
> > this.
> >
> > If you're balking at the idea of enlisting ttm just for fw queue
> > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > index allocation. Might be worth it to instead lift that into some sched
> > helper code.
>
> Would you mind pointing me to the amdgpu code you're mentioning here?
> Still have a hard time seeing what TTM has to do with scheduling, but I
> also don't know much about TTM, so I'll keep digging.

ttm is about moving stuff in and out of a limited space and gives you
some nice tooling for pipelining it all. It doesn't care whether that
space is vram or some limited id space. vmwgfx used ttm as an id manager
iirc.
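
Roughly the shape I have in mind, as a purely illustrative sketch (the
fw_* helpers are all made up):

    /* Treat the handful of FW slots as a tiny pool: allocate on demand,
     * "evict" a low-priority resident queue when the pool is exhausted. */
    static int fw_slot_get(struct fw_sched *s, struct fw_queue *q)
    {
        int slot = ida_alloc_max(&s->slot_ida, s->num_slots - 1, GFP_KERNEL);

        if (slot < 0) {
            struct fw_queue *victim = fw_lowest_prio_resident(s);

            fw_queue_suspend(victim);      /* pipelined behind fences */
            slot = victim->slot;
            victim->slot = -1;
        }

        q->slot = slot;
        fw_queue_resume(q, slot);
        return slot;
    }

What ttm buys you on top is the fence-based pipelining of those
suspend/resume steps, so the "eviction" doesn't stall the submitter.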

> > Either way there are two imo rather solid approaches available to sort this
> > out. And once you have that, then there shouldn't be any big difference in
> > driver design between fw with de facto unlimited queue ids, and those with
> > severe restrictions in the number of queues.
>
> Honestly, I don't think there's much difference between those two cases
> already. There's just a bunch of additional code to schedule queues on
> FW slots for the limited-number-of-FW-slots case, which, right now, is
> driver specific. The job queuing front-end pretty much achieves what
> drm_sched does already: queuing jobs to entities, checking deps,
> submitting jobs to HW (in our case, writing to the command stream ring
> buffer). Things start to differ after that point: once a scheduling
> entity has pending jobs, we add it to one of the runnable queues (one
> queue per prio) and kick the kernel-side timeslice-based scheduler to
> re-evaluate, if needed.
>
> I'm all for using generic code when it makes sense, even if that means
> adding this common code when it doesn't exist, but I don't want to be
> dragged into some major refactoring that might take years to land.
> Especially if pancsf is the first
> FW-assisted-scheduler-with-few-FW-slot driver.

I don't see where there's a major refactoring that you're getting dragged into?

Yes there's a huge sprawling discussion right now, but I think that's
just largely people getting confused.

Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
of code. See the amdgpu_vmid_grab stuff for the simplest starting
point.

And also yes, a scheduler frontend for dependency sorting shouldn't
really be that big a thing, so there's not going to be huge amounts of
code sharing in the end. It's the conceptual sharing, and sharing
stuff like drm_sched_entity to eventually build some cross-driver gpu
context stuff on top, that really is going to matter.

Also like I mentioned, at least in some cases i915-guc might also have
a need for fw scheduler slot allocation for a bunch of running things.

Finally, I'm a bit confused why you're building a time-sharing
scheduler in the kernel if you have one in the fw already. Or do I get
that part wrong?
-Daniel

> Here's a link to my WIP branch [1], and here is the scheduler logic
> [2] if you want to have a look. Don't pay too much attention to the
> driver uAPI (it's being redesigned).
>
> Regards,
>
> Boris
>
> [1]https://gitlab.freedesktop.org/bbrezillon/linux/-/tree/pancsf
> [2]https://gitlab.freedesktop.org/bbrezillon/linux/-/blob/pancsf/drivers/gpu/drm/pancsf/pancsf_sched.c



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11  8:50                         ` Tvrtko Ursulin
@ 2023-01-11 22:18                           ` Jason Ekstrand
  -1 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-11 22:18 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Matthew Brost, intel-gfx, dri-devel

On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
tvrtko.ursulin@linux.intel.com> wrote:

>
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com>
>
> > wrote:
> >
> >
> >
> >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> >
> >     [snip]
> >
> >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> created
> >      >     contexts (per
> >      >      >>> context _and_ engine) and drm_sched. I am not sure
> avoiding
> >      >     invasive changes
> >      >      >>> to the shared code is in the spirit of the overall idea
> >     and instead
> >      >      >>> opportunity should be used to look at way to
> >     refactor/improve
> >      >     drm_sched.
> >      >
> >      >
> >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> >     all or
> >      > really needs to drive a re-factor.  (More on that later.)
> >     There's only
> >      > one real issue which is that it fires off potentially a lot of
> >     kthreads.
> >      > Even that's not that bad given that kthreads are pretty light and
> >     you're
> >      > not likely to have more kthreads than userspace threads which are
> >     much
> >      > heavier.  Not ideal, but not the end of the world either.
> >     Definitely
> >      > something we can/should optimize but if we went through with Xe
> >     without
> >      > this patch, it would probably be mostly ok.
> >      >
> >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >      >>
> >      >      >> I'm not really prepared to make large changes to DRM
> >     scheduler
> >      >     at the
> >      >      >> moment for Xe as they are not really required nor does
> Boris
> >      >     seem they
> >      >      >> will be required for his work either. I am interested to
> see
> >      >     what Boris
> >      >      >> comes up with.
> >      >      >>
> >      >      >>> Even on the low level, the idea to replace drm_sched
> threads
> >      >     with workers
> >      >      >>> has a few problems.
> >      >      >>>
> >      >      >>> To start with, the pattern of:
> >      >      >>>
> >      >      >>>    while (not_stopped) {
> >      >      >>>     keep picking jobs
> >      >      >>>    }
> >      >      >>>
> >      >      >>> Feels fundamentally in disagreement with workers (while
> >      >     obviously fits
> >      >      >>> perfectly with the current kthread design).
> >      >      >>
> >      >      >> The while loop breaks and the worker exits if no jobs are
> ready.
> >      >
> >      >
> >      > I'm not very familiar with workqueues. What are you saying would
> fit
> >      > better? One scheduling job per work item rather than one big work
> >     item
> >      > which handles all available jobs?
> >
> >     Yes and no, it indeed IMO does not fit to have a work item which is
> >     potentially unbound in runtime. But it is a bit of a moot conceptual
> >     mismatch because it is a worst case / theoretical, and I think due to
> >     more fundamental concerns.
> >
> >     If we have to go back to the low level side of things, I've picked
> this
> >     random spot to consolidate what I have already mentioned and perhaps
> >     expand.
> >
> >     To start with, let me pull out some thoughts from workqueue.rst:
> >
> >     """
> >     Generally, work items are not expected to hog a CPU and consume many
> >     cycles. That means maintaining just enough concurrency to prevent
> work
> >     processing from stalling should be optimal.
> >     """
> >
> >     For unbound queues:
> >     """
> >     The responsibility of regulating concurrency level is on the users.
> >     """
> >
> >     Given the unbound queues will be spawned on demand to service all
> >     queued
> >     work items (more interesting when mixing up with the
> >     system_unbound_wq),
> >     in the proposed design the number of instantiated worker threads does
> >     not correspond to the number of user threads (as you have elsewhere
> >     stated), but pessimistically to the number of active user contexts.
> >
> >
> > Those are pretty much the same in practice.  Rather, user threads is
> > typically an upper bound on the number of contexts.  Yes, a single user
> > thread could have a bunch of contexts but basically nothing does that
> > except IGT.  In real-world usage, it's at most one context per user
> thread.
>
> Typically is the key here. But I am not sure it is good enough. Consider
> this example - Intel Flex 170:
>
>   * Delivers up to 36 streams 1080p60 transcode throughput per card.
>   * When scaled to 10 cards in a 4U server configuration, it can support
> up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
>

I had a feeling it was going to be media.... 😅


> One transcode stream from my experience typically is 3-4 GPU contexts
> (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> 60fps = 8640 jobs submitted and completed per second.
>
> 144 active contexts in the proposed scheme possibly means 144
> kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> don't think the pools would scale down given all are constantly pinged
> at 60fps.)
>
> And then each of 144 threads goes to grab the single GuC CT mutex. First
> threads are being made schedulable, then put to sleep as mutex
> contention is hit, then woken again as mutexes are getting released,
> rinse, repeat.
>

Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
back-end yet but I was under the impression that run_job() would mostly be
just shoving another packet into a ring buffer.  If we have to send the GuC
a message on the control ring every single time we submit a job, that's
pretty horrible.
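
Naively, I'd expect the fast path to be morally equivalent to this (a
sketch only; every name is invented and I'm not claiming this is what Xe
does):

    static struct dma_fence *sketch_run_job(struct drm_sched_job *sched_job)
    {
        struct sketch_job *job = to_sketch_job(sched_job);
        struct sketch_ctx *ctx = job->ctx;

        /* Copy the packet into this context's ring and ring the
         * doorbell; no CT message, no global lock. (Ring wrap handling
         * omitted for brevity.) */
        memcpy(ctx->ring_vaddr + ctx->tail, job->packet, job->size);
        ctx->tail = (ctx->tail + job->size) & (ctx->ring_size - 1);
        writel(ctx->tail, ctx->doorbell);

        return dma_fence_get(job->hw_fence);
    }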

--Jason


> (And yes this backend contention is there regardless of 1:1:1, it would
> require a different re-design to solve that. But it is just a question
> whether there are 144 contending threads, or just 6 with the thread per
> engine class scheme.)
>
> Then multiply all by 10 for a 4U server use case and you get 1440 worker
> kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> Just so they can grab a timeslice and maybe contend on a mutex as the
> next step.
>
> This example is where it would hurt on large systems. Imagine only an
> even wider media transcode card...
>
> Second example is only a single engine class used (3d desktop?) but with
> a bunch of not-runnable jobs queued and waiting on a fence to signal.
> Implicit or explicit dependencies doesn't matter. Then the fence signals
> and call backs run. N work items get scheduled, but they all submit to
> the same HW engine. So we end up with:
>
>          /-- wi1 --\
>         / ..     .. \
>   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
>         \ ..    ..  /
>          \-- wiN --/
>
>
> All that we have achieved is waking up N CPUs to contend on the same
> lock and effectively insert the job into the same single HW queue. I
> don't see any positives there.
>
> This example I think can particularly hurt small / low power devices
> because of needless waking up of many cores for no benefit. Granted, I
> don't have a good feel on how common this pattern is in practice.
>
> >
> >     That
> >     is the number which drives the maximum number of not-runnable jobs
> that
> >     can become runnable at once, and hence spawn that many work items,
> and
> >     in turn unbound worker threads.
> >
> >     Several problems there.
> >
> >     It is fundamentally pointless to have potentially that many more
> >     threads
> >     than the number of CPU cores - it simply creates a scheduling storm.
> >
> >     Unbound workers have no CPU / cache locality either and no connection
> >     with the CPU scheduler to optimize scheduling patterns. This may
> matter
> >     either on large systems or on small ones. Whereas the current design
> >     allows the scheduler to notice that a userspace CPU thread keeps
> >     waking up the same drm scheduler kernel thread, and so it can keep
> >     them on the same CPU, the unbound workers lose that ability and so a
> >     2nd CPU might be getting woken up from low sleep for every submission.
> >
> >     Hence, apart from being a bit of an impedance mismatch, the proposal
> has
> >     the potential to change performance and power patterns on both large
> >     and small machines.
> >
> >
> > Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
> > deferred kwork does appear to mismatch somewhat with what the scheduler
> > needs or at least how it's worked in the past.  How much impact will
> > that mismatch have?  Unclear.
> >
> >      >      >>> Secondly, it probably demands separate workers (not
> >     optional),
> >      >     otherwise
> >      >      >>> behaviour of shared workqueues has either the potential
> to
> >      >     explode number
> >      >      >>> kernel threads anyway, or add latency.
> >      >      >>>
> >      >      >>
> >      >      >> Right now the system_unbound_wq is used which does have a
> >     limit
> >      >     on the
> >      >      >> number of threads, right? I do have a FIXME to allow a
> >     worker to be
> >      >      >> passed in similar to TDR.
> >      >      >>
> >      >      >> WRT latency, the 1:1 ratio could actually have lower
> >     latency
> >      >     as 2 GPU
> >      >      >> schedulers can be pushing jobs into the backend /
> cleaning up
> >      >     jobs in
> >      >      >> parallel.
> >      >      >>
> >      >      >
> >      >      > Thought of one more point here on why in Xe we
> >     absolutely want
> >      >     a 1 to
> >      >      > 1 ratio between entity and scheduler - the way we implement
> >      >     timeslicing
> >      >      > for preempt fences.
> >      >      >
> >      >      > Let me try to explain.
> >      >      >
> >      >      > Preempt fences are implemented via the generic messaging
> >      >     interface [1]
> >      >      > with suspend / resume messages. If a suspend message is
> >     received too
> >      >      > soon after calling resume (this is per entity) we simply
> >     sleep in the
> >      >      > suspend call thus giving the entity a timeslice. This
> >     completely
> >      >     falls
> >      >      > apart with a many to 1 relationship as now an entity
> >     waiting for a
> >      >      > timeslice blocks the other entities. Could we work around
> >     this,
> >      >     sure but
> >      >      > just another bunch of code we'd have to add in Xe. Being able to
> >      >     freely sleep
> >      >      > in backend without affecting other entities is really,
> really
> >      >     nice IMO
> >      >      > and I bet Xe isn't the only driver that is going to feel
> >     this way.
> >      >      >
> >      >      > Last thing I'll say regardless of how anyone feels about
> >     Xe using
> >      >     a 1 to
> >      >      > 1 relationship, this patch IMO makes sense as I hope we can
> all
> >      >     agree a
> >      >      > workqueue scales better than kthreads.
> >      >
> >      >     I don't know for sure what will scale better and for what use
> >     case,
> >      >     combination of CPU cores vs number of GPU engines to keep
> >     busy vs other
> >      >     system activity. But I wager someone is bound to ask for some
> >      >     numbers to
> >      >     make sure proposal is not negatively affecting any other
> drivers.
> >      >
> >      >
> >      > Then let them ask.  Waving your hands vaguely in the direction of
> >     the
> >      > rest of DRM and saying "Uh, someone (not me) might object" is
> >     profoundly
> >      > unhelpful.  Sure, someone might.  That's why it's on dri-devel.
> >     If you
> >      > think there's someone in particular who might have a useful
> >     opinion on
> >      > this, throw them in the CC so they don't miss the e-mail thread.
> >      >
> >      > Or are you asking for numbers?  If so, what numbers are you
> >     asking for?
> >
> >     It was a heads up to the Xe team in case people weren't appreciating
> >     how
> >     the proposed change has the potential influence power and performance
> >     across the board. And nothing in the follow up discussion made me
> think
> >     it was considered so I don't think it was redundant to raise it.
> >
> >     In my experience it is typical that such core changes come with some
> >     numbers. Which in the case of the drm scheduler is tricky and probably
> >     requires explicitly asking everyone to test (rather than count on
> >     "don't
> >     miss the email thread"). Real products can fail to ship due ten mW
> here
> >     or there. Like suddenly an extra core prevented from getting into
> deep
> >     sleep.
> >
> >     If that was "profoundly unhelpful" so be it.
> >
> >
> > With your above explanation, it makes more sense what you're asking.
> > It's still not something Matt is likely to be able to provide on his
> > own.  We need to tag some other folks and ask them to test it out.  We
> > could play around a bit with it on Xe but it's not exactly production
> > grade yet and is going to hit this differently from most.  Likely
> > candidates are probably AMD and Freedreno.
>
> Whoever is setup to check out power and performance would be good to
> give it a spin, yes.
>
> PS. I don't think I was asking Matt to test with other devices. To start
> with I think Xe is a team effort. I was asking for more background on
> the design decision since patch 4/20 does not say anything on that
> angle, nor later in the thread it was IMO sufficiently addressed.
>
> >      > Also, If we're talking about a design that might paint us into an
> >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> >     We're
> >      > talking about switching which kernel threading/task mechanism to
> >     use for
> >      > what's really a very generic problem.  The core Xe design works
> >     without
> >      > this patch (just with more kthreads).  If we land this patch or
> >      > something like it and get it wrong and it causes a performance
> >     problem
> >      > for someone down the line, we can revisit it.
> >
> >     For some definition of "it works" - I really wouldn't suggest
> >     shipping a
> >     kthread per user context at any point.
> >
> >
> > You have yet to elaborate on why. What resources is it consuming that's
> > going to be a problem? Are you anticipating CPU affinity problems? Or
> > does it just seem wasteful?
>
> Well I don't know, commit message says the approach does not scale. :)
>
> > I think I largely agree that it's probably unnecessary/wasteful but
> > reducing the number of kthreads seems like a tractable problem to solve
> > regardless of where we put the gpu_scheduler object.  Is this the right
> > solution?  Maybe not.  It was also proposed at one point that we could
> > split the scheduler into two pieces: A scheduler which owns the kthread,
> > and a back-end which targets some HW ring thing where you can have
> > multiple back-ends per scheduler.  That's certainly more invasive from a
> > DRM scheduler internal API PoV but would solve the kthread problem in a
> > way that's more similar to what we have now.
> >
> >      >     In any case that's a low level question caused by the high
> >     level design
> >      >     decision. So I'd think first focus on the high level - which
> >     is the 1:1
> >      >     mapping of entity to scheduler instance proposal.
> >      >
> >      >     Fundamentally it will be up to the DRM maintainers and the
> >     community to
> >      >     bless your approach. And it is important to stress 1:1 is
> about
> >      >     userspace contexts, so I believe unlike any other current
> >     scheduler
> >      >     user. And also important to stress this effectively does not
> >     make Xe
> >      >     _really_ use the scheduler that much.
> >      >
> >      >
> >      > I don't think this makes Xe nearly as much of a one-off as you
> >     think it
> >      > does.  I've already told the Asahi team working on Apple M1/2
> >     hardware
> >      > to do it this way and it seems to be a pretty good mapping for
> >     them. I
> >      > believe this is roughly the plan for nouveau as well.  It's not
> >     the way
> >      > it currently works for anyone because most other groups aren't
> >     doing FW
> >      > scheduling yet.  In the world of FW scheduling and hardware
> >     designed to
> >      > support userspace direct-to-FW submit, I think the design makes
> >     perfect
> >      > sense (see below) and I expect we'll see more drivers move in this
> >      > direction as those drivers evolve.  (AMD is doing some customish
> >     thing
> >      > for how with gpu_scheduler on the front-end somehow. I've not dug
> >     into
> >      > those details.)
> >      >
> >      >     I can only offer my opinion, which is that the two options
> >     mentioned in
> >      >     this thread (either improve drm scheduler to cope with what is
> >      >     required,
> >      >     or split up the code so you can use just the parts of
> >     drm_sched which
> >      >     you want - which is frontend dependency tracking) shouldn't
> be so
> >      >     readily dismissed, given how I think the idea was for the new
> >     driver to
> >      >     work less in a silo and more in the community (not do kludges
> to
> >      >     workaround stuff because it is thought to be too hard to
> >     improve common
> >      >     code), but fundamentally, "goto previous paragraph" as far as
> I am
> >      >     concerned.
> >      >
> >      >
> >      > Meta comment:  It appears as if you're falling into the standard
> >     i915
> >      > team trap of having an internal discussion about what the
> community
> >      > discussion might look like instead of actually having the
> community
> >      > discussion.  If you are seriously concerned about interactions
> with
> >      > other drivers or whether or setting common direction, the right
> >     way to
> >      > do that is to break a patch or two out into a separate RFC series
> >     and
> >      > tag a handful of driver maintainers.  Trying to predict the
> >     questions
> >      > other people might ask is pointless. Cc them and ask for their
> >     input
> >      > instead.
> >
> >     I don't follow you here. It's not an internal discussion - I am
> raising
> >     my concerns on the design publicly. Am I supposed to write a patch to
> >     show something, but not allowed to comment on a RFC series?
> >
> >
> > I may have misread your tone a bit.  It felt a bit like too many
> > discussions I've had in the past where people are trying to predict what
> > others will say instead of just asking them.  Reading it again, I was
> > probably jumping to conclusions a bit.  Sorry about that.
>
> Okay no problem, thanks. In any case we don't have to keep discussing
> it, since I wrote one or two emails ago it is fundamentally on the
> maintainers and community to ack the approach. I only felt like RFC did
> not explain the potential downsides sufficiently so I wanted to probe
> that area a bit.
>
> >     It is "drm/sched: Convert drm scheduler to use a work queue rather
> than
> >     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> >
> >
> > Yeah, it probably should have.  I think that's mostly what I've been
> > trying to say.
> >
> >      >
> >      >     Regards,
> >      >
> >      >     Tvrtko
> >      >
> >      >     P.S. And as a related side note, there are more areas where
> >     drm_sched
> >      >     could be improved, like for instance priority handling.
> >      >     Take a look at msm_submitqueue_create /
> >     msm_gpu_convert_priority /
> >      >     get_sched_entity to see how msm works around the drm_sched
> >     hardcoded
> >      >     limit of available priority levels, in order to avoid having
> >     to leave a
> >      >     hw capability unused. I suspect msm would be happier if they
> >     could have
> >      >     all priority levels equal in terms of whether they apply only
> >     at the
> >      >     frontend level or completely throughout the pipeline.
> >      >
> >      >      > [1]
> >      >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >      >      >
> >      >      >>> What would be interesting to learn is whether the option
> of
> >      >     refactoring
> >      >      >>> drm_sched to deal with out of order completion was
> >     considered
> >      >     and what were
> >      >      >>> the conclusions.
> >      >      >>>
> >      >      >>
> >      >      >> I coded this up a while back when trying to convert the
> >     i915 to
> >      >     the DRM
> >      >      >> scheduler; it isn't all that hard either. The free flow
> >     control
> >      >     on the
> >      >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE)
> is
> >      >     really what
> >      >      >> sold me on this design.
> >      >
> >      >
> >      > You're not the only one to suggest supporting out-of-order
> >     completion.
> >      > However, it's tricky and breaks a lot of internal assumptions of
> the
> >      > scheduler. It also reduces functionality a bit because it can no
> >     longer
> >      > automatically rate-limit HW/FW queues which are often
> >     fixed-size.  (Ok,
> >      > yes, it probably could but it becomes a substantially harder
> >     problem.)
> >      >
> >      > It also seems like a worse mapping to me.  The goal here is to
> turn
> >      > submissions on a userspace-facing engine/queue into submissions
> >     to a FW
> >      > queue, sorting out any dma_fence dependencies.  Matt's
> >      > description saying this is a 1:1 mapping between sched/entity
> >     doesn't
> >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> >     mapping?
> >      > Why is that better?
> >
> >     As I have stated before, what I think would fit well for Xe is
> one
> >     drm_scheduler per engine class. In specific terms on our current
> >     hardware, one drm scheduler instance for render, compute, blitter,
> >     video
> >     and video enhance. Userspace contexts remain scheduler entities.
> >
> >
> > And this is where we fairly strongly disagree.  More in a bit.
> >
> >     That way you avoid the whole kthread/kworker story and you have it
> >     actually use the entity picking code in the scheduler, which may be
> >     useful when the backend is congested.
> >
> >
> > What back-end congestion are you referring to here?  Running out of FW
> > queue IDs?  Something else?
>
> CT channel, number of context ids.
>
> >
> >     Yes you have to solve the out of order problem so in my mind that is
> >     something to discuss. What the problem actually is (just TDR?), how
> >     tricky and why etc.
> >
> >     And yes you lose the handy LRCA ring buffer size management so you'd
> >     have to make those entities not runnable in some other way.
> >
> >     Regarding the argument you raise below - would any of that make the
> >     frontend / backend separation worse and why? Do you think it is less
> >     natural? If neither is true then all that remains is that it appears extra
> >     work to support out of order completion of entities has been
> discounted
> >     in favour of an easy but IMO inelegant option.
> >
> >
> > Broadly speaking, the kernel needs to stop thinking about GPU scheduling
> > in terms of scheduling jobs and start thinking in terms of scheduling
> > contexts/engines.  There is still some need for scheduling individual
> > jobs but that is only for the purpose of delaying them as needed to
> > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > get shoved onto the context/engine queue and from there the kernel only
> > really manages whole contexts/engines.  This is a major architectural
> > shift, entirely different from the way i915 scheduling works.  It's also
> > different from the historical usage of DRM scheduler which I think is
> > why this all looks a bit funny.
> >
> > To justify this architectural shift, let's look at where we're headed.
> > In the glorious future...
> >
> >   1. Userspace submits directly to firmware queues.  The kernel has no
> > visibility whatsoever into individual jobs.  At most it can pause/resume
> > FW contexts as needed to handle eviction and memory management.
> >
> >   2. Because of 1, apart from handing out the FW queue IDs at the
> > beginning, the kernel can't really juggle them that much.  Depending on
> > FW design, it may be able to pause a client, give its IDs to another,
> > and then resume it later when IDs free up.  What it's not doing is
> > juggling IDs on a job-by-job basis like i915 currently is.
> >
> >   3. Long-running compute jobs may not complete for days.  This means
> > that memory management needs to happen in terms of pause/resume of
> > entire contexts/engines using the memory rather than based on waiting
> > for individual jobs to complete or pausing individual jobs until the
> > memory is available.
> >
> >   4. Synchronization happens via userspace memory fences (UMF) and the
> > kernel is mostly unaware of most dependencies and when a context/engine
> > is or is not runnable.  Instead, it keeps as many of them minimally
> > active (memory is available, even if it's in system RAM) as possible and
> > lets the FW sort out dependencies.  (There may need to be some facility
> > for sleeping a context until a memory change similar to futex() or
> > poll() for userspace threads.  There are some details TBD.)
> >
> > Are there potential problems that will need to be solved here?  Yes.  Is
> > it a good design?  Well, Microsoft has been living in this future for
> > half a decade or better and it's working quite well for them.  It's also
> > the way all modern game consoles work.  It really is just Linux that's
> > stuck with the same old job model we've had since the monumental shift
> > to DRI2.
> >
> > To that end, one of the core goals of the Xe project was to make the
> > driver internally behave as close to the above model as possible while
> > keeping the old-school job model as a very thin layer on top.  As the
> > broader ecosystem problems (window-system support for UMF, for instance)
> > are solved, that layer can be peeled back.  The core driver will already
> > be ready for it.
> >
> > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > and stuff jobs into their respective context/engine queue once they're
> > ready.  All the actual scheduling happens in firmware and any scheduling
> > the kernel does to deal with contention, oversubscriptions, too many
> > contexts, etc. is between contexts/engines, not individual jobs.  Sure,
> > the individual job visibility is nice, but if we design around it, we'll
> > never get to the glorious future.
> >
> > I really need to turn the above (with a bit more detail) into a blog
> > post.... Maybe I'll do that this week.
> >
> > In any case, I hope that provides more insight into why Xe is designed
> > the way it is and why I'm pushing back so hard on trying to make it more
> > of a "classic" driver as far as scheduling is concerned.  Are there
> > potential problems here?  Yes, that's why Xe has been labeled a
> > prototype.  Are such radical changes necessary to get to said glorious
> > future?  Yes, I think they are.  Will it be worth it?  I believe so.
>
> Right, that's all solid I think. My takeaway is that frontend priority
> sorting and that stuff isn't needed, and that is okay. And that there are
> multiple options to maybe improve the drm scheduler, like the aforementioned
> making it deal with out-of-order completion, or splitting it into functional
> components, or splitting the frontend/backend as you suggested. For most of
> them the cost vs benefit is not completely clear, nor is it clear how much
> effort was invested to look into them.
>
> One thing I missed from this explanation is how drm_scheduler per engine
> class interferes with the high level concepts. And I did not manage to
> pick up on what exactly the TDR problem is in that case. Maybe the two
> are one and the same.
>
> Bottom line is I still have the concern that the conversion to kworkers
> has an opportunity to regress. Possibly more so for some Xe use cases
> than for other vendors, since they would still be using per physical
> engine / queue scheduler instances.
>
> And to put my money where my mouth is, I will try to put testing Xe
> inside the full blown ChromeOS environment into my team's plans. It
> would probably also be beneficial if the Xe team could take a look at
> real world behaviour of the extreme transcode use cases too, if the
> stack is ready for that. It would be better to know earlier rather than
> later if there is a fundamental issue.
>
> For the patch at hand, and the cover letter, it certainly feels it would
> benefit to record the past design discussions with the AMD folks, to
> explicitly copy in other drivers' maintainers, and to record the
> theoretical pros and cons of threads vs unbound workers as I have tried
> to highlight them.
>
> Regards,
>
> Tvrtko
>



* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-11 22:18                           ` Jason Ekstrand
  0 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-11 22:18 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel


On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
tvrtko.ursulin@linux.intel.com> wrote:

>
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > <tvrtko.ursulin@linux.intel.com>
>
> > wrote:
> >
> >
> >
> >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> >
> >     [snip]
> >
> >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> created
> >      >     contexts (per
> >      >      >>> context _and_ engine) and drm_sched. I am not sure
> avoiding
> >      >     invasive changes
> >      >      >>> to the shared code is in the spirit of the overall idea
> >     and instead
> >      >      >>> opportunity should be used to look at way to
> >     refactor/improve
> >      >     drm_sched.
> >      >
> >      >
> >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> >     all or
> >      > really needs to drive a re-factor.  (More on that later.)
> >     There's only
> >      > one real issue which is that it fires off potentially a lot of
> >     kthreads.
> >      > Even that's not that bad given that kthreads are pretty light and
> >     you're
> >      > not likely to have more kthreads than userspace threads which are
> >     much
> >      > heavier.  Not ideal, but not the end of the world either.
> >     Definitely
> >      > something we can/should optimize but if we went through with Xe
> >     without
> >      > this patch, it would probably be mostly ok.
> >      >
> >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >      >      >>
> >      >      >> I'm not really prepared to make large changes to DRM
> >     scheduler
> >      >     at the
> >      >      >> moment for Xe as they are not really required nor does
> Boris
> >      >     seem they
> >      >      >> will be required for his work either. I am interested to
> see
> >      >     what Boris
> >      >      >> comes up with.
> >      >      >>
> >      >      >>> Even on the low level, the idea to replace drm_sched
> threads
> >      >     with workers
> >      >      >>> has a few problems.
> >      >      >>>
> >      >      >>> To start with, the pattern of:
> >      >      >>>
> >      >      >>>    while (not_stopped) {
> >      >      >>>     keep picking jobs
> >      >      >>>    }
> >      >      >>>
> >      >      >>> Feels fundamentally in disagreement with workers (while
> >      >     obviously fits
> >      >      >>> perfectly with the current kthread design).
> >      >      >>
> >      >      >> The while loop breaks and the worker exits if no jobs are
> ready.
> >      >
> >      >
> >      > I'm not very familiar with workqueues. What are you saying would
> fit
> >      > better? One scheduling job per work item rather than one big work
> >     item
> >      > which handles all available jobs?
> >
> >     Yes and no, it indeed IMO does not fit to have a work item which is
> >     potentially unbound in runtime. But it is a bit of a moot conceptual
> >     mismatch because it is a worst case / theoretical, and I think due
> >     to more fundamental concerns.
> >
> >     If we have to go back to the low level side of things, I've picked
> this
> >     random spot to consolidate what I have already mentioned and perhaps
> >     expand.
> >
> >     To start with, let me pull out some thoughts from workqueue.rst:
> >
> >     """
> >     Generally, work items are not expected to hog a CPU and consume many
> >     cycles. That means maintaining just enough concurrency to prevent
> work
> >     processing from stalling should be optimal.
> >     """
> >
> >     For unbound queues:
> >     """
> >     The responsibility of regulating concurrency level is on the users.
> >     """
> >
> >     Given the unbound queues will be spawned on demand to service all
> >     queued
> >     work items (more interesting when mixing up with the
> >     system_unbound_wq),
> >     in the proposed design the number of instantiated worker threads does
> >     not correspond to the number of user threads (as you have elsewhere
> >     stated), but pessimistically to the number of active user contexts.
> >
> >
> > > Those are pretty much the same in practice.  Rather, the number of
> > > user threads is typically an upper bound on the number of contexts.
> > > Yes, a single user
> > thread could have a bunch of contexts but basically nothing does that
> > except IGT.  In real-world usage, it's at most one context per user
> thread.
>
> Typically is the key here. But I am not sure it is good enough. Consider
> this example - Intel Flex 170:
>
>   * Delivers up to 36 streams 1080p60 transcode throughput per card.
>   * When scaled to 10 cards in a 4U server configuration, it can support
> up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
>

I had a feeling it was going to be media.... 😅


> One transcode stream from my experience typically is 3-4 GPU contexts
> (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> 60fps = 8640 jobs submitted and completed per second.
>
> > 144 active contexts in the proposed scheme possibly means 144
> kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> don't think the pools would scale down given all are constantly pinged
> at 60fps.)
>
> And then each of 144 threads goes to grab the single GuC CT mutex. First
> threads are being made schedulable, then put to sleep as mutex
> contention is hit, then woken again as mutexes are getting released,
> rinse, repeat.
>

Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
back-end yet but I was under the impression that most run_job() would be
just shoving another packet into a ring buffer.  If we have to send the GuC
a message on the control ring every single time we submit a job, that's
pretty horrible.

--Jason


> (And yes this backend contention is there regardless of 1:1:1, it would
> require a different re-design to solve that. But it is just a question
> whether there are 144 contending threads, or just 6 with the thread per
> engine class scheme.)
>
> Then multiply all by 10 for a 4U server use case and you get 1440 worker
> kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> > Just so they can grab a timeslice and maybe contend on a mutex as the
> next step.
>
> This example is where it would hurt on large systems. Imagine only an
> even wider media transcode card...
>
> Second example is only a single engine class used (3d desktop?) but with
> a bunch of not-runnable jobs queued and waiting on a fence to signal.
> > Implicit or explicit dependencies don't matter. Then the fence signals
> and call backs run. N work items get scheduled, but they all submit to
> the same HW engine. So we end up with:
>
>          /-- wi1 --\
>         / ..     .. \
>   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
>         \ ..    ..  /
>          \-- wiN --/
>
>
> All that we have achieved is waking up N CPUs to contend on the same
> lock and effectively insert the job into the same single HW queue. I
> don't see any positives there.
>
> This example I think can particularly hurt small / low power devices
> because of needless waking up of many cores for no benefit. Granted, I
> > don't have a good feel for how common this pattern is in practice.
>
> >
> >     That
> >     is the number which drives the maximum number of not-runnable jobs
> that
> >     can become runnable at once, and hence spawn that many work items,
> and
> >     in turn unbound worker threads.
> >
> >     Several problems there.
> >
> >     It is fundamentally pointless to have potentially that many more
> >     threads
> >     than the number of CPU cores - it simply creates a scheduling storm.
> >
> >     Unbound workers have no CPU / cache locality either and no connection
> >     with the CPU scheduler to optimize scheduling patterns. This may
> matter
> >     either on large systems or on small ones. Whereas the current design
> >     allows for the scheduler to notice a userspace CPU thread keeps
> >     waking up the same drm scheduler kernel thread, and so it can keep
> >     them on the same CPU, the unbound workers lose that ability and so
> >     a 2nd CPU might be getting woken up from low-power sleep for every
> >     submission.
> >
> >     Hence, apart from being a bit of an impedance mismatch, the proposal
> >     has the potential to change performance and power patterns on both
> >     large and small machines.
> >
> >
> > Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
> > deferred kwork does appear to mismatch somewhat with what the scheduler
> > needs or at least how it's worked in the past.  How much impact will
> > that mismatch have?  Unclear.
> >
> >      >      >>> Secondly, it probably demands separate workers (not
> >     optional),
> >      >     otherwise
> >      >      >>> behaviour of shared workqueues has either the potential
> to
> >      >     explode the number of
> >      >      >>> kernel threads anyway, or add latency.
> >      >      >>>
> >      >      >>
> >      >      >> Right now the system_unbound_wq is used which does have a
> >     limit
> >      >     on the
> >      >      >> number of threads, right? I do have a FIXME to allow a
> >     worker to be
> >      >      >> passed in similar to TDR.
> >      >      >>
> >      >      >> WRT latency, the 1:1 ratio could actually have lower
> >     latency
> >      >     as 2 GPU
> >      >      >> schedulers can be pushing jobs into the backend /
> cleaning up
> >      >     jobs in
> >      >      >> parallel.
> >      >      >>
> >      >      >
> >      >      > Thought of one more point here on why in Xe we
> >     absolutely want
> >      >     a 1 to
> >      >      > 1 ratio between entity and scheduler - the way we implement
> >      >     timeslicing
> >      >      > for preempt fences.
> >      >      >
> >      >      > Let me try to explain.
> >      >      >
> >      >      > Preempt fences are implemented via the generic messaging
> >      >     interface [1]
> >      >      > with suspend / resume messages. If a suspend message is
> >      >      > received too soon after calling resume (this is per entity)
> >      >      > we simply
> >     sleep in the
> >      >      > suspend call thus giving the entity a timeslice. This
> >     completely
> >      >     falls
> >      >      > apart with a many to 1 relationship as now an entity
> >     waiting for a
> >      >      > timeslice blocks the other entities. Could we work around
> >     this,
> >      >     sure but
> >      >      > just another bunch of code we'd have to add in Xe. Being able to
> >      >     freely sleep
> >      >      > in the backend without affecting other entities is really,
> really
> >      >     nice IMO
> >      >      > and I bet Xe isn't the only driver that is going to feel
> >     this way.
> >      >      >
> >      >      > Last thing I'll say: regardless of how anyone feels about
> >     Xe using
> >      >     a 1 to
> >      >      > 1 relationship, this patch IMO makes sense as I hope we can
> all
> >      >     agree a
> >      >      > workqueue scales better than kthreads.
> >      >
> >      >     I don't know for sure what will scale better and for what use
> >     case,
> >      >     combination of CPU cores vs number of GPU engines to keep
> >     busy vs other
> >      >     system activity. But I wager someone is bound to ask for some
> >      >     numbers to
> >      >     make sure the proposal is not negatively affecting any other
> drivers.
> >      >
> >      >
> >      > Then let them ask.  Waving your hands vaguely in the direction of
> >     the
> >      > rest of DRM and saying "Uh, someone (not me) might object" is
> >     profoundly
> >      > unhelpful.  Sure, someone might.  That's why it's on dri-devel.
> >     If you
> >      > think there's someone in particular who might have a useful
> >     opinion on
> >      > this, throw them in the CC so they don't miss the e-mail thread.
> >      >
> >      > Or are you asking for numbers?  If so, what numbers are you
> >     asking for?
> >
> >     It was a heads up to the Xe team in case people weren't appreciating
> >     how
> >     the proposed change has the potential to influence power and performance
> >     across the board. And nothing in the follow up discussion made me
> think
> >     it was considered so I don't think it was redundant to raise it.
> >
> >     In my experience it is typical that such core changes come with some
> >     numbers. Which is in case of drm scheduler is tricky and probably
> >     requires explicitly asking everyone to test (rather than count on
> >     "don't
> >     miss the email thread"). Real products can fail to ship due ten mW
> here
> >     or there. Like suddenly an extra core prevented from getting into
> deep
> >     sleep.
> >
> >     If that was "profoundly unhelpful" so be it.
> >
> >
> > With your above explanation, it makes more sense what you're asking.
> > It's still not something Matt is likely to be able to provide on his
> > own.  We need to tag some other folks and ask them to test it out.  We
> > could play around a bit with it on Xe but it's not exactly production
> > grade yet and is going to hit this differently from most.  Likely
> > candidates are probably AMD and Freedreno.
>
> > Whoever is set up to check out power and performance would be good to
> give it a spin, yes.
>
> PS. I don't think I was asking Matt to test with other devices. To start
> with I think Xe is a team effort. I was asking for more background on
> the design decision since patch 4/20 does not say anything on that
> > angle, nor was it IMO sufficiently addressed later in the thread.
>
> >      > Also, if we're talking about a design that might paint us into an
> >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> >     We're
> >      > talking about switching which kernel threading/task mechanism to
> >     use for
> >      > what's really a very generic problem.  The core Xe design works
> >     without
> >      > this patch (just with more kthreads).  If we land this patch or
> >      > something like it and get it wrong and it causes a performance
> >     problem
> >      > for someone down the line, we can revisit it.
> >
> >     For some definition of "it works" - I really wouldn't suggest
> >     shipping a
> >     kthread per user context at any point.
> >
> >
> > You have yet to elaborate on why. What resources is it consuming that's
> > going to be a problem? Are you anticipating CPU affinity problems? Or
> > does it just seem wasteful?
>
> Well I don't know, commit message says the approach does not scale. :)
>
> > I think I largely agree that it's probably unnecessary/wasteful but
> > reducing the number of kthreads seems like a tractable problem to solve
> > regardless of where we put the gpu_scheduler object.  Is this the right
> > solution?  Maybe not.  It was also proposed at one point that we could
> > split the scheduler into two pieces: A scheduler which owns the kthread,
> > and a back-end which targets some HW ring thing where you can have
> > multiple back-ends per scheduler.  That's certainly more invasive from a
> > DRM scheduler internal API PoV but would solve the kthread problem in a
> > way that's more similar to what we have now.
> >
> >      >     In any case that's a low level question caused by the high
> >     level design
> >      >     decision. So I'd think first focus on the high level - which
> >     is the 1:1
> >      >     mapping of entity to scheduler instance proposal.
> >      >
> >      >     Fundamentally it will be up to the DRM maintainers and the
> >     community to
> >      >     bless your approach. And it is important to stress 1:1 is
> about
> >      >     userspace contexts, so I believe unlike any other current
> >     scheduler
> >      >     user. And also important to stress this effectively does not
> >     make Xe
> >      >     _really_ use the scheduler that much.
> >      >
> >      >
> >      > I don't think this makes Xe nearly as much of a one-off as you
> >     think it
> >      > does.  I've already told the Asahi team working on Apple M1/2
> >     hardware
> >      > to do it this way and it seems to be a pretty good mapping for
> >     them. I
> >      > believe this is roughly the plan for nouveau as well.  It's not
> >     the way
> >      > it currently works for anyone because most other groups aren't
> >     doing FW
> >      > scheduling yet.  In the world of FW scheduling and hardware
> >     designed to
> >      > support userspace direct-to-FW submit, I think the design makes
> >     perfect
> >      > sense (see below) and I expect we'll see more drivers move in this
> >      > direction as those drivers evolve.  (AMD is doing some customish
> >     thing
> >      > with gpu_scheduler on the front-end somehow. I've not dug
> >     into
> >      > those details.)
> >      >
> >      >     I can only offer my opinion, which is that the two options
> >     mentioned in
> >      >     this thread (either improve drm scheduler to cope with what is
> >      >     required,
> >      >     or split up the code so you can use just the parts of
> >     drm_sched which
> >      >     you want - which is frontend dependency tracking) shouldn't
> be so
> >      >     readily dismissed, given how I think the idea was for the new
> >     driver to
> >      >     work less in a silo and more in the community (not do kludges
> to
> >      >     workaround stuff because it is thought to be too hard to
> >     improve common
> >      >     code), but fundamentally, "goto previous paragraph" as far as
> >      >     I am concerned.
> >      >
> >      >
> >      > Meta comment:  It appears as if you're falling into the standard
> >     i915
> >      > team trap of having an internal discussion about what the
> community
> >      > discussion might look like instead of actually having the
> community
> >      > discussion.  If you are seriously concerned about interactions
> with
> >      > other drivers or whether we're setting common direction, the right
> >     way to
> >      > do that is to break a patch or two out into a separate RFC series
> >     and
> >      > tag a handful of driver maintainers.  Trying to predict the
> >     questions
> >      > other people might ask is pointless. Cc them and ask for their
> >     input
> >      > instead.
> >
> >     I don't follow you here. It's not an internal discussion - I am
> raising
> >     my concerns on the design publicly. Am I supposed to write a patch to
> >     show something before I am allowed to comment on an RFC series?
> >
> >
> > I may have misread your tone a bit.  It felt a bit like too many
> > discussions I've had in the past where people are trying to predict what
> > others will say instead of just asking them.  Reading it again, I was
> > probably jumping to conclusions a bit.  Sorry about that.
>
> Okay no problem, thanks. In any case we don't have to keep discussing
> > it; as I wrote one or two emails ago, it is fundamentally on the
> maintainers and community to ack the approach. I only felt like RFC did
> not explain the potential downsides sufficiently so I wanted to probe
> that area a bit.
>
> >     It is "drm/sched: Convert drm scheduler to use a work queue rather
> than
> >     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> >
> >
> > Yeah, it probably should have.  I think that's mostly what I've been
> > trying to say.
> >
> >      >
> >      >     Regards,
> >      >
> >      >     Tvrtko
> >      >
> >      >     P.S. And as a related side note, there are more areas where
> >     drm_sched
> >      >     could be improved, like for instance priority handling.
> >      >     Take a look at msm_submitqueue_create /
> >     msm_gpu_convert_priority /
> >      >     get_sched_entity to see how msm works around the drm_sched
> >     hardcoded
> >      >     limit of available priority levels, in order to avoid having
> >     to leave a
> >      >     hw capability unused. I suspect msm would be happier if they
> >     could have
> >      >     all priority levels equal in terms of whether they apply only
> >     at the
> >      >     frontend level or completely throughout the pipeline.
> >      >
> >      >      > [1]
> >      >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >      >      >
> >      >      >>> What would be interesting to learn is whether the option
> of
> >      >     refactoring
> >      >      >>> drm_sched to deal with out of order completion was
> >     considered
> >      >     and what were
> >      >      >>> the conclusions.
> >      >      >>>
> >      >      >>
> >      >      >> I coded this up a while back when trying to convert the
> >     i915 to
> >      >     the DRM
> >      >      >> scheduler; it isn't all that hard either. The free flow
> >     control
> >      >     on the
> >      >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE)
> is
> >      >     really what
> >      >      >> sold me on this design.
> >      >
> >      >
> >      > You're not the only one to suggest supporting out-of-order
> >     completion.
> >      > However, it's tricky and breaks a lot of internal assumptions of
> the
> >      > scheduler. It also reduces functionality a bit because it can no
> >     longer
> >      > automatically rate-limit HW/FW queues which are often
> >     fixed-size.  (Ok,
> >      > yes, it probably could but it becomes a substantially harder
> >     problem.)
> >      >
> >      > It also seems like a worse mapping to me.  The goal here is to
> turn
> >      > submissions on a userspace-facing engine/queue into submissions
> >     to a FW
> >      > queue submissions, sorting out any dma_fence dependencies.  Matt's
> >      > description of saying this is a 1:1 mapping between sched/entity
> >     doesn't
> >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> >     mapping?
> >      > Why is that better?
> >
> >     As I have stated before, what I think would fit well for Xe is
> one
> >     drm_scheduler per engine class. In specific terms on our current
> >     hardware, one drm scheduler instance for render, compute, blitter,
> >     video
> >     and video enhance. Userspace contexts remain scheduler entities.
> >
> >
> > And this is where we fairly strongly disagree.  More in a bit.
> >
> >     That way you avoid the whole kthread/kworker story and you have it
> >     actually use the entity picking code in the scheduler, which may be
> >     useful when the backend is congested.
> >
> >
> > What back-end congestion are you referring to here?  Running out of FW
> > queue IDs?  Something else?
>
> CT channel, number of context ids.
>
> >
> >     Yes you have to solve the out of order problem so in my mind that is
> >     something to discuss. What the problem actually is (just TDR?), how
> >     tricky and why etc.
> >
> >     And yes you lose the handy LRCA ring buffer size management so you'd
> >     have to make those entities not runnable in some other way.
> >
> >     Regarding the argument you raise below - would any of that make the
> >     frontend / backend separation worse and why? Do you think it is less
> >     natural? If neither is true then all that remains is that it appears extra
> >     work to support out of order completion of entities has been
> discounted
> >     in favour of an easy but IMO inelegant option.
> >
> >
> > Broadly speaking, the kernel needs to stop thinking about GPU scheduling
> > in terms of scheduling jobs and start thinking in terms of scheduling
> > contexts/engines.  There is still some need for scheduling individual
> > jobs but that is only for the purpose of delaying them as needed to
> > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > get shoved onto the context/engine queue and from there the kernel only
> > really manages whole contexts/engines.  This is a major architectural
> > shift, entirely different from the way i915 scheduling works.  It's also
> > different from the historical usage of DRM scheduler which I think is
> > why this all looks a bit funny.
> >
> > To justify this architectural shift, let's look at where we're headed.
> > In the glorious future...
> >
> >   1. Userspace submits directly to firmware queues.  The kernel has no
> > visibility whatsoever into individual jobs.  At most it can pause/resume
> > FW contexts as needed to handle eviction and memory management.
> >
> >   2. Because of 1, apart from handing out the FW queue IDs at the
> > beginning, the kernel can't really juggle them that much.  Depending on
> > FW design, it may be able to pause a client, give its IDs to another,
> > and then resume it later when IDs free up.  What it's not doing is
> > juggling IDs on a job-by-job basis like i915 currently is.
> >
> >   3. Long-running compute jobs may not complete for days.  This means
> > that memory management needs to happen in terms of pause/resume of
> > entire contexts/engines using the memory rather than based on waiting
> > for individual jobs to complete or pausing individual jobs until the
> > memory is available.
> >
> >   4. Synchronization happens via userspace memory fences (UMF) and the
> > kernel is mostly unaware of most dependencies and when a context/engine
> > is or is not runnable.  Instead, it keeps as many of them minimally
> > active (memory is available, even if it's in system RAM) as possible and
> > lets the FW sort out dependencies.  (There may need to be some facility
> > for sleeping a context until a memory change similar to futex() or
> > poll() for userspace threads.  There are some details TBD.)
> >
> > Are there potential problems that will need to be solved here?  Yes.  Is
> > it a good design?  Well, Microsoft has been living in this future for
> > half a decade or better and it's working quite well for them.  It's also
> > the way all modern game consoles work.  It really is just Linux that's
> > stuck with the same old job model we've had since the monumental shift
> > to DRI2.
> >
> > To that end, one of the core goals of the Xe project was to make the
> > driver internally behave as close to the above model as possible while
> > keeping the old-school job model as a very thin layer on top.  As the
> > broader ecosystem problems (window-system support for UMF, for instance)
> > are solved, that layer can be peeled back.  The core driver will already
> > be ready for it.
> >
> > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > and stuff jobs into their respective context/engine queue once they're
> > ready.  All the actual scheduling happens in firmware and any scheduling
> > the kernel does to deal with contention, oversubscriptions, too many
> > contexts, etc. is between contexts/engines, not individual jobs.  Sure,
> > the individual job visibility is nice, but if we design around it, we'll
> > never get to the glorious future.
> >
> > I really need to turn the above (with a bit more detail) into a blog
> > post.... Maybe I'll do that this week.
> >
> > In any case, I hope that provides more insight into why Xe is designed
> > the way it is and why I'm pushing back so hard on trying to make it more
> > of a "classic" driver as far as scheduling is concerned.  Are there
> > potential problems here?  Yes, that's why Xe has been labeled a
> > prototype.  Are such radical changes necessary to get to said glorious
> > future?  Yes, I think they are.  Will it be worth it?  I believe so.
>
> Right, that's all solid I think. My takeaway is that frontend priority
> sorting and that stuff isn't needed and that is okay. And that there are
> multiple options to maybe improve drm scheduler, like the aforementioned
> making it deal with out-of-order completion, or splitting it into
> functional components, or splitting frontend/backend as you suggested.
> For most of them the cost vs benefit is not completely clear, nor how
> much effort was invested to look into them.
>
> One thing I missed from this explanation is how a drm_scheduler per
> engine class interferes with the high-level concepts. And I did not
> manage to pick up on what exactly the TDR problem is in that case.
> Maybe the two are one and the same.
>
> Bottom line is I still have the concern that the conversion to kworkers
> has an opportunity to regress. Possibly more opportunity for some Xe use
> cases than for other vendors, since they would still be using per
> physical engine / queue scheduler instances.
>
> And to put my money where my mouth is, I will try to put testing Xe
> inside the full-blown ChromeOS environment into my team's plans. It
> would probably also be beneficial if the Xe team could take a look at
> the real-world behaviour of the extreme transcode use cases too, if the
> stack is ready for that. It would be better to know earlier rather than
> later if there is a fundamental issue.
>
> For the patch at hand, and the cover letter, it certainly feels it would
> benefit to record the past design discussions with the AMD folks, to
> explicitly copy in other drivers' maintainers, and to record the
> theoretical pros and cons of threads vs unbound workers as I have tried
> to highlight them.
>
> Regards,
>
> Tvrtko
>



* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 22:18                           ` Jason Ekstrand
@ 2023-01-11 22:31                             ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 22:31 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Tvrtko Ursulin, intel-gfx, dri-devel

On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> tvrtko.ursulin@linux.intel.com> wrote:
> 
> >
> > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > <tvrtko.ursulin@linux.intel.com>
> >
> > > wrote:
> > >
> > >
> > >
> > >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > >
> > >     [snip]
> > >
> > >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> > created
> > >      >     contexts (per
> > >      >      >>> context _and_ engine) and drm_sched. I am not sure
> > avoiding
> > >      >     invasive changes
> > >      >      >>> to the shared code is in the spirit of the overall idea
> > >     and instead
> > >      >      >>> opportunity should be used to look at way to
> > >     refactor/improve
> > >      >     drm_sched.
> > >      >
> > >      >
> > >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > >     all or
> > >      > really needs to drive a re-factor.  (More on that later.)
> > >     There's only
> > >      > one real issue which is that it fires off potentially a lot of
> > >     kthreads.
> > >      > Even that's not that bad given that kthreads are pretty light and
> > >     you're
> > >      > not likely to have more kthreads than userspace threads which are
> > >     much
> > >      > heavier.  Not ideal, but not the end of the world either.
> > >     Definitely
> > >      > something we can/should optimize but if we went through with Xe
> > >     without
> > >      > this patch, it would probably be mostly ok.
> > >      >
> > >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >      >      >>
> > >      >      >> I'm not really prepared to make large changes to DRM
> > >     scheduler
> > >      >     at the
> > >      >      >> moment for Xe as they are not really required nor does
> > Boris
> > >      >     seem they
> > >      >      >> will be required for his work either. I am interested to
> > see
> > >      >     what Boris
> > >      >      >> comes up with.
> > >      >      >>
> > >      >      >>> Even on the low level, the idea to replace drm_sched
> > threads
> > >      >     with workers
> > >      >      >>> has a few problems.
> > >      >      >>>
> > >      >      >>> To start with, the pattern of:
> > >      >      >>>
> > >      >      >>>    while (not_stopped) {
> > >      >      >>>     keep picking jobs
> > >      >      >>>    }
> > >      >      >>>
> > >      >      >>> Feels fundamentally in disagreement with workers (while
> > >      >     obviously fits
> > >      >      >>> perfectly with the current kthread design).
> > >      >      >>
> > >      >      >> The while loop breaks and the worker exits if no jobs are
> > ready.
> > >      >
> > >      >
> > >      > I'm not very familiar with workqueues. What are you saying would
> > fit
> > >      > better? One scheduling job per work item rather than one big work
> > >     item
> > >      > which handles all available jobs?
> > >
> > >     Yes and no, it indeed IMO does not fit to have a work item which is
> > >     potentially unbound in runtime. But it is a bit of a moot conceptual
> > >     mismatch because it is a worst case / theoretical, and I think due
> > >     to more fundamental concerns.
> > >
> > >     If we have to go back to the low level side of things, I've picked
> > this
> > >     random spot to consolidate what I have already mentioned and perhaps
> > >     expand.
> > >
> > >     To start with, let me pull out some thoughts from workqueue.rst:
> > >
> > >     """
> > >     Generally, work items are not expected to hog a CPU and consume many
> > >     cycles. That means maintaining just enough concurrency to prevent
> > work
> > >     processing from stalling should be optimal.
> > >     """
> > >
> > >     For unbound queues:
> > >     """
> > >     The responsibility of regulating concurrency level is on the users.
> > >     """
> > >
> > >     Given the unbound queues will be spawned on demand to service all
> > >     queued
> > >     work items (more interesting when mixing up with the
> > >     system_unbound_wq),
> > >     in the proposed design the number of instantiated worker threads does
> > >     not correspond to the number of user threads (as you have elsewhere
> > >     stated), but pessimistically to the number of active user contexts.
> > >
> > >
> > > Those are pretty much the same in practice.  Rather, the number of
> > > user threads is typically an upper bound on the number of contexts.
> > > Yes, a single user
> > > thread could have a bunch of contexts but basically nothing does that
> > > except IGT.  In real-world usage, it's at most one context per user
> > thread.
> >
> > Typically is the key here. But I am not sure it is good enough. Consider
> > this example - Intel Flex 170:
> >
> >   * Delivers up to 36 streams 1080p60 transcode throughput per card.
> >   * When scaled to 10 cards in a 4U server configuration, it can support
> > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> >
> 
> I had a feeling it was going to be media.... 😅
> 

Yeah, wondering if the media UMD can be rewritten to use fewer
xe_engines; it is already a massive rewrite for VM bind + no implicit
dependencies, so let's just pile on some more work?

> 
> > One transcode stream from my experience typically is 3-4 GPU contexts
> > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> > CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> > 60fps = 8640 jobs submitted and completed per second.
> >
> > 144 active contexts in the proposed scheme possibly means 144
> > kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> > don't think the pools would scale down given all are constantly pinged
> > at 60fps.)
> >
> > And then each of 144 threads goes to grab the single GuC CT mutex. First
> > threads are being made schedulable, then put to sleep as mutex
> > contention is hit, then woken again as mutexes are getting released,
> > rinse, repeat.
> >
> 
> Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
> back-end yet but I was under the impression that most run_job() would be
> just shoving another packet into a ring buffer.  If we have to send the GuC
> a message on the control ring every single time we submit a job, that's
> pretty horrible.
>

Run job writes the ring buffer and moves the tail as the first step (no
lock required). Next it needs to tell the GuC that the xe_engine LRC
tail has moved; this is done through a single Host-to-GuC channel, which
is a circular buffer, with writes to the channel protected by the mutex.
There are a few more nuances too, but in practice there is always space
in the channel, so the time the mutex needs to be held is really, really
small (check cached credits, write 3 dwords of payload, write 1 dword to
move the tail). I also believe mutexes in Linux are hybrid, where they
spin for a little bit before sleeping, and certainly if there is space
in the channel we shouldn't sleep on mutex contention.
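
To make the shape of that concrete, here is a rough pseudo-C sketch of
the flow just described. All of the names (guc_engine_run_job,
job_to_engine, write_ring, ct_has_credits, ct_write, job_fence) are
illustrative only, not the actual Xe functions:

    static struct dma_fence *guc_engine_run_job(struct drm_sched_job *job)
    {
            struct xe_engine *e = job_to_engine(job);

            /* Step 1: per-engine, no lock needed - copy the commands
             * into the LRC ring buffer and move the software tail. */
            write_ring(e, job);

            /* Step 2: notify the GuC that the LRC tail moved, via the
             * single shared Host-to-GuC circular buffer. Only this
             * short section takes the CT mutex. */
            mutex_lock(&e->guc->ct_lock);
            if (ct_has_credits(e->guc))     /* almost always true */
                    ct_write(e->guc, e->id, e->ring_tail);
            mutex_unlock(&e->guc->ct_lock);

            return job_fence(job);
    }

So the only contended section is the few-dword CT write; the bulk of the
submission work is lock-free and per-engine.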

As far as this being horrible, well, I didn't design the GuC and this is
how it is implemented for KMD-based submission. We also have 256
doorbells so we wouldn't need a lock, but I think there are other issues
with that design too which need to be worked out in the Xe2 / Xe3
timeframe.
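
For contrast, a doorbell-based path would collapse the CT write above
into a single lockless register write per engine. A purely illustrative
sketch (doorbell programming details are HW/FW-specific, and this is
not what Xe currently implements):

    static void ring_doorbell(struct xe_engine *e)
    {
            /* Each engine owns its own doorbell page, so no shared
             * lock is needed to tell the GuC the tail moved. */
            writel(e->ring_tail, e->doorbell_reg);
    }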

Also, per my follow-up response, Xe does ~33k execs per second with the
current implementation on an 8-core (or maybe 8-thread) TGL, which seems
fine to me.

Matt
 
> --Jason
> 
> 
> > (And yes this backend contention is there regardless of 1:1:1, it would
> > require a different re-design to solve that. But it is just a question
> > whether there are 144 contending threads, or just 6 with the thread per
> > engine class scheme.)
> >
> > Then multiply all by 10 for a 4U server use case and you get 1440 worker
> > kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> > Just so they can grab a timeslice and maybe contend on a mutex as the
> > next step.
> >
> > This example is where it would hurt on large systems. Imagine only an
> > even wider media transcode card...
> >
> > Second example is only a single engine class used (3d desktop?) but with
> > a bunch of not-runnable jobs queued and waiting on a fence to signal.
> > Implicit or explicit dependencies don't matter. Then the fence signals
> > and call backs run. N work items get scheduled, but they all submit to
> > the same HW engine. So we end up with:
> >
> >          /-- wi1 --\
> >         / ..     .. \
> >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> >         \ ..    ..  /
> >          \-- wiN --/
> >
> >
> > All that we have achieved is waking up N CPUs to contend on the same
> > lock and effectively insert the job into the same single HW queue. I
> > don't see any positives there.
> >
> > This example I think can particularly hurt small / low power devices
> > because of needless waking up of many cores for no benefit. Granted, I
> > don't have a good feel for how common this pattern is in practice.
> >
> > >
> > >     That
> > >     is the number which drives the maximum number of not-runnable jobs
> > that
> > >     can become runnable at once, and hence spawn that many work items,
> > and
> > >     in turn unbound worker threads.
> > >
> > >     Several problems there.
> > >
> > >     It is fundamentally pointless to have potentially that many more
> > >     threads
> > >     than the number of CPU cores - it simply creates a scheduling storm.
> > >
> > >     Unbound workers have no CPU / cache locality either and no connection
> > >     with the CPU scheduler to optimize scheduling patterns. This may
> > matter
> > >     either on large systems or on small ones. Whereas the current design
> > >     allows for the scheduler to notice a userspace CPU thread keeps
> > >     waking up the same drm scheduler kernel thread, and so it can keep
> > >     them on the same CPU, the unbound workers lose that ability and so
> > >     a 2nd CPU might be getting woken up from low-power sleep for every
> > >     submission.
> > >
> > >     Hence, apart from being a bit of an impedance mismatch, the proposal
> > >     has the potential to change performance and power patterns on both
> > >     large and small machines.
> > >
> > >
> > > Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
> > > deferred kwork does appear to mismatch somewhat with what the scheduler
> > > needs or at least how it's worked in the past.  How much impact will
> > > that mismatch have?  Unclear.
> > >
> > >      >      >>> Secondly, it probably demands separate workers (not
> > >     optional),
> > >      >     otherwise
> > >      >      >>> behaviour of shared workqueues has either the potential
> > to
> > >      >     explode the number of
> > >      >      >>> kernel threads anyway, or add latency.
> > >      >      >>>
> > >      >      >>
> > >      >      >> Right now the system_unbound_wq is used which does have a
> > >     limit
> > >      >     on the
> > >      >      >> number of threads, right? I do have a FIXME to allow a
> > >     worker to be
> > >      >      >> passed in similar to TDR.
> > >      >      >>
> > >      >      >> WRT latency, the 1:1 ratio could actually have lower
> > >     latency
> > >      >     as 2 GPU
> > >      >      >> schedulers can be pushing jobs into the backend /
> > cleaning up
> > >      >     jobs in
> > >      >      >> parallel.
> > >      >      >>
> > >      >      >
> > >      >      > Thought of one more point here on why in Xe we
> > >     absolutely want
> > >      >     a 1 to
> > >      >      > 1 ratio between entity and scheduler - the way we implement
> > >      >     timeslicing
> > >      >      > for preempt fences.
> > >      >      >
> > >      >      > Let me try to explain.
> > >      >      >
> > >      >      > Preempt fences are implemented via the generic messaging
> > >      >     interface [1]
> > >      >      > with suspend / resume messages. If a suspend message is
> > >      >      > received too soon after calling resume (this is per entity)
> > >      >      > we simply
> > >     sleep in the
> > >      >      > suspend call thus giving the entity a timeslice. This
> > >     completely
> > >      >     falls
> > >      >      > apart with a many to 1 relationship as now an entity
> > >     waiting for a
> > >      >      > timeslice blocks the other entities. Could we work around
> > >     this,
> > >      >     sure but
> > >      >      > just another bunch of code we'd have to add in Xe. Being able to
> > >      >     freely sleep
> > >      >      > in the backend without affecting other entities is really,
> > really
> > >      >     nice IMO
> > >      >      > and I bet Xe isn't the only driver that is going to feel
> > >     this way.
> > >      >      >
> > >      >      > Last thing I'll say: regardless of how anyone feels about
> > >     Xe using
> > >      >     a 1 to
> > >      >      > 1 relationship, this patch IMO makes sense as I hope we can
> > all
> > >      >     agree a
> > >      >      > workqueue scales better than kthreads.
> > >      >
> > >      >     I don't know for sure what will scale better and for what use
> > >     case,
> > >      >     combination of CPU cores vs number of GPU engines to keep
> > >     busy vs other
> > >      >     system activity. But I wager someone is bound to ask for some
> > >      >     numbers to
> > >      >     make sure the proposal is not negatively affecting any other
> > drivers.
> > >      >
> > >      >
> > >      > Then let them ask.  Waving your hands vaguely in the direction of
> > >     the
> > >      > rest of DRM and saying "Uh, someone (not me) might object" is
> > >     profoundly
> > >      > unhelpful.  Sure, someone might.  That's why it's on dri-devel.
> > >     If you
> > >      > think there's someone in particular who might have a useful
> > >     opinion on
> > >      > this, throw them in the CC so they don't miss the e-mail thread.
> > >      >
> > >      > Or are you asking for numbers?  If so, what numbers are you
> > >     asking for?
> > >
> > >     It was a heads up to the Xe team in case people weren't appreciating
> > >     how
> > >     the proposed change has the potential to influence power and performance
> > >     across the board. And nothing in the follow up discussion made me
> > think
> > >     it was considered so I don't think it was redundant to raise it.
> > >
> > >     In my experience it is typical that such core changes come with some
> > >     numbers. Which is in case of drm scheduler is tricky and probably
> > >     requires explicitly asking everyone to test (rather than count on
> > >     "don't
> > >     miss the email thread"). Real products can fail to ship due ten mW
> > here
> > >     or there. Like suddenly an extra core prevented from getting into
> > deep
> > >     sleep.
> > >
> > >     If that was "profoundly unhelpful" so be it.
> > >
> > >
> > > With your above explanation, it makes more sense what you're asking.
> > > It's still not something Matt is likely to be able to provide on his
> > > own.  We need to tag some other folks and ask them to test it out.  We
> > > could play around a bit with it on Xe but it's not exactly production
> > > grade yet and is going to hit this differently from most.  Likely
> > > candidates are probably AMD and Freedreno.
> >
> > Whoever is set up to check out power and performance would be good to
> > give it a spin, yes.
> >
> > PS. I don't think I was asking Matt to test with other devices. To start
> > with I think Xe is a team effort. I was asking for more background on
> > the design decision since patch 4/20 does not say anything on that
> > angle, nor was it IMO sufficiently addressed later in the thread.
> >
> > >      > Also, if we're talking about a design that might paint us into an
> > >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> > >     We're
> > >      > talking about switching which kernel threading/task mechanism to
> > >     use for
> > >      > what's really a very generic problem.  The core Xe design works
> > >     without
> > >      > this patch (just with more kthreads).  If we land this patch or
> > >      > something like it and get it wrong and it causes a performance
> > >     problem
> > >      > for someone down the line, we can revisit it.
> > >
> > >     For some definition of "it works" - I really wouldn't suggest
> > >     shipping a
> > >     kthread per user context at any point.
> > >
> > >
> > > You have yet to elaborate on why. What resources is it consuming that's
> > > going to be a problem? Are you anticipating CPU affinity problems? Or
> > > does it just seem wasteful?
> >
> > Well I don't know, commit message says the approach does not scale. :)
> >
> > > I think I largely agree that it's probably unnecessary/wasteful but
> > > reducing the number of kthreads seems like a tractable problem to solve
> > > regardless of where we put the gpu_scheduler object.  Is this the right
> > > solution?  Maybe not.  It was also proposed at one point that we could
> > > split the scheduler into two pieces: A scheduler which owns the kthread,
> > > and a back-end which targets some HW ring thing where you can have
> > > multiple back-ends per scheduler.  That's certainly more invasive from a
> > > DRM scheduler internal API PoV but would solve the kthread problem in a
> > > way that's more similar to what we have now.
> > >
> > >      >     In any case that's a low level question caused by the high
> > >     level design
> > >      >     decision. So I'd think first focus on the high level - which
> > >     is the 1:1
> > >      >     mapping of entity to scheduler instance proposal.
> > >      >
> > >      >     Fundamentally it will be up to the DRM maintainers and the
> > >     community to
> > >      >     bless your approach. And it is important to stress 1:1 is
> > about
> > >      >     userspace contexts, so I believe unlike any other current
> > >     scheduler
> > >      >     user. And also important to stress this effectively does not
> > >     make Xe
> > >      >     _really_ use the scheduler that much.
> > >      >
> > >      >
> > >      > I don't think this makes Xe nearly as much of a one-off as you
> > >     think it
> > >      > does.  I've already told the Asahi team working on Apple M1/2
> > >     hardware
> > >      > to do it this way and it seems to be a pretty good mapping for
> > >     them. I
> > >      > believe this is roughly the plan for nouveau as well.  It's not
> > >     the way
> > >      > it currently works for anyone because most other groups aren't
> > >     doing FW
> > >      > scheduling yet.  In the world of FW scheduling and hardware
> > >     designed to
> > >      > support userspace direct-to-FW submit, I think the design makes
> > >     perfect
> > >      > sense (see below) and I expect we'll see more drivers move in this
> > >      > direction as those drivers evolve.  (AMD is doing some customish
> > >     thing
> > >      > with gpu_scheduler on the front-end somehow. I've not dug
> > >     into
> > >      > those details.)
> > >      >
> > >      >     I can only offer my opinion, which is that the two options
> > >     mentioned in
> > >      >     this thread (either improve drm scheduler to cope with what is
> > >      >     required,
> > >      >     or split up the code so you can use just the parts of
> > >     drm_sched which
> > >      >     you want - which is frontend dependency tracking) shouldn't
> > be so
> > >      >     readily dismissed, given how I think the idea was for the new
> > >     driver to
> > >      >     work less in a silo and more in the community (not do kludges
> > to
> > >      >     workaround stuff because it is thought to be too hard to
> > >     improve common
> > >      >     code), but fundamentally, "goto previous paragraph" as far as
> > >      >     I am concerned.
> > >      >
> > >      >
> > >      > Meta comment:  It appears as if you're falling into the standard
> > >     i915
> > >      > team trap of having an internal discussion about what the
> > community
> > >      > discussion might look like instead of actually having the
> > community
> > >      > discussion.  If you are seriously concerned about interactions
> > with
> > >      > other drivers or whether we're setting common direction, the right
> > >     way to
> > >      > do that is to break a patch or two out into a separate RFC series
> > >     and
> > >      > tag a handful of driver maintainers.  Trying to predict the
> > >     questions
> > >      > other people might ask is pointless. Cc them and ask for their
> > >     input
> > >      > instead.
> > >
> > >     I don't follow you here. It's not an internal discussion - I am
> > raising
> > >     my concerns on the design publicly. Am I supposed to write a patch to
> > >     show something before I am allowed to comment on an RFC series?
> > >
> > >
> > > I may have misread your tone a bit.  It felt a bit like too many
> > > discussions I've had in the past where people are trying to predict what
> > > others will say instead of just asking them.  Reading it again, I was
> > > probably jumping to conclusions a bit.  Sorry about that.
> >
> > Okay no problem, thanks. In any case we don't have to keep discussing
> > it; as I wrote one or two emails ago, it is fundamentally on the
> > maintainers and community to ack the approach. I only felt like RFC did
> > not explain the potential downsides sufficiently so I wanted to probe
> > that area a bit.
> >
> > >     It is "drm/sched: Convert drm scheduler to use a work queue rather
> > than
> > >     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> > >
> > >
> > > Yeah, it probably should have.  I think that's mostly what I've been
> > > trying to say.
> > >
> > >      >
> > >      >     Regards,
> > >      >
> > >      >     Tvrtko
> > >      >
> > >      >     P.S. And as a related side note, there are more areas where
> > >     drm_sched
> > >      >     could be improved, like for instance priority handling.
> > >      >     Take a look at msm_submitqueue_create /
> > >     msm_gpu_convert_priority /
> > >      >     get_sched_entity to see how msm works around the drm_sched
> > >     hardcoded
> > >      >     limit of available priority levels, in order to avoid having
> > >     to leave a
> > >      >     hw capability unused. I suspect msm would be happier if they
> > >     could have
> > >      >     all priority levels equal in terms of whether they apply only
> > >     at the
> > >      >     frontend level or completely throughout the pipeline.
> > >      >
> > >      >      > [1]
> > >      >
> > >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > >      >      >
> > >      >      >>> What would be interesting to learn is whether the option
> > of
> > >      >     refactoring
> > >      >      >>> drm_sched to deal with out of order completion was
> > >     considered
> > >      >     and what were
> > >      >      >>> the conclusions.
> > >      >      >>>
> > >      >      >>
> > >      >      >> I coded this up a while back when trying to convert the
> > >     i915 to
> > >      >     the DRM
> > >      >      >> scheduler; it isn't all that hard either. The free flow
> > >     control
> > >      >     on the
> > >      >      >> ring (e.g. set job limit == SIZE OF RING / MAX JOB SIZE)
> > is
> > >      >     really what
> > >      >      >> sold me on this design.
> > >      >
> > >      >
> > >      > You're not the only one to suggest supporting out-of-order
> > >     completion.
> > >      > However, it's tricky and breaks a lot of internal assumptions of
> > the
> > >      > scheduler. It also reduces functionality a bit because it can no
> > >     longer
> > >      > automatically rate-limit HW/FW queues which are often
> > >     fixed-size.  (Ok,
> > >      > yes, it probably could but it becomes a substantially harder
> > >     problem.)
> > >      >
> > >      > It also seems like a worse mapping to me.  The goal here is to
> > turn
> > >      > submissions on a userspace-facing engine/queue into submissions
> > >     to a FW
> > >      > queue, sorting out any dma_fence dependencies.  Matt's
> > >      > description of saying this is a 1:1 mapping between sched/entity
> > >     doesn't
> > >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> > >     mapping?
> > >      > Why is that better?
> > >
> > >     As I have stated before, what I think would fit well for Xe is
> > one
> > >     drm_scheduler per engine class. In specific terms on our current
> > >     hardware, one drm scheduler instance for render, compute, blitter,
> > >     video
> > >     and video enhance. Userspace contexts remain scheduler entities.
> > >
> > >
> > > And this is where we fairly strongly disagree.  More in a bit.
> > >
> > >     That way you avoid the whole kthread/kworker story and you have it
> > >     actually use the entity picking code in the scheduler, which may be
> > >     useful when the backend is congested.
> > >
> > >
> > > What back-end congestion are you referring to here?  Running out of FW
> > > queue IDs?  Something else?
> >
> > CT channel, number of context ids.
> >
> > >
> > >     Yes you have to solve the out of order problem so in my mind that is
> > >     something to discuss. What the problem actually is (just TDR?), how
> > >     tricky and why etc.
> > >
> > >     And yes you lose the handy LRCA ring buffer size management so you'd
> > >     have to make those entities not runnable in some other way.
> > >
> > >     Regarding the argument you raise below - would any of that make the
> > >     frontend / backend separation worse and why? Do you think it is less
> > >     natural? If neither is true then all that remains is that the extra
> > >     work to support out of order completion of entities appears to have
> > >     been discounted in favour of an easy but IMO inelegant option.
> > >
> > >
> > > Broadly speaking, the kernel needs to stop thinking about GPU scheduling
> > > in terms of scheduling jobs and start thinking in terms of scheduling
> > > contexts/engines.  There is still some need for scheduling individual
> > > jobs but that is only for the purpose of delaying them as needed to
> > > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > > get shoved onto the context/engine queue and from there the kernel only
> > > really manages whole contexts/engines.  This is a major architectural
> > > shift, entirely different from the way i915 scheduling works.  It's also
> > > different from the historical usage of DRM scheduler which I think is
> > > why this all looks a bit funny.
> > >
> > > To justify this architectural shift, let's look at where we're headed.
> > > In the glorious future...
> > >
> > >   1. Userspace submits directly to firmware queues.  The kernel has no
> > > visibility whatsoever into individual jobs.  At most it can pause/resume
> > > FW contexts as needed to handle eviction and memory management.
> > >
> > >   2. Because of 1, apart from handing out the FW queue IDs at the
> > > beginning, the kernel can't really juggle them that much.  Depending on
> > > FW design, it may be able to pause a client, give its IDs to another,
> > > and then resume it later when IDs free up.  What it's not doing is
> > > juggling IDs on a job-by-job basis like i915 currently is.
> > >
> > >   3. Long-running compute jobs may not complete for days.  This means
> > > that memory management needs to happen in terms of pause/resume of
> > > entire contexts/engines using the memory rather than based on waiting
> > > for individual jobs to complete or pausing individual jobs until the
> > > memory is available.
> > >
> > >   4. Synchronization happens via userspace memory fences (UMF) and the
> > > kernel is mostly unaware of most dependencies and when a context/engine
> > > is or is not runnable.  Instead, it keeps as many of them minimally
> > > active (memory is available, even if it's in system RAM) as possible and
> > > lets the FW sort out dependencies.  (There may need to be some facility
> > > for sleeping a context until a memory change similar to futex() or
> > > poll() for userspace threads.  There are some details TBD.)
> > >
> > > Are there potential problems that will need to be solved here?  Yes.  Is
> > > it a good design?  Well, Microsoft has been living in this future for
> > > half a decade or better and it's working quite well for them.  It's also
> > > the way all modern game consoles work.  It really is just Linux that's
> > > stuck with the same old job model we've had since the monumental shift
> > > to DRI2.
> > >
> > > To that end, one of the core goals of the Xe project was to make the
> > > driver internally behave as close to the above model as possible while
> > > keeping the old-school job model as a very thin layer on top.  As the
> > > broader ecosystem problems (window-system support for UMF, for instance)
> > > are solved, that layer can be peeled back.  The core driver will already
> > > be ready for it.
> > >
> > > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > > and stuff jobs into their respective context/engine queue once they're
> > > ready.  All the actual scheduling happens in firmware and any scheduling
> > > the kernel does to deal with contention, oversubscriptions, too many
> > > contexts, etc. is between contexts/engines, not individual jobs.  Sure,
> > > the individual job visibility is nice, but if we design around it, we'll
> > > never get to the glorious future.
> > >
> > > I really need to turn the above (with a bit more detail) into a blog
> > > post.... Maybe I'll do that this week.
> > >
> > > In any case, I hope that provides more insight into why Xe is designed
> > > the way it is and why I'm pushing back so hard on trying to make it more
> > > of a "classic" driver as far as scheduling is concerned.  Are there
> > > potential problems here?  Yes, that's why Xe has been labeled a
> > > prototype.  Are such radical changes necessary to get to said glorious
> > > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> >
> > Right, that's all solid I think. My takeaway is that frontend priority
> > sorting and that stuff isn't needed and that is okay. And that there are
> > multiple options to maybe improve drm scheduler, like the aforementioned
> > making it deal with out of order completion, or splitting it into
> > functional components, or the frontend/backend split you suggested. For
> > most of them the cost vs benefit is not completely clear, nor is it
> > clear how much effort was invested to look into them.
> >
> > One thing I missed from this explanation is how drm_scheduler per engine
> > class interferes with the high level concepts. And I did not manage to
> > pick up on what exactly the TDR problem is in that case. Maybe the two
> > are one and the same.
> >
> > Bottom line is I still have the concern that conversion to kworkers has
> > an opportunity to regress. Possibly more opportunity for some Xe use
> > cases than to affect other vendors, since they would still be using per
> > physical engine / queue scheduler instances.
> >
> > And to put my money where my mouth is I will try to put testing Xe
> > inside the full blown ChromeOS environment into my team's plans. It would
> > probably also be beneficial if Xe team could take a look at real world
> > behaviour of the extreme transcode use cases too. If the stack is ready
> > for that and all. It would be better to know earlier rather than later
> > if there is a fundamental issue.
> >
> > For the patch at hand, and the cover letter, it certainly feels they
> > would benefit from recording the past design discussion had with AMD
> > folks, explicitly Cc-ing other driver teams, and recording the
> > theoretical pros and cons of threads vs unbound workers as I have tried
> > to highlight them.
> >
> > Regards,
> >
> > Tvrtko
> >

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-11 22:31                             ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-11 22:31 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> tvrtko.ursulin@linux.intel.com> wrote:
> 
> >
> > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > <tvrtko.ursulin@linux.intel.com <mailto:tvrtko.ursulin@linux.intel.com>>
> >
> > > wrote:
> > >
> > >
> > >
> > >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > >
> > >     [snip]
> > >
> > >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> > created
> > >      >     contexts (per
> > >      >      >>> context _and_ engine) and drm_sched. I am not sure
> > avoiding
> > >      >     invasive changes
> > >      >      >>> to the shared code is in the spirit of the overall idea
> > >      >      >>> and instead the opportunity should be used to look at
> > >      >      >>> ways to refactor/improve drm_sched.
> > >      >
> > >      >
> > >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > >     all or
> > >      > really needs to drive a re-factor.  (More on that later.)
> > >     There's only
> > >      > one real issue which is that it fires off potentially a lot of
> > >     kthreads.
> > >      > Even that's not that bad given that kthreads are pretty light and
> > >     you're
> > >      > not likely to have more kthreads than userspace threads which are
> > >     much
> > >      > heavier.  Not ideal, but not the end of the world either.
> > >     Definitely
> > >      > something we can/should optimize but if we went through with Xe
> > >     without
> > >      > this patch, it would probably be mostly ok.
> > >      >
> > >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >      >      >>
> > >      >      >> I'm not really prepared to make large changes to DRM
> > >     scheduler
> > >      >     at the
> > >      >      >> moment for Xe as they are not really required, nor does
> > >      >      >> Boris seem to think they
> > >      >      >> will be required for his work either. I am interested to
> > see
> > >      >     what Boris
> > >      >      >> comes up with.
> > >      >      >>
> > >      >      >>> Even on the low level, the idea to replace drm_sched
> > threads
> > >      >     with workers
> > >      >      >>> has a few problems.
> > >      >      >>>
> > >      >      >>> To start with, the pattern of:
> > >      >      >>>
> > >      >      >>>    while (not_stopped) {
> > >      >      >>>     keep picking jobs
> > >      >      >>>    }
> > >      >      >>>
> > >      >      >>> Feels fundamentally in disagreement with workers (while
> > >      >     obviously fits
> > >      >      >>> perfectly with the current kthread design).
> > >      >      >>
> > >      >      >> The while loop breaks and the worker exits if no jobs
> > >      >      >> are ready.
> > >      >
> > >      >
> > >      > I'm not very familiar with workqueues. What are you saying would
> > fit
> > >      > better? One scheduling job per work item rather than one big work
> > >     item
> > >      > which handles all available jobs?
> > >
> > >     Yes and no, it indeed IMO does not fit to have a work item which is
> > >     potentially unbound in runtime. But that is a bit of a moot
> > >     conceptual mismatch because it is a worst case / theoretical, and I
> > >     think due to more fundamental concerns.
> > >
> > >     If we have to go back to the low level side of things, I've picked
> > this
> > >     random spot to consolidate what I have already mentioned and perhaps
> > >     expand.
> > >
> > >     To start with, let me pull out some thoughts from workqueue.rst:
> > >
> > >     """
> > >     Generally, work items are not expected to hog a CPU and consume many
> > >     cycles. That means maintaining just enough concurrency to prevent
> > work
> > >     processing from stalling should be optimal.
> > >     """
> > >
> > >     For unbound queues:
> > >     """
> > >     The responsibility of regulating concurrency level is on the users.
> > >     """
> > >
> > >     Given the unbound queues will be spawned on demand to service all
> > >     queued
> > >     work items (more interesting when mixing up with the
> > >     system_unbound_wq),
> > >     in the proposed design the number of instantiated worker threads does
> > >     not correspond to the number of user threads (as you have elsewhere
> > >     stated), but pessimistically to the number of active user contexts.
> > >
> > >
> > > Those are pretty much the same in practice.  Rather, the number of
> > > user threads is typically an upper bound on the number of contexts.
> > > Yes, a single user thread could have a bunch of contexts but basically
> > > nothing does that
> > > except IGT.  In real-world usage, it's at most one context per user
> > thread.
> >
> > Typically is the key here. But I am not sure it is good enough. Consider
> > this example - Intel Flex 170:
> >
> >   * Delivers up to 36 streams 1080p60 transcode throughput per card.
> >   * When scaled to 10 cards in a 4U server configuration, it can support
> > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> >
> 
> I had a feeling it was going to be media.... 😅
> 

Yeah, I wonder whether the media UMD can be rewritten to use fewer
xe_engines. It is already a massive rewrite for VM bind + no implicit
dependencies, so let's just pile on some more work?

> 
> > One transcode stream from my experience typically is 3-4 GPU contexts
> > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> > CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> > 60fps = 8640 jobs submitted and completed per second.
> >
> > 144 active contexts in the proposed scheme possibly means 144
> > kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> > don't think the pools would scale down given all are constantly pinged
> > at 60fps.)
> >
> > And then each of 144 threads goes to grab the single GuC CT mutex. First
> > threads are being made schedulable, then put to sleep as mutex
> > contention is hit, then woken again as mutexes are getting released,
> > rinse, repeat.
> >
> 
> Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
> back-end yet but I was under the impression that most run_job() would be
> just shoving another packet into a ring buffer.  If we have to send the GuC
> a message on the control ring every single time we submit a job, that's
> pretty horrible.
>

Run job writes the ring buffer and moves the tail as the first step (no
lock required). Next it needs to tell the GuC the xe_engine LRC tail has
moved; this is done over a single Host to GuC channel, which is a circular
buffer, with the writing of the channel protected by the mutex. There are
a few more nuances too, but in practice there is always space in the
channel, so the time the mutex needs to be held is really, really small
(check cached credits, write 3 dwords of payload, write 1 dword to move
the tail). I also believe mutexes in Linux are hybrid in that they spin
for a little bit before sleeping, and certainly if there is space in the
channel we shouldn't sleep on mutex contention.
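
To make the locking scope concrete, here is a minimal C sketch of that
two-step path. All names and structures here are invented for
illustration; this is not the actual Xe code:

#define SKETCH_RING_DW 1024	/* illustrative ring size in dwords */

struct sketch_engine {
	u32 *ring;	/* per-xe_engine ring; written by its one scheduler */
	u32 tail;	/* ring tail in dwords */
};

struct sketch_guc_ct {
	struct mutex lock;	/* serializes writers of the shared CT buffer */
	/* circular buffer state, cached credits, ... */
};

/* Elided in this sketch: check cached credits, write ~3 dwords of
 * payload, write 1 dword to move the CT tail. */
static void sketch_ct_send_tail_update(struct sketch_guc_ct *ct,
				       struct sketch_engine *e);

static void sketch_run_job(struct sketch_engine *e,
			   struct sketch_guc_ct *ct,
			   const u32 *cmds, u32 ndw)
{
	u32 i;

	/* Step 1: lock-free; the ring is owned by exactly one scheduler
	 * in the 1:1:1 mapping, so nothing else writes it. */
	for (i = 0; i < ndw; i++)
		e->ring[(e->tail + i) % SKETCH_RING_DW] = cmds[i];
	e->tail = (e->tail + ndw) % SKETCH_RING_DW;

	/* Step 2: tell the GuC the LRC tail moved. Only this short
	 * section touches the shared CT channel and needs the mutex. */
	mutex_lock(&ct->lock);
	sketch_ct_send_tail_update(ct, e);
	mutex_unlock(&ct->lock);
}

The point being that contention is confined to step 2, and that critical
section stays small as long as the channel has space.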

As far as this being horrible, well, I didn't design the GuC and this is
how it is implemented for KMD based submission. We also have 256 doorbells
so we wouldn't need a lock, but I think there are other issues with that
design too which need to be worked out in the Xe2 / Xe3 timeframe.

Also, if you see my follow up response, Xe is ~33k execs per second with
the current implementation on an 8 core (or maybe 8 thread) TGL, which
seems fine to me.

Matt
 
> --Jason
> 
> 
> (And yes this backend contention is there regardless of 1:1:1, it would
> > require a different re-design to solve that. But it is just a question
> > whether there are 144 contending threads, or just 6 with the thread per
> > engine class scheme.)
> >
> > Then multiply all by 10 for a 4U server use case and you get 1440 worker
> > kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> > Just so they can grab a timeslice and maybe contend on a mutex as the
> > next step.
> >
> > This example is where it would hurt on large systems. Imagine only an
> > even wider media transcode card...
> >
> > Second example is only a single engine class used (3d desktop?) but with
> > a bunch of not-runnable jobs queued and waiting on a fence to signal.
> > Implicit or explicit dependencies doesn't matter. Then the fence signals
> > and call backs run. N work items get scheduled, but they all submit to
> > the same HW engine. So we end up with:
> >
> >          /-- wi1 --\
> >         / ..     .. \
> >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> >         \ ..    ..  /
> >          \-- wiN --/
> >
> >
> > All that we have achieved is waking up N CPUs to contend on the same
> > lock and effectively insert the job into the same single HW queue. I
> > don't see any positives there.
> >
> > This example I think can particularly hurt small / low power devices
> > because of needless waking up of many cores for no benefit. Granted, I
> > don't have a good feel on how common this pattern is in practice.
> >
> > >
> > >     That
> > >     is the number which drives the maximum number of not-runnable jobs
> > that
> > >     can become runnable at once, and hence spawn that many work items,
> > and
> > >     in turn unbound worker threads.
> > >
> > >     Several problems there.
> > >
> > >     It is fundamentally pointless to have potentially that many more
> > >     threads
> > >     than the number of CPU cores - it simply creates a scheduling storm.
> > >
> > >     Unbound workers have no CPU / cache locality either and no connection
> > >     with the CPU scheduler to optimize scheduling patterns. This may
> > matter
> > >     either on large systems or on small ones. Whereas the current design
> > >     allows the scheduler to notice a userspace CPU thread keeps waking up
> > the
> > >     same drm scheduler kernel thread, and so it can keep them on the same
> > >     CPU, the unbound workers lose that ability and so 2nd CPU might be
> > >     getting woken up from low sleep for every submission.
> > >
> > >     Hence, apart from being a bit of an impedance mismatch, the
> > >     proposal has the potential to change performance and power patterns
> > >     on both large and small machines.
> > >
> > >
> > > Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
> > > deferred kwork does appear to mismatch somewhat with what the scheduler
> > > needs or at least how it's worked in the past.  How much impact will
> > > that mismatch have?  Unclear.
> > >
> > >      >      >>> Secondly, it probably demands separate workers (not
> > >     optional),
> > >      >     otherwise
> > >      >      >>> behaviour of shared workqueues has either the potential
> > >      >      >>> to explode the number of kernel threads anyway, or add
> > >      >      >>> latency.
> > >      >      >>>
> > >      >      >>
> > >      >      >> Right now the system_unbound_wq is used which does have a
> > >     limit
> > >      >     on the
> > >      >      >> number of threads, right? I do have a FIXME to allow a
> > >     worker to be
> > >      >      >> passed in similar to TDR.
> > >      >      >>
> > >      >      >> WRT latency, the 1:1 ratio could actually have lower
> > >     latency
> > >      >     as 2 GPU
> > >      >      >> schedulers can be pushing jobs into the backend /
> > cleaning up
> > >      >     jobs in
> > >      >      >> parallel.
> > >      >      >>
> > >      >      >
> > >      >      > Thought of one more point here on why in Xe we
> > >      >      > absolutely want a 1 to
> > >      >      > 1 ratio between entity and scheduler - the way we implement
> > >      >     timeslicing
> > >      >      > for preempt fences.
> > >      >      >
> > >      >      > Let me try to explain.
> > >      >      >
> > >      >      > Preempt fences are implemented via the generic messaging
> > >      >     interface [1]
> > >      >      > with suspend / resume messages. If a suspend messages is
> > >     received to
> > >      >      > soon after calling resume (this is per entity) we simply
> > >     sleep in the
> > >      >      > suspend call thus giving the entity a timeslice. This
> > >     completely
> > >      >     falls
> > >      >      > apart with a many to 1 relationship as now a entity
> > >     waiting for a
> > >      >      > timeslice blocks the other entities. Could we work aroudn
> > >     this,
> > >      >     sure but
> > >      >      > just another bunch of code we'd have to add in Xe. Being to
> > >      >     freely sleep
> > >      >      > in backend without affecting other entities is really,
> > really
> > >      >     nice IMO
> > >      >      > and I bet Xe isn't the only driver that is going to feel
> > >     this way.
> > >      >      >
> > >      >      > Last thing I'll say regardless of how anyone feels about
> > >     Xe using
> > >      >     a 1 to
> > >      >      > 1 relationship this patch IMO makes sense as I hope we can
> > all
> > >      >     agree a
> > >      >      > workqueue scales better than kthreads.
> > >      >
> > >      >     I don't know for sure what will scale better and for what use
> > >     case,
> > >      >     combination of CPU cores vs number of GPU engines to keep
> > >     busy vs other
> > >      >     system activity. But I wager someone is bound to ask for some
> > >      >     numbers to
> > >      >     make sure the proposal is not negatively affecting any other
> > drivers.
> > >      >
> > >      >
> > >      > Then let them ask.  Waving your hands vaguely in the direction of
> > >     the
> > >      > rest of DRM and saying "Uh, someone (not me) might object" is
> > >     profoundly
> > >      > unhelpful.  Sure, someone might.  That's why it's on dri-devel.
> > >     If you
> > >      > think there's someone in particular who might have a useful
> > >     opinion on
> > >      > this, throw them in the CC so they don't miss the e-mail thread.
> > >      >
> > >      > Or are you asking for numbers?  If so, what numbers are you
> > >     asking for?
> > >
> > >     It was a heads up to the Xe team in case people weren't appreciating
> > >     how
> > >     the proposed change has the potential to influence power and performance
> > >     across the board. And nothing in the follow up discussion made me
> > think
> > >     it was considered so I don't think it was redundant to raise it.
> > >
> > >     In my experience it is typical that such core changes come with some
> > >     numbers. Which in the case of drm scheduler is tricky and probably
> > >     requires explicitly asking everyone to test (rather than count on
> > >     "don't
> > >     miss the email thread"). Real products can fail to ship due to ten mW
> > here
> > >     or there. Like suddenly an extra core prevented from getting into
> > deep
> > >     sleep.
> > >
> > >     If that was "profoundly unhelpful" so be it.
> > >
> > >
> > > With your above explanation, it makes more sense what you're asking.
> > > It's still not something Matt is likely to be able to provide on his
> > > own.  We need to tag some other folks and ask them to test it out.  We
> > > could play around a bit with it on Xe but it's not exactly production
> > > grade yet and is going to hit this differently from most.  Likely
> > > candidates are probably AMD and Freedreno.
> >
> > Whoever is setup to check out power and performance would be good to
> > give it a spin, yes.
> >
> > PS. I don't think I was asking Matt to test with other devices. To start
> > with I think Xe is a team effort. I was asking for more background on
> > the design decision since patch 4/20 does not say anything on that
> > angle, nor was it IMO sufficiently addressed later in the thread.
> >
> > >      > Also, if we're talking about a design that might paint us into an
> > >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> > >     We're
> > >      > talking about switching which kernel threading/task mechanism to
> > >     use for
> > >      > what's really a very generic problem.  The core Xe design works
> > >     without
> > >      > this patch (just with more kthreads).  If we land this patch or
> > >      > something like it and get it wrong and it causes a performance
> > >     problem
> > >      > for someone down the line, we can revisit it.
> > >
> > >     For some definition of "it works" - I really wouldn't suggest
> > >     shipping a
> > >     kthread per user context at any point.
> > >
> > >
> > > You have yet to elaborate on why. What resources is it consuming that's
> > > going to be a problem? Are you anticipating CPU affinity problems? Or
> > > does it just seem wasteful?
> >
> > Well I don't know, the commit message says the approach does not scale. :)
> >
> > > I think I largely agree that it's probably unnecessary/wasteful but
> > > reducing the number of kthreads seems like a tractable problem to solve
> > > regardless of where we put the gpu_scheduler object.  Is this the right
> > > solution?  Maybe not.  It was also proposed at one point that we could
> > > split the scheduler into two pieces: A scheduler which owns the kthread,
> > > and a back-end which targets some HW ring thing where you can have
> > > multiple back-ends per scheduler.  That's certainly more invasive from a
> > > DRM scheduler internal API PoV but would solve the kthread problem in a
> > > way that's more similar to what we have now.
> > >
> > >      >     In any case that's a low level question caused by the high
> > >     level design
> > >      >     decision. So I'd think first focus on the high level - which
> > >     is the 1:1
> > >      >     mapping of entity to scheduler instance proposal.
> > >      >
> > >      >     Fundamentally it will be up to the DRM maintainers and the
> > >     community to
> > >      >     bless your approach. And it is important to stress 1:1 is
> > about
> > >      >     userspace contexts, so I believe unlike any other current
> > >     scheduler
> > >      >     user. And also important to stress this effectively does not
> > >     make Xe
> > >      >     _really_ use the scheduler that much.
> > >      >
> > >      >
> > >      > I don't think this makes Xe nearly as much of a one-off as you
> > >     think it
> > >      > does.  I've already told the Asahi team working on Apple M1/2
> > >     hardware
> > >      > to do it this way and it seems to be a pretty good mapping for
> > >     them. I
> > >      > believe this is roughly the plan for nouveau as well.  It's not
> > >     the way
> > >      > it currently works for anyone because most other groups aren't
> > >     doing FW
> > >      > scheduling yet.  In the world of FW scheduling and hardware
> > >     designed to
> > >      > support userspace direct-to-FW submit, I think the design makes
> > >     perfect
> > >      > sense (see below) and I expect we'll see more drivers move in this
> > >      > direction as those drivers evolve.  (AMD is doing some customish
> > >      > thing with gpu_scheduler on the front-end somehow. I've not dug
> > >      > into those details.)
> > >      >
> > >      >     I can only offer my opinion, which is that the two options
> > >     mentioned in
> > >      >     this thread (either improve drm scheduler to cope with what is
> > >      >     required,
> > >      >     or split up the code so you can use just the parts of
> > >     drm_sched which
> > >      >     you want - which is frontend dependency tracking) shouldn't
> > be so
> > >      >     readily dismissed, given how I think the idea was for the new
> > >     driver to
> > >      >     work less in a silo and more in the community (not do kludges
> > to
> > >      >     workaround stuff because it is thought to be too hard to
> > >     improve common
> > >      >     code), but fundamentally, "goto previous paragraph" as far as
> > I am
> > >      >     concerned.
> > >      >
> > >      >
> > >      > Meta comment:  It appears as if you're falling into the standard
> > >     i915
> > >      > team trap of having an internal discussion about what the
> > community
> > >      > discussion might look like instead of actually having the
> > community
> > >      > discussion.  If you are seriously concerned about interactions
> > with
> > >      > other drivers or with setting common direction, the right
> > >     way to
> > >      > do that is to break a patch or two out into a separate RFC series
> > >     and
> > >      > tag a handful of driver maintainers.  Trying to predict the
> > >     questions
> > >      > other people might ask is pointless. Cc them and ask for their
> > >     input
> > >      > instead.
> > >
> > >     I don't follow you here. It's not an internal discussion - I am
> > raising
> > >     my concerns on the design publicly. Am I supposed to write a patch
> > >     to show something before I am allowed to comment on a RFC series?
> > >
> > >
> > > I may have misread your tone a bit.  It felt a bit like too many
> > > discussions I've had in the past where people are trying to predict what
> > > others will say instead of just asking them.  Reading it again, I was
> > > probably jumping to conclusions a bit.  Sorry about that.
> >
> > Okay no problem, thanks. In any case we don't have to keep discussing
> > it, since as I wrote one or two emails ago it is fundamentally on the
> > maintainers and community to ack the approach. I only felt like the RFC did
> > not explain the potential downsides sufficiently so I wanted to probe
> > that area a bit.
> >
> > >     It is "drm/sched: Convert drm scheduler to use a work queue rather
> > than
> > >     kthread" which should have Cc-ed _everyone_ who use drm scheduler.
> > >
> > >
> > > Yeah, it probably should have.  I think that's mostly what I've been
> > > trying to say.
> > >
> > >      >
> > >      >     Regards,
> > >      >
> > >      >     Tvrtko
> > >      >
> > >      >     P.S. And as a related side note, there are more areas where
> > >     drm_sched
> > >      >     could be improved, like for instance priority handling.
> > >      >     Take a look at msm_submitqueue_create /
> > >     msm_gpu_convert_priority /
> > >      >     get_sched_entity to see how msm works around the drm_sched
> > >     hardcoded
> > >      >     limit of available priority levels, in order to avoid having
> > >     to leave a
> > >      >     hw capability unused. I suspect msm would be happier if they
> > >     could have
> > >      >     all priority levels equal in terms of whether they apply only
> > >     at the
> > >      >     frontend level or completely throughout the pipeline.
> > >      >
> > >      >      > [1]
> > >      >
> > >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > >      >      >
> > >      >      >>> What would be interesting to learn is whether the option
> > of
> > >      >     refactoring
> > >      >      >>> drm_sched to deal with out of order completion was
> > >     considered
> > >      >     and what were
> > >      >      >>> the conclusions.
> > >      >      >>>
> > >      >      >>
> > >      >      >> I coded this up a while back when trying to convert the
> > >      >      >> i915 to the DRM scheduler; it isn't all that hard either.
> > >      >      >> The free flow control on the ring (e.g. set job limit ==
> > >      >      >> SIZE OF RING / MAX JOB SIZE) is really what sold me on
> > >      >      >> this design.
> > >      >
> > >      >
> > >      > You're not the only one to suggest supporting out-of-order
> > >     completion.
> > >      > However, it's tricky and breaks a lot of internal assumptions of
> > the
> > >      > scheduler. It also reduces functionality a bit because it can no
> > >     longer
> > >      > automatically rate-limit HW/FW queues which are often
> > >     fixed-size.  (Ok,
> > >      > yes, it probably could but it becomes a substantially harder
> > >     problem.)
> > >      >
> > >      > It also seems like a worse mapping to me.  The goal here is to
> > >      > turn submissions on a userspace-facing engine/queue into
> > >      > submissions to a FW queue, sorting out any dma_fence
> > >      > dependencies.  Matt's
> > >      > description of saying this is a 1:1 mapping between sched/entity
> > >     doesn't
> > >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> > >     mapping?
> > >      > Why is that better?
> > >
> > >     As I have stated before, what I think would fit well for Xe is one
> > >     drm_scheduler per engine class. In specific terms on our current
> > >     hardware, one drm scheduler instance for render, compute, blitter,
> > >     video and video enhance. Userspace contexts remain scheduler
> > >     entities.
> > >
> > >
> > > And this is where we fairly strongly disagree.  More in a bit.
> > >
> > >     That way you avoid the whole kthread/kworker story and you have it
> > >     actually use the entity picking code in the scheduler, which may be
> > >     useful when the backend is congested.
> > >
> > >
> > > What back-end congestion are you referring to here?  Running out of FW
> > > queue IDs?  Something else?
> >
> > CT channel, number of context ids.
> >
> > >
> > >     Yes you have to solve the out of order problem so in my mind that is
> > >     something to discuss. What the problem actually is (just TDR?), how
> > >     tricky and why etc.
> > >
> > >     And yes you lose the handy LRCA ring buffer size management so you'd
> > >     have to make those entities not runnable in some other way.
> > >
> > >     Regarding the argument you raise below - would any of that make the
> > >     frontend / backend separation worse and why? Do you think it is less
> > >     natural? If neither is true then all that remains is that the extra
> > >     work to support out of order completion of entities appears to have
> > >     been discounted in favour of an easy but IMO inelegant option.
> > >
> > >
> > > Broadly speaking, the kernel needs to stop thinking about GPU scheduling
> > > in terms of scheduling jobs and start thinking in terms of scheduling
> > > contexts/engines.  There is still some need for scheduling individual
> > > jobs but that is only for the purpose of delaying them as needed to
> > > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > > get shoved onto the context/engine queue and from there the kernel only
> > > really manages whole contexts/engines.  This is a major architectural
> > > shift, entirely different from the way i915 scheduling works.  It's also
> > > different from the historical usage of DRM scheduler which I think is
> > > why this all looks a bit funny.
> > >
> > > To justify this architectural shift, let's look at where we're headed.
> > > In the glorious future...
> > >
> > >   1. Userspace submits directly to firmware queues.  The kernel has no
> > > visibility whatsoever into individual jobs.  At most it can pause/resume
> > > FW contexts as needed to handle eviction and memory management.
> > >
> > >   2. Because of 1, apart from handing out the FW queue IDs at the
> > > beginning, the kernel can't really juggle them that much.  Depending on
> > > FW design, it may be able to pause a client, give its IDs to another,
> > > and then resume it later when IDs free up.  What it's not doing is
> > > juggling IDs on a job-by-job basis like i915 currently is.
> > >
> > >   3. Long-running compute jobs may not complete for days.  This means
> > > that memory management needs to happen in terms of pause/resume of
> > > entire contexts/engines using the memory rather than based on waiting
> > > for individual jobs to complete or pausing individual jobs until the
> > > memory is available.
> > >
> > >   4. Synchronization happens via userspace memory fences (UMF) and the
> > > kernel is mostly unaware of most dependencies and when a context/engine
> > > is or is not runnable.  Instead, it keeps as many of them minimally
> > > active (memory is available, even if it's in system RAM) as possible and
> > > lets the FW sort out dependencies.  (There may need to be some facility
> > > for sleeping a context until a memory change similar to futex() or
> > > poll() for userspace threads.  There are some details TBD.)
> > >
> > > Are there potential problems that will need to be solved here?  Yes.  Is
> > > it a good design?  Well, Microsoft has been living in this future for
> > > half a decade or better and it's working quite well for them.  It's also
> > > the way all modern game consoles work.  It really is just Linux that's
> > > stuck with the same old job model we've had since the monumental shift
> > > to DRI2.
> > >
> > > To that end, one of the core goals of the Xe project was to make the
> > > driver internally behave as close to the above model as possible while
> > > keeping the old-school job model as a very thin layer on top.  As the
> > > broader ecosystem problems (window-system support for UMF, for instance)
> > > are solved, that layer can be peeled back.  The core driver will already
> > > be ready for it.
> > >
> > > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > > and stuff jobs into their respective context/engine queue once they're
> > > ready.  All the actual scheduling happens in firmware and any scheduling
> > > the kernel does to deal with contention, oversubscriptions, too many
> > > contexts, etc. is between contexts/engines, not individual jobs.  Sure,
> > > the individual job visibility is nice, but if we design around it, we'll
> > > never get to the glorious future.
> > >
> > > I really need to turn the above (with a bit more detail) into a blog
> > > post.... Maybe I'll do that this week.
> > >
> > > In any case, I hope that provides more insight into why Xe is designed
> > > the way it is and why I'm pushing back so hard on trying to make it more
> > > of a "classic" driver as far as scheduling is concerned.  Are there
> > > potential problems here?  Yes, that's why Xe has been labeled a
> > > prototype.  Are such radical changes necessary to get to said glorious
> > > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> >
> > Right, that's all solid I think. My takeaway is that frontend priority
> > sorting and that stuff isn't needed and that is okay. And that there are
> > multiple options to maybe improve drm scheduler, like the aforementioned
> > making it deal with out of order completion, or splitting it into
> > functional components, or the frontend/backend split you suggested. For
> > most of them the cost vs benefit is not completely clear, nor is it
> > clear how much effort was invested to look into them.
> >
> > One thing I missed from this explanation is how drm_scheduler per engine
> > class interferes with the high level concepts. And I did not manage to
> > pick up on what exactly the TDR problem is in that case. Maybe the two
> > are one and the same.
> >
> > Bottom line is I still have the concern that conversion to kworkers has
> > an opportunity to regress. Possibly more opportunity for some Xe use
> > cases than to affect other vendors, since they would still be using per
> > physical engine / queue scheduler instances.
> >
> > And to put my money where my mouth is I will try to put testing Xe
> > inside the full blown ChromeOS environment into my team's plans. It would
> > probably also be beneficial if Xe team could take a look at real world
> > behaviour of the extreme transcode use cases too. If the stack is ready
> > for that and all. It would be better to know earlier rather than later
> > if there is a fundamental issue.
> >
> > For the patch at hand, and the cover letter, it certainly feels they
> > would benefit from recording the past design discussion had with AMD
> > folks, explicitly Cc-ing other driver teams, and recording the
> > theoretical pros and cons of threads vs unbound workers as I have tried
> > to highlight them.
> >
> > Regards,
> >
> > Tvrtko
> >

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 22:31                             ` Matthew Brost
@ 2023-01-11 22:56                               ` Jason Ekstrand
  -1 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-11 22:56 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Tvrtko Ursulin, intel-gfx, dri-devel

On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.brost@intel.com>
wrote:

> On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> > tvrtko.ursulin@linux.intel.com> wrote:
> >
> > >
> > > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > > <tvrtko.ursulin@linux.intel.com <mailto:
> tvrtko.ursulin@linux.intel.com>>
> > >
> > > > wrote:
> > > >
> > > >
> > > >
> > > >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > >
> > > >     [snip]
> > > >
> > > >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> > > created
> > > >      >     contexts (per
> > > >      >      >>> context _and_ engine) and drm_sched. I am not sure
> > > avoiding
> > > >      >     invasive changes
> > > >      >      >>> to the shared code is in the spirit of the overall
> > > >      >      >>> idea and instead the opportunity should be used to
> > > >      >      >>> look at ways to refactor/improve drm_sched.
> > > >      >
> > > >      >
> > > >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > > >     all or
> > > >      > really needs to drive a re-factor.  (More on that later.)
> > > >     There's only
> > > >      > one real issue which is that it fires off potentially a lot of
> > > >     kthreads.
> > > >      > Even that's not that bad given that kthreads are pretty light
> and
> > > >     you're
> > > >      > not likely to have more kthreads than userspace threads which
> are
> > > >     much
> > > >      > heavier.  Not ideal, but not the end of the world either.
> > > >     Definitely
> > > >      > something we can/should optimize but if we went through with
> Xe
> > > >     without
> > > >      > this patch, it would probably be mostly ok.
> > > >      >
> > > >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > >      >      >>
> > > >      >      >> I'm not really prepared to make large changes to DRM
> > > >     scheduler
> > > >      >     at the
> > > >      >      >> moment for Xe as they are not really required, nor
> > > >      >      >> does Boris seem to think they
> > > >      >      >> will be required for his work either. I am interested
> to
> > > see
> > > >      >     what Boris
> > > >      >      >> comes up with.
> > > >      >      >>
> > > >      >      >>> Even on the low level, the idea to replace drm_sched
> > > threads
> > > >      >     with workers
> > > >      >      >>> has a few problems.
> > > >      >      >>>
> > > >      >      >>> To start with, the pattern of:
> > > >      >      >>>
> > > >      >      >>>    while (not_stopped) {
> > > >      >      >>>     keep picking jobs
> > > >      >      >>>    }
> > > >      >      >>>
> > > >      >      >>> Feels fundamentally in disagreement with workers
> (while
> > > >      >     obviously fits
> > > >      >      >>> perfectly with the current kthread design).
> > > >      >      >>
> > > >      >      >> The while loop breaks and the worker exits if no jobs
> > > >      >      >> are ready.
> > > >      >
> > > >      >
> > > >      > I'm not very familiar with workqueues. What are you saying
> would
> > > fit
> > > >      > better? One scheduling job per work item rather than one big
> work
> > > >     item
> > > >      > which handles all available jobs?
> > > >
> > > >     Yes and no, it indeed IMO does not fit to have a work item which
> is
> > > >     potentially unbound in runtime. But that is a bit of a moot
> > > >     conceptual mismatch because it is a worst case / theoretical, and
> > > >     I think due to more fundamental concerns.
> > > >
> > > >     If we have to go back to the low level side of things, I've
> picked
> > > this
> > > >     random spot to consolidate what I have already mentioned and
> perhaps
> > > >     expand.
> > > >
> > > >     To start with, let me pull out some thoughts from workqueue.rst:
> > > >
> > > >     """
> > > >     Generally, work items are not expected to hog a CPU and consume
> many
> > > >     cycles. That means maintaining just enough concurrency to prevent
> > > work
> > > >     processing from stalling should be optimal.
> > > >     """
> > > >
> > > >     For unbound queues:
> > > >     """
> > > >     The responsibility of regulating concurrency level is on the
> users.
> > > >     """
> > > >
> > > >     Given the unbound queues will be spawned on demand to service all
> > > >     queued
> > > >     work items (more interesting when mixing up with the
> > > >     system_unbound_wq),
> > > >     in the proposed design the number of instantiated worker threads
> does
> > > >     not correspond to the number of user threads (as you have
> elsewhere
> > > >     stated), but pessimistically to the number of active user
> contexts.
> > > >
> > > >
> > > > Those are pretty much the same in practice.  Rather, the number of
> > > > user threads is typically an upper bound on the number of contexts.
> > > > Yes, a single
> user
> > > > thread could have a bunch of contexts but basically nothing does that
> > > > except IGT.  In real-world usage, it's at most one context per user
> > > thread.
> > >
> > > Typically is the key here. But I am not sure it is good enough.
> Consider
> > > this example - Intel Flex 170:
> > >
> > >   * Delivers up to 36 streams 1080p60 transcode throughput per card.
> > >   * When scaled to 10 cards in a 4U server configuration, it can
> support
> > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> > >
> >
> > I had a feeling it was going to be media.... 😅
> >
>
> Yeah, I wonder whether the media UMD can be rewritten to use fewer
> xe_engines. It is already a massive rewrite for VM bind + no implicit
> dependencies, so let's just pile on some more work?
>

It could probably use fewer than it does today.  It currently creates and
throws away contexts like crazy, or did the last time I looked.  However, the
nature of media encode is that it often spreads across two or three
different types of engines.  There's not much you can do to change that.


> >
> > > One transcode stream from my experience typically is 3-4 GPU contexts
> > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> > > CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> > > 60fps = 8640 jobs submitted and completed per second.
> > >
> > > 144 active contexts in the proposed scheme possibly means 144
> > > kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> > > don't think the pools would scale down given all are constantly pinged
> > > at 60fps.)
> > >
> > > And then each of 144 threads goes to grab the single GuC CT mutex.
> First
> > > threads are being made schedulable, then put to sleep as mutex
> > > contention is hit, then woken again as mutexes are getting released,
> > > rinse, repeat.
> > >
> >
> > Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
> > back-end yet but I was under the impression that most run_job() would be
> > just shoving another packet into a ring buffer.  If we have to send the
> GuC
> > a message on the control ring every single time we submit a job, that's
> > pretty horrible.
> >
>
> Run job writes the ring buffer and moves the tail as the first step (no
> lock required). Next it needs to tell the GuC the xe_engine LRC tail has
> moved; this is done over a single Host to GuC channel, which is a circular
> buffer, with the writing of the channel protected by the mutex. There are
> a few more nuances too, but in practice there is always space in the
> channel, so the time the mutex needs to be held is really, really small
> (check cached credits, write 3 dwords of payload, write 1 dword to move
> the tail). I also believe mutexes in Linux are hybrid in that they spin
> for a little bit before sleeping, and certainly if there is space in the
> channel we shouldn't sleep on mutex contention.
>

Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we had
some way to batch things up a bit so we only have to poke the GuC channel
once for every batch of things rather than once per job.  That's maybe
something we can look into as a future improvement; not fundamental.
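
Purely as a hypothetical sketch of what such batching could look like
(reusing the invented sketch_* names from the sketch earlier in the
thread, plus an assumed struct list_head dirty_link member on
sketch_engine, initialized with INIT_LIST_HEAD; none of this is existing
Xe code): collect engines whose ring tail moved and flush them all under
one CT lock acquisition.

struct sketch_ct_batch {
	struct sketch_guc_ct *ct;	/* shared CT channel */
	struct list_head dirty;		/* engines with unflushed tail moves */
};

static void sketch_batch_mark_dirty(struct sketch_ct_batch *b,
				    struct sketch_engine *e)
{
	/* Called after the lock-free ring write; just queue the engine,
	 * don't touch the shared CT channel yet. */
	if (list_empty(&e->dirty_link))
		list_add_tail(&e->dirty_link, &b->dirty);
}

static void sketch_batch_flush(struct sketch_ct_batch *b)
{
	struct sketch_engine *e, *tmp;

	/* One mutex acquisition covers every pending tail update. */
	mutex_lock(&b->ct->lock);
	list_for_each_entry_safe(e, tmp, &b->dirty, dirty_link) {
		sketch_ct_send_tail_update(b->ct, e);
		list_del_init(&e->dirty_link);
	}
	mutex_unlock(&b->ct->lock);
}

The obvious trade-off is that deferring the CT write adds submission
latency and something has to decide when to flush, which is exactly the
kind of thing that would need measuring.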

Generally, though, it sounds like contention could be a real problem if we
end up ping-ponging that lock between cores.  It's going to depend on how
much work it takes to get the next ready thing vs. the cost of that
atomic.  But, also, anything we do is going to potentially run into
contention problems.  *shrug*  If we were going to go for
one-per-HW-engine, we may as well go one-per-device and then we wouldn't
need the lock.  Off the top of my head, that doesn't sound great either but
IDK.


> As far as this being horrible, well, I didn't design the GuC and this is
> how it is implemented for KMD based submission. We also have 256 doorbells
> so we wouldn't need a lock, but I think there are other issues with that
> design too which need to be worked out in the Xe2 / Xe3 timeframe.
>

Yeah, not blaming you.  Just surprised, that's all.  How does it work for
userspace submission?  What would it look like if the kernel emulated
userspace submission?  Is that even possible?

What are these doorbell things?  How do they play into it?


> Also, if you see my follow up response, Xe is ~33k execs per second with
> the current implementation on an 8 core (or maybe 8 thread) TGL, which
> seems fine to me.
>

33k exec/sec at 60 fps is about 550/frame, which should be fine.  That is
a lot for a single frame; I typically tell game devs to shoot for dozens
per frame.
The important thing is that it stays low even with hundreds of memory
objects bound.  (Xe should be just fine there.)

--Jason



> Matt
>
> > --Jason
> >
> >
> > (And yes this backend contention is there regardless of 1:1:1, it would
> > > require a different re-design to solve that. But it is just a question
> > > whether there are 144 contending threads, or just 6 with the thread per
> > > engine class scheme.)
> > >
> > > Then multiply all by 10 for a 4U server use case and you get 1440
> worker
> > > kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> > > Just so they can grab a timeslice and maybe contend on a mutex as the
> > > next step.
> > >
> > > This example is where it would hurt on large systems. Imagine only an
> > > even wider media transcode card...
> > >
> > > Second example is only a single engine class used (3d desktop?) but
> with
> > > a bunch of not-runnable jobs queued and waiting on a fence to signal.
> > > Implicit or explicit dependencies doesn't matter. Then the fence
> signals
> > > and call backs run. N work items get scheduled, but they all submit to
> > > the same HW engine. So we end up with:
> > >
> > >          /-- wi1 --\
> > >         / ..     .. \
> > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> > >         \ ..    ..  /
> > >          \-- wiN --/
> > >
> > >
> > > All that we have achieved is waking up N CPUs to contend on the same
> > > lock and effectively insert the job into the same single HW queue. I
> > > don't see any positives there.
> > >
> > > This example I think can particularly hurt small / low power devices
> > > because of needless waking up of many cores for no benefit. Granted, I
> > > don't have a good feel on how common this pattern is in practice.
> > >
> > > >
> > > >     That
> > > >     is the number which drives the maximum number of not-runnable
> jobs
> > > that
> > > >     can become runnable at once, and hence spawn that many work
> items,
> > > and
> > > >     in turn unbound worker threads.
> > > >
> > > >     Several problems there.
> > > >
> > > >     It is fundamentally pointless to have potentially that many more
> > > >     threads
> > > >     than the number of CPU cores - it simply creates a scheduling
> storm.
> > > >
> > > >     Unbound workers have no CPU / cache locality either and no
> connection
> > > >     with the CPU scheduler to optimize scheduling patterns. This may
> > > matter
> > > >     either on large systems or on small ones. Whereas the current
> design
> > > >     allows the scheduler to notice a userspace CPU thread keeps waking
> up
> > > the
> > > >     same drm scheduler kernel thread, and so it can keep them on the
> same
> > > >     CPU, the unbound workers lose that ability and so 2nd CPU might
> be
> > > >     getting woken up from low sleep for every submission.
> > > >
> > > >     Hence, apart from being a bit of an impedance mismatch, the
> > > >     proposal has the potential to change performance and power
> > > >     patterns on both large and small machines.
> > > >
> > > >
> > > > Ok, thanks for explaining the issue you're seeing in more detail.
> Yes,
> > > > deferred kwork does appear to mismatch somewhat with what the
> scheduler
> > > > needs or at least how it's worked in the past.  How much impact will
> > > > that mismatch have?  Unclear.
> > > >
> > > >      >      >>> Secondly, it probably demands separate workers (not
> > > >     optional),
> > > >      >     otherwise
> > > >      >      >>> behaviour of shared workqueues has either the
> > > >      >      >>> potential to explode the number of kernel threads
> > > >      >      >>> anyway, or add latency.
> > > >      >      >>>
> > > >      >      >>
> > > >      >      >> Right now the system_unbound_wq is used which does
> have a
> > > >     limit
> > > >      >     on the
> > > >      >      >> number of threads, right? I do have a FIXME to allow a
> > > >     worker to be
> > > >      >      >> passed in similar to TDR.
> > > >      >      >>
> > > >      >      >> WRT latency, the 1:1 ratio could actually have
> lower
> > > >     latency
> > > >      >     as 2 GPU
> > > >      >      >> schedulers can be pushing jobs into the backend /
> > > cleaning up
> > > >      >     jobs in
> > > >      >      >> parallel.
> > > >      >      >>
> > > >      >      >
> > > >      >      > Thought of one more point here on why in Xe we
> > > >      >      > absolutely want a 1 to
> > > >      >      > 1 ratio between entity and scheduler - the way we
> implement
> > > >      >     timeslicing
> > > >      >      > for preempt fences.
> > > >      >      >
> > > >      >      > Let me try to explain.
> > > >      >      >
> > > >      >      > Preempt fences are implemented via the generic
> messaging
> > > >      >     interface [1]
> > > >      >      > with suspend / resume messages. If a suspend messages
> is
> > > >     received to
> > > >      >      > soon after calling resume (this is per entity) we
> simply
> > > >     sleep in the
> > > >      >      > suspend call thus giving the entity a timeslice. This
> > > >     completely
> > > >      >     falls
> > > >      >      > apart with a many to 1 relationship as now an entity
> > > >      >      > waiting for a timeslice blocks the other entities.
> > > >      >      > Could we work around this, sure, but that is just
> > > >      >      > another bunch of code we'd have to add in Xe. Being
> > > >      >      > able to freely sleep in the backend without affecting
> > > >      >      > other entities is really, really nice IMO
> > > >      >      > and I bet Xe isn't the only driver that is going to
> feel
> > > >     this way.
> > > >      >      >
> > > >      >      > Last thing I'll say regardless of how anyone feels
> about
> > > >     Xe using
> > > >      >     a 1 to
> > > >      >      > 1 relationship this patch IMO makes sense as I hope we
> can
> > > all
> > > >      >     agree a
> > > >      >      > workqueue scales better than kthreads.
> > > >      >
> > > >      >     I don't know for sure what will scale better and for what
> use
> > > >     case,
> > > >      >     combination of CPU cores vs number of GPU engines to keep
> > > >     busy vs other
> > > >      >     system activity. But I wager someone is bound to ask for
> some
> > > >      >     numbers to
> > > >      >     make sure the proposal is not negatively affecting any other
> > > drivers.
> > > >      >
> > > >      >
> > > >      > Then let them ask.  Waving your hands vaguely in the
> direction of
> > > >     the
> > > >      > rest of DRM and saying "Uh, someone (not me) might object" is
> > > >     profoundly
> > > >      > unhelpful.  Sure, someone might.  That's why it's on
> dri-devel.
> > > >     If you
> > > >      > think there's someone in particular who might have a useful
> > > >     opinion on
> > > >      > this, throw them in the CC so they don't miss the e-mail
> thread.
> > > >      >
> > > >      > Or are you asking for numbers?  If so, what numbers are you
> > > >     asking for?
> > > >
> > > >     It was a heads up to the Xe team in case people weren't appreciating
> > > >     how the proposed change has the potential to influence power and
> > > >     performance across the board. And nothing in the follow up discussion
> > > >     made me think it was considered, so I don't think it was redundant to
> > > >     raise it.
> > > >
> > > >     In my experience it is typical that such core changes come with some
> > > >     numbers. Which in the case of the drm scheduler is tricky and
> > > >     probably requires explicitly asking everyone to test (rather than
> > > >     counting on "don't miss the email thread"). Real products can fail to
> > > >     ship due to ten mW here or there. Like suddenly an extra core
> > > >     prevented from getting into deep sleep.
> > > >
> > > >     If that was "profoundly unhelpful" so be it.
> > > >
> > > >
> > > > With your above explanation, it makes more sense what you're asking.
> > > > It's still not something Matt is likely to be able to provide on his
> > > > own.  We need to tag some other folks and ask them to test it out.
> We
> > > > could play around a bit with it on Xe but it's not exactly production
> > > > grade yet and is going to hit this differently from most.  Likely
> > > > candidates are probably AMD and Freedreno.
> > >
> > > Whoever is set up to check out power and performance would be good to
> > > give it a spin, yes.
> > >
> > > PS. I don't think I was asking Matt to test with other devices. To start
> > > with I think Xe is a team effort. I was asking for more background on
> > > the design decision, since patch 4/20 does not say anything on that
> > > angle, nor was it IMO sufficiently addressed later in the thread.
> > >
> > > >      > Also, if we're talking about a design that might paint us into an
> > > >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> > > >     We're
> > > >      > talking about switching which kernel threading/task mechanism
> to
> > > >     use for
> > > >      > what's really a very generic problem.  The core Xe design
> works
> > > >     without
> > > >      > this patch (just with more kthreads).  If we land this patch
> or
> > > >      > something like it and get it wrong and it causes a performance
> > > >     problem
> > > >      > for someone down the line, we can revisit it.
> > > >
> > > >     For some definition of "it works" - I really wouldn't suggest
> > > >     shipping a
> > > >     kthread per user context at any point.
> > > >
> > > >
> > > > You have yet to elaborate on why. What resources is it consuming
> that's
> > > > going to be a problem? Are you anticipating CPU affinity problems? Or
> > > > does it just seem wasteful?
> > >
> > > Well I don't know, the commit message says the approach does not scale. :)
> > >
> > > > I think I largely agree that it's probably unnecessary/wasteful but
> > > > reducing the number of kthreads seems like a tractable problem to
> solve
> > > > regardless of where we put the gpu_scheduler object.  Is this the
> right
> > > > solution?  Maybe not.  It was also proposed at one point that we
> could
> > > > split the scheduler into two pieces: A scheduler which owns the
> kthread,
> > > > and a back-end which targets some HW ring thing where you can have
> > > > multiple back-ends per scheduler.  That's certainly more invasive
> from a
> > > > DRM scheduler internal API PoV but would solve the kthread problem
> in a
> > > > way that's more similar to what we have now.
> > > >
> > > >      >     In any case that's a low level question caused by the high
> > > >     level design
> > > >      >     decision. So I'd think first focus on the high level -
> which
> > > >     is the 1:1
> > > >      >     mapping of entity to scheduler instance proposal.
> > > >      >
> > > >      >     Fundamentally it will be up to the DRM maintainers and the
> > > >     community to
> > > >      >     bless your approach. And it is important to stress 1:1 is
> > > about
> > > >      >     userspace contexts, so I believe unlike any other current
> > > >     scheduler
> > > >      >     user. And also important to stress this effectively does
> not
> > > >     make Xe
> > > >      >     _really_ use the scheduler that much.
> > > >      >
> > > >      >
> > > >      > I don't think this makes Xe nearly as much of a one-off as you
> > > >     think it
> > > >      > does.  I've already told the Asahi team working on Apple M1/2
> > > >     hardware
> > > >      > to do it this way and it seems to be a pretty good mapping for
> > > >     them. I
> > > >      > believe this is roughly the plan for nouveau as well.  It's
> not
> > > >     the way
> > > >      > it currently works for anyone because most other groups aren't
> > > >     doing FW
> > > >      > scheduling yet.  In the world of FW scheduling and hardware
> > > >     designed to
> > > >      > support userspace direct-to-FW submit, I think the design
> makes
> > > >     perfect
> > > >      > sense (see below) and I expect we'll see more drivers move in
> this
> > > >      > direction as those drivers evolve.  (AMD is doing some customish
> > > >      > thing with gpu_scheduler on the front-end somehow. I've not dug
> > > >      > into those details.)
> > > >      >
> > > >      >     I can only offer my opinion, which is that the two options
> > > >     mentioned in
> > > >      >     this thread (either improve drm scheduler to cope with
> what is
> > > >      >     required,
> > > >      >     or split up the code so you can use just the parts of
> > > >     drm_sched which
> > > >      >     you want - which is frontend dependency tracking)
> shouldn't
> > > be so
> > > >      >     readily dismissed, given how I think the idea was for the
> new
> > > >     driver to
> > > >      >     work less in a silo and more in the community (not do
> kludges
> > > to
> > > >      >     workaround stuff because it is thought to be too hard to
> > > >     improve common
> > > >      >     code), but fundamentally, "goto previous paragraph" as far as
> > > >      >     I am concerned.
> > > >      >
> > > >      >
> > > >      > Meta comment:  It appears as if you're falling into the
> standard
> > > >     i915
> > > >      > team trap of having an internal discussion about what the
> > > community
> > > >      > discussion might look like instead of actually having the
> > > community
> > > >      > discussion.  If you are seriously concerned about interactions
> > > >      > with other drivers or with setting common direction, the right
> > > >      > way to
> > > >      > do that is to break a patch or two out into a separate RFC
> series
> > > >     and
> > > >      > tag a handful of driver maintainers.  Trying to predict the
> > > >     questions
> > > >      > other people might ask is pointless. Cc them and ask for their
> > > >      > input instead.
> > > >
> > > >     I don't follow you here. It's not an internal discussion - I am
> > > raising
> > > >     my concerns on the design publicly. Am I supposed to write a patch
> > > >     to show something before I am allowed to comment on an RFC series?
> > > >
> > > >
> > > > I may have misread your tone a bit.  It felt a bit like too many
> > > > discussions I've had in the past where people are trying to predict
> what
> > > > others will say instead of just asking them.  Reading it again, I was
> > > > probably jumping to conclusions a bit.  Sorry about that.
> > >
> > > Okay no problem, thanks. In any case we don't have to keep discussing
> > > it; as I wrote one or two emails ago, it is fundamentally on the
> > > maintainers and community to ack the approach. I only felt the RFC did
> > > not explain the potential downsides sufficiently so I wanted to probe
> > > that area a bit.
> > >
> > > >     It is "drm/sched: Convert drm scheduler to use a work queue
> rather
> > > than
> > > >     kthread" which should have Cc-ed _everyone_ who use drm
> scheduler.
> > > >
> > > >
> > > > Yeah, it probably should have.  I think that's mostly what I've been
> > > > trying to say.
> > > >
> > > >      >
> > > >      >     Regards,
> > > >      >
> > > >      >     Tvrtko
> > > >      >
> > > >      >     P.S. And as a related side note, there are more areas
> where
> > > >     drm_sched
> > > >      >     could be improved, like for instance priority handling.
> > > >      >     Take a look at msm_submitqueue_create /
> > > >     msm_gpu_convert_priority /
> > > >      >     get_sched_entity to see how msm works around the drm_sched
> > > >     hardcoded
> > > >      >     limit of available priority levels, in order to avoid
> having
> > > >     to leave a
> > > >      >     hw capability unused. I suspect msm would be happier if
> they
> > > >     could have
> > > >      >     all priority levels equal in terms of whether they apply
> only
> > > >     at the
> > > >      >     frontend level or completely throughout the pipeline.
> > > >      >
> > > >      >      > [1]
> > > >      >
> > > >
> https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >     <
> https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >
> > > >      >
> > > >       <
> > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 <
> > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>
> > > >      >      >
> > > >      >      >>> What would be interesting to learn is whether the
> option
> > > of
> > > >      >     refactoring
> > > >      >      >>> drm_sched to deal with out of order completion was
> > > >     considered
> > > >      >     and what were
> > > >      >      >>> the conclusions.
> > > >      >      >>>
> > > >      >      >>
> > > >      >      >> I coded this up a while back when trying to convert the
> > > >      >      >> i915 to the DRM scheduler; it isn't all that hard either.
> > > >      >      >> The free flow control on the ring (e.g. set job limit ==
> > > >      >      >> SIZE OF RING / MAX JOB SIZE) is really what sold me on
> > > >      >      >> this design.
> > > >      >
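
For reference, a sketch of that ring-based flow control (constants are
illustrative; the drm_sched_init() signature shown matches kernels of
roughly this era and has changed in later versions):

    /* Cap in-flight jobs so their worst-case ring usage can never
     * overflow the ring, letting drm_sched's hw_submission limit do
     * the throttling for free. */
    #define EXAMPLE_RING_SIZE      (16 * 1024)   /* bytes */
    #define EXAMPLE_MAX_JOB_BYTES  256           /* worst case per job */

    u32 job_limit = EXAMPLE_RING_SIZE / EXAMPLE_MAX_JOB_BYTES;  /* 64 */

    ret = drm_sched_init(&engine->sched, &example_sched_ops, job_limit,
                         0 /* hang_limit */, 5 * HZ /* timeout */,
                         NULL /* timeout_wq */, NULL /* score */,
                         "example", dev);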
> > > >      >
> > > >      > You're not the only one to suggest supporting out-of-order
> > > >     completion.
> > > >      > However, it's tricky and breaks a lot of internal assumptions
> of
> > > the
> > > >      > scheduler. It also reduces functionality a bit because it can
> no
> > > >     longer
> > > >      > automatically rate-limit HW/FW queues which are often
> > > >     fixed-size.  (Ok,
> > > >      > yes, it probably could but it becomes a substantially harder
> > > >     problem.)
> > > >      >
> > > >      > It also seems like a worse mapping to me.  The goal here is to
> > > turn
> > > >      > submissions on a userspace-facing engine/queue into submissions
> > > >      > to a FW queue, sorting out any dma_fence dependencies.
> Matt's
> > > >      > description of saying this is a 1:1 mapping between
> sched/entity
> > > >     doesn't
> > > >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > > >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> > > >     mapping?
> > > >      > Why is that better?
> > > >
> > > >     As I have stated before, what I think would fit well for Xe is one
> > > >     drm_scheduler per engine class. In specific terms on our current
> > > >     hardware, one drm scheduler instance for render, compute, blitter,
> > > >     video and video enhance. Userspace contexts remain scheduler
> > > >     entities.
> > > >
> > > >
> > > > And this is where we fairly strongly disagree.  More in a bit.
> > > >
> > > >     That way you avoid the whole kthread/kworker story and you have
> it
> > > >     actually use the entity picking code in the scheduler, which may
> be
> > > >     useful when the backend is congested.
> > > >
> > > >
> > > > What back-end congestion are you referring to here?  Running out of
> FW
> > > > queue IDs?  Something else?
> > >
> > > CT channel, number of context ids.
> > >
> > > >
> > > >     Yes you have to solve the out of order problem so in my mind
> that is
> > > >     something to discuss. What the problem actually is (just TDR?),
> how
> > > >     tricky and why etc.
> > > >
> > > >     And yes you lose the handy LRCA ring buffer size management so
> you'd
> > > >     have to make those entities not runnable in some other way.
> > > >
> > > >     Regarding the argument you raise below - would any of that make
> the
> > > >     frontend / backend separation worse and why? Do you think it is
> less
> > > >     natural? If neither is true, then all that remains is that the extra
> > > >     work to support out of order completion of entities appears to have
> > > >     been discounted in favour of an easy but IMO inelegant option.
> > > >
> > > >
> > > > Broadly speaking, the kernel needs to stop thinking about GPU
> scheduling
> > > > in terms of scheduling jobs and start thinking in terms of scheduling
> > > > contexts/engines.  There is still some need for scheduling individual
> > > > jobs but that is only for the purpose of delaying them as needed to
> > > > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > > > get shoved onto the context/engine queue and from there the kernel
> only
> > > > really manages whole contexts/engines.  This is a major architectural
> > > > shift, entirely different from the way i915 scheduling works.  It's
> also
> > > > different from the historical usage of DRM scheduler which I think is
> > > > why this all looks a bit funny.
> > > >
> > > > To justify this architectural shift, let's look at where we're
> headed.
> > > > In the glorious future...
> > > >
> > > >   1. Userspace submits directly to firmware queues.  The kernel has
> no
> > > > visibility whatsoever into individual jobs.  At most it can
> pause/resume
> > > > FW contexts as needed to handle eviction and memory management.
> > > >
> > > >   2. Because of 1, apart from handing out the FW queue IDs at the
> > > > beginning, the kernel can't really juggle them that much.  Depending
> on
> > > > FW design, it may be able to pause a client, give its IDs to another,
> > > > and then resume it later when IDs free up.  What it's not doing is
> > > > juggling IDs on a job-by-job basis like i915 currently is.
> > > >
> > > >   3. Long-running compute jobs may not complete for days.  This means
> > > > that memory management needs to happen in terms of pause/resume of
> > > > entire contexts/engines using the memory rather than based on waiting
> > > > for individual jobs to complete or pausing individual jobs until the
> > > > memory is available.
> > > >
> > > >   4. Synchronization happens via userspace memory fences (UMF) and
> the
> > > > kernel is mostly unaware of most dependencies and when a
> context/engine
> > > > is or is not runnable.  Instead, it keeps as many of them minimally
> > > > active (memory is available, even if it's in system RAM) as possible
> and
> > > > lets the FW sort out dependencies.  (There may need to be some
> facility
> > > > for sleeping a context until a memory change similar to futex() or
> > > > poll() for userspace threads.  There are some details TBD.)
> > > >
> > > > Are there potential problems that will need to be solved here?
> Yes.  Is
> > > > it a good design?  Well, Microsoft has been living in this future for
> > > > half a decade or better and it's working quite well for them.  It's
> also
> > > > the way all modern game consoles work.  It really is just Linux
> that's
> > > > stuck with the same old job model we've had since the monumental
> shift
> > > > to DRI2.
> > > >
> > > > To that end, one of the core goals of the Xe project was to make the
> > > > driver internally behave as close to the above model as possible
> while
> > > > keeping the old-school job model as a very thin layer on top.  As the
> > > > broader ecosystem problems (window-system support for UMF, for
> instance)
> > > > are solved, that layer can be peeled back.  The core driver will
> already
> > > > be ready for it.
> > > >
> > > > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > > > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > > > and stuff jobs into their respective context/engine queue once
> they're
> > > > ready.  All the actual scheduling happens in firmware and any
> scheduling
> > > > the kernel does to deal with contention, oversubscriptions, too many
> > > > contexts, etc. is between contexts/engines, not individual jobs.
> Sure,
> > > > the individual job visibility is nice, but if we design around it,
> we'll
> > > > never get to the glorious future.
> > > >
> > > > I really need to turn the above (with a bit more detail) into a blog
> > > > post.... Maybe I'll do that this week.
> > > >
> > > > In any case, I hope that provides more insight into why Xe is
> designed
> > > > the way it is and why I'm pushing back so hard on trying to make it
> more
> > > > of a "classic" driver as far as scheduling is concerned.  Are there
> > > > potential problems here?  Yes, that's why Xe has been labeled a
> > > > prototype.  Are such radical changes necessary to get to said
> glorious
> > > > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> > >
> > > Right, that's all solid I think. My takeaway is that frontend priority
> > > sorting and that stuff isn't needed and that is okay. And that there
> are
> > > multiple options to maybe improve drm scheduler, like the aforementioned
> > > making it deal with out of order completion, or splitting it into
> > > functional components, or splitting frontend/backend as you suggested.
> > > For most of them the cost vs benefit is more or less not completely
> > > clear, nor is it clear how much effort was invested to look into them.
> > >
> > > One thing I missed from this explanation is how drm_scheduler per
> engine
> > > class interferes with the high level concepts. And I did not manage to
> > > pick up on what exactly is the TDR problem in that case. Maybe the two
> > > are one and the same.
> > >
> > > Bottom line is I still have the concern that conversion to kworkers has
> > > an opportunity to regress. Possibly more opportunity for some Xe use
> > > cases than to affect other vendors, since they would still be using per
> > > physical engine / queue scheduler instances.
> > >
> > > And to put my money where my mouth is I will try to put testing Xe
> > > inside the full blown ChromeOS environment into my team's plans. It would
> > > probably also be beneficial if Xe team could take a look at real world
> > > behaviour of the extreme transcode use cases too. If the stack is ready
> > > for that and all. It would be better to know earlier rather than later
> > > if there is a fundamental issue.
> > >
> > > For the patch at hand, and the cover letter, it certainly feels they
> > > would benefit from recording the past design discussion had with AMD
> > > folks, from explicitly copying other drivers, and from recording the
> > > theoretical pros and cons of threads vs unbound workers as I have
> > > tried to highlight them.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
>

[-- Attachment #2: Type: text/html, Size: 48571 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-11 22:56                               ` Jason Ekstrand
  0 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-11 22:56 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

[-- Attachment #1: Type: text/plain, Size: 36928 bytes --]

On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.brost@intel.com>
wrote:

> On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> > tvrtko.ursulin@linux.intel.com> wrote:
> >
> > >
> > > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > > <tvrtko.ursulin@linux.intel.com <mailto:
> tvrtko.ursulin@linux.intel.com>>
> > >
> > > > wrote:
> > > >
> > > >
> > > >
> > > >     On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > >
> > > >     [snip]
> > > >
> > > >      >      >>> AFAICT it proposes to have 1:1 between *userspace*
> > > created
> > > >      >     contexts (per
> > > >      >      >>> context _and_ engine) and drm_sched. I am not sure
> > > avoiding
> > > >      >     invasive changes
> > > >      >      >>> to the shared code is in the spirit of the overall idea,
> > > >      >      >>> and instead the opportunity should be used to look at
> > > >      >      >>> ways to refactor/improve drm_sched.
> > > >      >
> > > >      >
> > > >      > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > > >     all or
> > > >      > really needs to drive a re-factor.  (More on that later.)
> > > >     There's only
> > > >      > one real issue which is that it fires off potentially a lot of
> > > >     kthreads.
> > > >      > Even that's not that bad given that kthreads are pretty light
> and
> > > >     you're
> > > >      > not likely to have more kthreads than userspace threads which
> are
> > > >     much
> > > >      > heavier.  Not ideal, but not the end of the world either.
> > > >     Definitely
> > > >      > something we can/should optimize but if we went through with
> Xe
> > > >     without
> > > >      > this patch, it would probably be mostly ok.
> > > >      >
> > > >      >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > >      >      >>
> > > >      >      >> I'm not really prepared to make large changes to DRM
> > > >     scheduler
> > > >      >     at the
> > > >      >      >> moment for Xe as they are not really required, nor does
> > > >      >      >> Boris seem to think they will be required for his work
> > > >      >      >> either. I am interested to see what Boris comes up with.
> > > >      >      >>
> > > >      >      >>> Even on the low level, the idea to replace drm_sched
> > > threads
> > > >      >     with workers
> > > >      >      >>> has a few problems.
> > > >      >      >>>
> > > >      >      >>> To start with, the pattern of:
> > > >      >      >>>
> > > >      >      >>>    while (not_stopped) {
> > > >      >      >>>     keep picking jobs
> > > >      >      >>>    }
> > > >      >      >>>
> > > >      >      >>> Feels fundamentally in disagreement with workers (while
> > > >      >      >>> it obviously fits perfectly with the current kthread
> > > >      >      >>> design).
> > > >      >      >>
> > > >      >      >> The while loop breaks and the worker exits if no jobs are
> > > >      >      >> ready.
> > > >      >
> > > >      >
> > > >      > I'm not very familiar with workqueues. What are you saying
> would
> > > fit
> > > >      > better? One scheduling job per work item rather than one big
> work
> > > >     item
> > > >      > which handles all available jobs?
> > > >
> > > >     Yes and no; it indeed IMO does not fit to have a work item which is
> > > >     potentially unbound in runtime. But it is a bit of a moot conceptual
> > > >     mismatch, because it is a worst case / theoretical, and because I
> > > >     think there are more fundamental concerns.
> > > >
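
For what it's worth, a sketch of the alternative shape being discussed
here - one job per work item, re-queued while work remains, instead of a
long-running loop (all names are illustrative, not an existing interface):

    #include <linux/workqueue.h>

    struct example_sched {
            struct workqueue_struct *wq;
            struct work_struct work;
            /* ... */
    };

    /* Process a single job per work item and re-queue while more jobs
     * are pending, so no one work item hogs a CPU the way an unbounded
     * while loop could. */
    static void example_sched_work(struct work_struct *w)
    {
            struct example_sched *s =
                    container_of(w, struct example_sched, work);
            struct example_job *job = example_pick_job(s);

            if (!job)
                    return;

            example_submit_job(s, job);

            if (example_jobs_pending(s))
                    queue_work(s->wq, &s->work);
    }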
> > > >     If we have to go back to the low level side of things, I've
> picked
> > > this
> > > >     random spot to consolidate what I have already mentioned and
> perhaps
> > > >     expand.
> > > >
> > > >     To start with, let me pull out some thoughts from workqueue.rst:
> > > >
> > > >     """
> > > >     Generally, work items are not expected to hog a CPU and consume
> many
> > > >     cycles. That means maintaining just enough concurrency to prevent
> > > work
> > > >     processing from stalling should be optimal.
> > > >     """
> > > >
> > > >     For unbound queues:
> > > >     """
> > > >     The responsibility of regulating concurrency level is on the
> users.
> > > >     """
> > > >
> > > >     Given that unbound workers will be spawned on demand to service all
> > > >     queued work items (more interesting when mixing up with the
> > > >     system_unbound_wq),
> > > >     in the proposed design the number of instantiated worker threads
> does
> > > >     not correspond to the number of user threads (as you have
> elsewhere
> > > >     stated), but pessimistically to the number of active user
> contexts.
> > > >
> > > >
> > > > Those are pretty much the same in practice.  Rather, the number of user
> > > > threads is typically an upper bound on the number of contexts.  Yes, a single
> user
> > > > thread could have a bunch of contexts but basically nothing does that
> > > > except IGT.  In real-world usage, it's at most one context per user
> > > thread.
> > >
> > > Typically is the key here. But I am not sure it is good enough.
> Consider
> > > this example - Intel Flex 170:
> > >
> > >   * Delivers up to 36 streams 1080p60 transcode throughput per card.
> > >   * When scaled to 10 cards in a 4U server configuration, it can
> support
> > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> > >
> >
> > I had a feeling it was going to be media.... 😅
> >
>
> Yea, wondering if the media UMD can be rewritten to use fewer xe_engines;
> it is a massive rewrite for VM bind + no implicit dependencies, so let's
> just pile on some more work?
>

It could probably use fewer than it does today.  It currently creates and
throws away contexts like crazy, or did last I looked at it.  However, the
nature of media encode is that it often spreads across two or three
different types of engines.  There's not much you can do to change that.


> >
> > > One transcode stream from my experience typically is 3-4 GPU contexts
> > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> > > CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by
> > > 60fps = 8640 jobs submitted and completed per second.
> > >
> > > 144 active contexts in the proposed scheme possibly means 144
> > > kernel worker threads spawned (driven by 36 transcode CPU threads). (I
> > > don't think the pools would scale down given all are constantly pinged
> > > at 60fps.)
> > >
> > > And then each of the 144 threads goes to grab the single GuC CT mutex.
> > > First the threads are made schedulable, then put to sleep as mutex
> > > contention is hit, then woken again as mutexes are getting released,
> > > rinse, repeat.
> > >
> >
> > Why is every submission grabbing the GuC CT mutex?  I've not read the GuC
> > back-end yet but I was under the impression that most run_job() would be
> > just shoving another packet into a ring buffer.  If we have to send the
> GuC
> > a message on the control ring every single time we submit a job, that's
> > pretty horrible.
> >
>
> Run job writes the ring buffer and moves the tail as the first step (no
> lock required). Next it needs to tell the GuC that the xe_engine LRC tail
> has moved; this is done through a single Host to GuC channel, which is a
> circular buffer, with writes to the channel protected by the mutex. There
> are a few more nuances too, but in practice there is always space in the
> channel, so the time the mutex needs to be held is really, really small
> (check cached credits, write 3 dwords in payload, write 1 dword to move
> the tail). I also believe mutexes in Linux are hybrid, where they spin for
> a little bit before sleeping, and certainly if there is space in the
> channel we shouldn't sleep on mutex contention.
>
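
A hedged sketch of the submission path described above (every name here is
illustrative, not the actual Xe code):

    /* Step 1 is lock-free: only this engine's worker touches its ring.
     * Step 2 is the short critical section on the shared CT channel:
     * roughly 4 dwords written under the mutex. */
    static void example_run_job(struct example_engine *e,
                                struct example_job *job)
    {
            example_ring_emit(e->ring, job);     /* write cmds, bump tail */

            mutex_lock(&e->guc->ct_lock);
            example_ct_write(e->guc, e->guc_id, e->ring->tail);
            mutex_unlock(&e->guc->ct_lock);
    }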

Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we had
some way to batch things up a bit so we only have to poke the GuC channel
once for every batch of things rather than once per job.  That's maybe
something we can look into as a future improvement; not fundamental.
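
The batching idea could be as simple as marking engines dirty and flushing
all pending tail updates with one poke of the channel - purely illustrative,
not an existing interface:

    /* Hypothetical: accumulate tail updates, write the CT channel once
     * per flush instead of once per job. */
    static void example_flush_tails(struct example_guc *guc)
    {
            struct example_engine *e, *tmp;

            mutex_lock(&guc->ct_lock);
            list_for_each_entry_safe(e, tmp, &guc->dirty_engines, dirty_link) {
                    example_ct_write(guc, e->guc_id, e->ring->tail);
                    list_del_init(&e->dirty_link);
            }
            mutex_unlock(&guc->ct_lock);
    }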

Generally, though, it sounds like contention could be a real problem if we
end up ping-ponging that lock between cores.  It's going to depend on how
much work it takes to get the next ready thing vs. the cost of that
atomic.  But, also, anything we do is going to potentially run into
contention problems.  *shrug*  If we were going to go for
one-per-HW-engine, we may as well go one-per-device and then we wouldn't
need the lock.  Off the top of my head, that doesn't sound great either but
IDK.


> As far as this being horrible, well, I didn't design the GuC, and this is
> how it is implemented for KMD based submission. We also have 256 doorbells
> so we wouldn't need a lock, but I think there are other issues with that
> design too which need to be worked out in the Xe2 / Xe3 timeframe.
>

Yeah, not blaming you.  Just surprised, that's all.  How does it work for
userspace submission?  What would it look like if the kernel emulated
userspace submission?  Is that even possible?

What are these doorbell things?  How do they play into it?


> Also, if you see my follow up response, Xe is ~33k execs per second with
> the current implementation on an 8 core (or maybe 8 thread) TGL, which
> seems fine to me.
>

33k exec/sec is about 550/frame at 60 FPS, which should be fine.  That is a
lot for a single frame.  I typically tell game devs to shoot for dozens per frame.
The important thing is that it stays low even with hundreds of memory
objects bound.  (Xe should be just fine there.)

--Jason



> Matt
>
> > --Jason
> >
> >
> > (And yes this backend contention is there regardless of 1:1:1, it would
> > > require a different re-design to solve that. But it is just a question of
> > > whether there are 144 contending threads, or just 6 with the thread per
> > > engine class scheme.)
> > >
> > > Then multiply all by 10 for a 4U server use case and you get 1440
> worker
> > > kthreads, yes 10 more CT locks, but contending on how many CPU cores?
> > > Just so they can grab a timeslice and maybe contend on a mutex as the
> > > next step.
> > >
> > > This example is where it would hurt on large systems. Imagine only an
> > > even wider media transcode card...
> > >
> > > Second example is only a single engine class used (3d desktop?) but
> with
> > > a bunch of not-runnable jobs queued and waiting on a fence to signal.
> > > Implicit or explicit dependencies, it doesn't matter. Then the fence
> > > signals and callbacks run. N work items get scheduled, but they all
> > > submit to
> > > the same HW engine. So we end up with:
> > >
> > >          /-- wi1 --\
> > >         / ..     .. \
> > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> > >         \ ..    ..  /
> > >          \-- wiN --/
> > >
> > >
> > > All that we have achieved is waking up N CPUs to contend on the same
> > > lock and effectively insert the job into the same single HW queue. I
> > > don't see any positives there.
> > >
> > > This example I think can particularly hurt small / low power devices
> > > because of needless waking up of many cores for no benefit. Granted, I
> > > don't have a good feel on how common this pattern is in practice.
> > >
> > > >
> > > >     That
> > > >     is the number which drives the maximum number of not-runnable
> jobs
> > > that
> > > >     can become runnable at once, and hence spawn that many work
> items,
> > > and
> > > >     in turn unbound worker threads.
> > > >
> > > >     Several problems there.
> > > >
> > > >     It is fundamentally pointless to have potentially that many more
> > > >     threads
> > > >     than the number of CPU cores - it simply creates a scheduling
> storm.
> > > >
> > > >     Unbound workers have no CPU / cache locality either and no
> connection
> > > >     with the CPU scheduler to optimize scheduling patterns. This may
> > > matter
> > > >     either on large systems or on small ones. Whereas the current
> design
> > > >     allows for scheduler to notice userspace CPU thread keeps waking
> up
> > > the
> > > >     same drm scheduler kernel thread, and so it can keep them on the
> same
> > > >     CPU, the unbound workers lose that ability and so a 2nd CPU might be
> > > >     getting woken up from low-power sleep for every submission.
> > > >
> > > >     Hence, apart from being a bit of an impedance mismatch, the proposal
> > > >     has the potential to change performance and power patterns on both
> > > >     large and small machines.
> > > >
> > > >
> > > > Ok, thanks for explaining the issue you're seeing in more detail.
> Yes,
> > > > deferred kwork does appear to mismatch somewhat with what the
> scheduler
> > > > needs or at least how it's worked in the past.  How much impact will
> > > > that mismatch have?  Unclear.
> > > >
> > > >      >      >>> Secondly, it probably demands separate workers (not
> > > >     optional),
> > > >      >     otherwise
> > > >      >      >>> behaviour of shared workqueues has either the potential
> > > >      >      >>> to explode the number of kernel threads anyway, or add
> > > >      >      >>> latency.
> > > >      >      >>>
> > > >      >      >>
> > > >      >      >> Right now the system_unbound_wq is used which does
> have a
> > > >     limit
> > > >      >     on the
> > > >      >      >> number of threads, right? I do have a FIXME to allow a
> > > >     worker to be
> > > >      >      >> passed in similar to TDR.
> > > >      >      >>
> > > >      >      >> WRT latency, the 1:1 ratio could actually have lower
> > > >      >      >> latency as 2 GPU schedulers can be pushing jobs into the
> > > >      >      >> backend / cleaning up jobs in parallel.
> > > >      >      >>
> > > >      >      >
> > > >      >      > Thought of one more point here on why in Xe we absolutely
> > > >      >      > want a 1 to 1 ratio between entity and scheduler - the way
> > > >      >      > we implement timeslicing for preempt fences.
> > > >      >      >
> > > >      >      > Let me try to explain.
> > > >      >      >
> > > >      >      > Preempt fences are implemented via the generic messaging
> > > >      >      > interface [1] with suspend / resume messages. If a suspend
> > > >      >      > message is received too soon after calling resume (this is
> > > >      >      > per entity) we simply sleep in the suspend call, thus
> > > >      >      > giving the entity a timeslice. This completely falls apart
> > > >      >      > with a many to 1 relationship as now an entity waiting for
> > > >      >      > a timeslice blocks the other entities. Could we work
> > > >      >      > around this, sure, but that is just another bunch of code
> > > >      >      > we'd have to add in Xe. Being able to freely sleep in the
> > > >      >      > backend without affecting other entities is really, really
> > > >      >      > nice IMO and I bet Xe isn't the only driver that is going
> > > >      >      > to feel this way.
> > > >      >      >
> > > >      >      > Last thing I'll say: regardless of how anyone feels about
> > > >      >      > Xe using a 1 to 1 relationship, this patch IMO makes sense
> > > >      >      > as I hope we can all agree a workqueue scales better than
> > > >      >      > kthreads.
> > > >      >
> > > >      >     I don't know for sure what will scale better and for what use
> > > >      >     case - combination of CPU cores vs number of GPU engines to
> > > >      >     keep busy vs other system activity. But I wager someone is
> > > >      >     bound to ask for some numbers to make sure the proposal is not
> > > >      >     negatively affecting any other drivers.
> > > >      >
> > > >      >
> > > >      > Then let them ask.  Waving your hands vaguely in the
> direction of
> > > >     the
> > > >      > rest of DRM and saying "Uh, someone (not me) might object" is
> > > >     profoundly
> > > >      > unhelpful.  Sure, someone might.  That's why it's on
> dri-devel.
> > > >     If you
> > > >      > think there's someone in particular who might have a useful
> > > >     opinion on
> > > >      > this, throw them in the CC so they don't miss the e-mail
> thread.
> > > >      >
> > > >      > Or are you asking for numbers?  If so, what numbers are you
> > > >     asking for?
> > > >
> > > >     It was a heads up to the Xe team in case people weren't appreciating
> > > >     how the proposed change has the potential to influence power and
> > > >     performance across the board. And nothing in the follow up discussion
> > > >     made me think it was considered, so I don't think it was redundant to
> > > >     raise it.
> > > >
> > > >     In my experience it is typical that such core changes come with some
> > > >     numbers. Which in the case of the drm scheduler is tricky and
> > > >     probably requires explicitly asking everyone to test (rather than
> > > >     counting on "don't miss the email thread"). Real products can fail to
> > > >     ship due to ten mW here or there. Like suddenly an extra core
> > > >     prevented from getting into deep sleep.
> > > >
> > > >     If that was "profoundly unhelpful" so be it.
> > > >
> > > >
> > > > With your above explanation, it makes more sense what you're asking.
> > > > It's still not something Matt is likely to be able to provide on his
> > > > own.  We need to tag some other folks and ask them to test it out.
> We
> > > > could play around a bit with it on Xe but it's not exactly production
> > > > grade yet and is going to hit this differently from most.  Likely
> > > > candidates are probably AMD and Freedreno.
> > >
> > > Whoever is set up to check out power and performance would be good to
> > > give it a spin, yes.
> > >
> > > PS. I don't think I was asking Matt to test with other devices. To start
> > > with I think Xe is a team effort. I was asking for more background on
> > > the design decision, since patch 4/20 does not say anything on that
> > > angle, nor was it IMO sufficiently addressed later in the thread.
> > >
> > > >      > Also, if we're talking about a design that might paint us into an
> > > >      > Intel-HW-specific hole, that would be one thing.  But we're not.
> > > >     We're
> > > >      > talking about switching which kernel threading/task mechanism
> to
> > > >     use for
> > > >      > what's really a very generic problem.  The core Xe design
> works
> > > >     without
> > > >      > this patch (just with more kthreads).  If we land this patch
> or
> > > >      > something like it and get it wrong and it causes a performance
> > > >     problem
> > > >      > for someone down the line, we can revisit it.
> > > >
> > > >     For some definition of "it works" - I really wouldn't suggest
> > > >     shipping a
> > > >     kthread per user context at any point.
> > > >
> > > >
> > > > You have yet to elaborate on why. What resources is it consuming
> that's
> > > > going to be a problem? Are you anticipating CPU affinity problems? Or
> > > > does it just seem wasteful?
> > >
> > > Well I don't know, the commit message says the approach does not scale. :)
> > >
> > > > I think I largely agree that it's probably unnecessary/wasteful but
> > > > reducing the number of kthreads seems like a tractable problem to
> solve
> > > > regardless of where we put the gpu_scheduler object.  Is this the
> right
> > > > solution?  Maybe not.  It was also proposed at one point that we
> could
> > > > split the scheduler into two pieces: A scheduler which owns the
> kthread,
> > > > and a back-end which targets some HW ring thing where you can have
> > > > multiple back-ends per scheduler.  That's certainly more invasive
> from a
> > > > DRM scheduler internal API PoV but would solve the kthread problem
> in a
> > > > way that's more similar to what we have now.
> > > >
> > > >      >     In any case that's a low level question caused by the high
> > > >     level design
> > > >      >     decision. So I'd think first focus on the high level -
> which
> > > >     is the 1:1
> > > >      >     mapping of entity to scheduler instance proposal.
> > > >      >
> > > >      >     Fundamentally it will be up to the DRM maintainers and the
> > > >     community to
> > > >      >     bless your approach. And it is important to stress 1:1 is
> > > about
> > > >      >     userspace contexts, so I believe unlike any other current
> > > >     scheduler
> > > >      >     user. And also important to stress this effectively does
> not
> > > >     make Xe
> > > >      >     _really_ use the scheduler that much.
> > > >      >
> > > >      >
> > > >      > I don't think this makes Xe nearly as much of a one-off as you
> > > >     think it
> > > >      > does.  I've already told the Asahi team working on Apple M1/2
> > > >     hardware
> > > >      > to do it this way and it seems to be a pretty good mapping for
> > > >     them. I
> > > >      > believe this is roughly the plan for nouveau as well.  It's
> not
> > > >     the way
> > > >      > it currently works for anyone because most other groups aren't
> > > >     doing FW
> > > >      > scheduling yet.  In the world of FW scheduling and hardware
> > > >     designed to
> > > >      > support userspace direct-to-FW submit, I think the design
> makes
> > > >     perfect
> > > >      > sense (see below) and I expect we'll see more drivers move in
> this
> > > >      > direction as those drivers evolve.  (AMD is doing some customish
> > > >      > thing with gpu_scheduler on the front-end somehow. I've not dug
> > > >      > into those details.)
> > > >      >
> > > >      >     I can only offer my opinion, which is that the two options
> > > >     mentioned in
> > > >      >     this thread (either improve drm scheduler to cope with
> what is
> > > >      >     required,
> > > >      >     or split up the code so you can use just the parts of
> > > >     drm_sched which
> > > >      >     you want - which is frontend dependency tracking)
> shouldn't
> > > be so
> > > >      >     readily dismissed, given how I think the idea was for the
> new
> > > >     driver to
> > > >      >     work less in a silo and more in the community (not do
> kludges
> > > to
> > > >      >     workaround stuff because it is thought to be too hard to
> > > >     improve common
> > > >      >     code), but fundamentally, "goto previous paragraph" as far as
> > > >      >     I am concerned.
> > > >      >
> > > >      >
> > > >      > Meta comment:  It appears as if you're falling into the
> standard
> > > >     i915
> > > >      > team trap of having an internal discussion about what the
> > > community
> > > >      > discussion might look like instead of actually having the
> > > community
> > > >      > discussion.  If you are seriously concerned about interactions
> > > >      > with other drivers or with setting common direction, the right
> > > >      > way to
> > > >      > do that is to break a patch or two out into a separate RFC
> series
> > > >     and
> > > >      > tag a handful of driver maintainers.  Trying to predict the
> > > >     questions
> > > >      > other people might ask is pointless. Cc them and ask for their
> > > >      > input instead.
> > > >
> > > >     I don't follow you here. It's not an internal discussion - I am
> > > raising
> > > >     my concerns on the design publicly. Am I supposed to write a patch
> > > >     to show something before I am allowed to comment on an RFC series?
> > > >
> > > >
> > > > I may have misread your tone a bit.  It felt a bit like too many
> > > > discussions I've had in the past where people are trying to predict
> what
> > > > others will say instead of just asking them.  Reading it again, I was
> > > > probably jumping to conclusions a bit.  Sorry about that.
> > >
> > > Okay no problem, thanks. In any case we don't have to keep discussing
> > > it; as I wrote one or two emails ago, it is fundamentally on the
> > > maintainers and community to ack the approach. I only felt the RFC did
> > > not explain the potential downsides sufficiently so I wanted to probe
> > > that area a bit.
> > >
> > > >     It is "drm/sched: Convert drm scheduler to use a work queue
> rather
> > > than
> > > >     kthread" which should have Cc-ed _everyone_ who use drm
> scheduler.
> > > >
> > > >
> > > > Yeah, it probably should have.  I think that's mostly what I've been
> > > > trying to say.
> > > >
> > > >      >
> > > >      >     Regards,
> > > >      >
> > > >      >     Tvrtko
> > > >      >
> > > >      >     P.S. And as a related side note, there are more areas
> where
> > > >     drm_sched
> > > >      >     could be improved, like for instance priority handling.
> > > >      >     Take a look at msm_submitqueue_create /
> > > >     msm_gpu_convert_priority /
> > > >      >     get_sched_entity to see how msm works around the drm_sched
> > > >     hardcoded
> > > >      >     limit of available priority levels, in order to avoid
> having
> > > >     to leave a
> > > >      >     hw capability unused. I suspect msm would be happier if
> they
> > > >     could have
> > > >      >     all priority levels equal in terms of whether they apply
> only
> > > >     at the
> > > >      >     frontend level or completely throughout the pipeline.
> > > >      >
> > > >      >      > [1]
> > > >      >
> > > >
> https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >     <
> https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> > > >
> > > >      >
> > > >       <
> > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 <
> > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>
> > > >      >      >
> > > >      >      >>> What would be interesting to learn is whether the
> option
> > > of
> > > >      >     refactoring
> > > >      >      >>> drm_sched to deal with out of order completion was
> > > >     considered
> > > >      >     and what were
> > > >      >      >>> the conclusions.
> > > >      >      >>>
> > > >      >      >>
> > > >      >      >> I coded this up a while back when trying to convert the
> > > >      >      >> i915 to the DRM scheduler; it isn't all that hard either.
> > > >      >      >> The free flow control on the ring (e.g. set job limit ==
> > > >      >      >> SIZE OF RING / MAX JOB SIZE) is really what sold me on
> > > >      >      >> this design.
> > > >      >
> > > >      >
> > > >      > You're not the only one to suggest supporting out-of-order
> > > >     completion.
> > > >      > However, it's tricky and breaks a lot of internal assumptions
> of
> > > the
> > > >      > scheduler. It also reduces functionality a bit because it can
> no
> > > >     longer
> > > >      > automatically rate-limit HW/FW queues which are often
> > > >     fixed-size.  (Ok,
> > > >      > yes, it probably could but it becomes a substantially harder
> > > >     problem.)
> > > >      >
> > > >      > It also seems like a worse mapping to me.  The goal here is to
> > > turn
> > > >      > submissions on a userspace-facing engine/queue into submissions
> > > >      > to a FW queue, sorting out any dma_fence dependencies.
> Matt's
> > > >      > description of saying this is a 1:1 mapping between
> sched/entity
> > > >     doesn't
> > > >      > tell the whole story. It's a 1:1:1 mapping between xe_engine,
> > > >      > gpu_scheduler, and GuC FW engine.  Why make it a 1:something:1
> > > >     mapping?
> > > >      > Why is that better?
> > > >
> > > >     As I have stated before, what I think would fit well for Xe is one
> > > >     drm_scheduler per engine class. In specific terms on our current
> > > >     hardware, one drm scheduler instance for render, compute, blitter,
> > > >     video and video enhance. Userspace contexts remain scheduler
> > > >     entities.
> > > >
> > > >
> > > > And this is where we fairly strongly disagree.  More in a bit.
> > > >
> > > >     That way you avoid the whole kthread/kworker story and you have
> it
> > > >     actually use the entity picking code in the scheduler, which may
> be
> > > >     useful when the backend is congested.
> > > >
> > > >
> > > > What back-end congestion are you referring to here?  Running out of
> FW
> > > > queue IDs?  Something else?
> > >
> > > CT channel, number of context ids.
> > >
> > > >
> > > >     Yes you have to solve the out of order problem so in my mind
> that is
> > > >     something to discuss. What the problem actually is (just TDR?),
> how
> > > >     tricky and why etc.
> > > >
> > > >     And yes you lose the handy LRCA ring buffer size management so
> you'd
> > > >     have to make those entities not runnable in some other way.
> > > >
> > > >     Regarding the argument you raise below - would any of that make
> the
> > > >     frontend / backend separation worse and why? Do you think it is
> less
> > > >     natural? If neither is true, then all that remains is that the extra
> > > >     work to support out of order completion of entities appears to have
> > > >     been discounted in favour of an easy but IMO inelegant option.
> > > >
> > > >
> > > > Broadly speaking, the kernel needs to stop thinking about GPU
> scheduling
> > > > in terms of scheduling jobs and start thinking in terms of scheduling
> > > > contexts/engines.  There is still some need for scheduling individual
> > > > jobs but that is only for the purpose of delaying them as needed to
> > > > resolve dma_fence dependencies.  Once dependencies are resolved, they
> > > > get shoved onto the context/engine queue and from there the kernel
> only
> > > > really manages whole contexts/engines.  This is a major architectural
> > > > shift, entirely different from the way i915 scheduling works.  It's
> also
> > > > different from the historical usage of DRM scheduler which I think is
> > > > why this all looks a bit funny.
> > > >
> > > > To justify this architectural shift, let's look at where we're
> headed.
> > > > In the glorious future...
> > > >
> > > >   1. Userspace submits directly to firmware queues.  The kernel has
> no
> > > > visibility whatsoever into individual jobs.  At most it can
> pause/resume
> > > > FW contexts as needed to handle eviction and memory management.
> > > >
> > > >   2. Because of 1, apart from handing out the FW queue IDs at the
> > > > beginning, the kernel can't really juggle them that much.  Depending
> on
> > > > FW design, it may be able to pause a client, give its IDs to another,
> > > > and then resume it later when IDs free up.  What it's not doing is
> > > > juggling IDs on a job-by-job basis like i915 currently is.
> > > >
> > > >   3. Long-running compute jobs may not complete for days.  This means
> > > > that memory management needs to happen in terms of pause/resume of
> > > > entire contexts/engines using the memory rather than based on waiting
> > > > for individual jobs to complete or pausing individual jobs until the
> > > > memory is available.
> > > >
> > > >   4. Synchronization happens via userspace memory fences (UMF) and
> the
> > > > kernel is mostly unaware of most dependencies and when a
> context/engine
> > > > is or is not runnable.  Instead, it keeps as many of them minimally
> > > > active (memory is available, even if it's in system RAM) as possible
> and
> > > > lets the FW sort out dependencies.  (There may need to be some
> facility
> > > > for sleeping a context until a memory change similar to futex() or
> > > > poll() for userspace threads.  There are some details TBD.)
> > > >
> > > > Are there potential problems that will need to be solved here?
> Yes.  Is
> > > > it a good design?  Well, Microsoft has been living in this future for
> > > > half a decade or better and it's working quite well for them.  It's
> also
> > > > the way all modern game consoles work.  It really is just Linux
> that's
> > > > stuck with the same old job model we've had since the monumental
> shift
> > > > to DRI2.
> > > >
> > > > To that end, one of the core goals of the Xe project was to make the
> > > > driver internally behave as close to the above model as possible
> while
> > > > keeping the old-school job model as a very thin layer on top.  As the
> > > > broader ecosystem problems (window-system support for UMF, for
> instance)
> > > > are solved, that layer can be peeled back.  The core driver will
> already
> > > > be ready for it.
> > > >
> > > > To that end, the point of the DRM scheduler in Xe isn't to schedule
> > > > jobs.  It's to resolve syncobj and dma-buf implicit sync dependencies
> > > > and stuff jobs into their respective context/engine queue once
> they're
> > > > ready.  All the actual scheduling happens in firmware and any
> scheduling
> > > > the kernel does to deal with contention, oversubscriptions, too many
> > > > contexts, etc. is between contexts/engines, not individual jobs.
> Sure,
> > > > the individual job visibility is nice, but if we design around it,
> we'll
> > > > never get to the glorious future.
> > > >
> > > > I really need to turn the above (with a bit more detail) into a blog
> > > > post.... Maybe I'll do that this week.
> > > >
> > > > In any case, I hope that provides more insight into why Xe is
> designed
> > > > the way it is and why I'm pushing back so hard on trying to make it
> more
> > > > of a "classic" driver as far as scheduling is concerned.  Are there
> > > > potential problems here?  Yes, that's why Xe has been labeled a
> > > > prototype.  Are such radical changes necessary to get to said
> glorious
> > > > future?  Yes, I think they are.  Will it be worth it?  I believe so.
> > >
> > > Right, that's all solid I think. My takeaway is that frontend priority
> > > sorting and that stuff isn't needed and that is okay. And that there
> are
> > > multiple options to maybe improve drm scheduler, like the aforementioned
> > > making it deal with out of order completion, or splitting it into
> > > functional components, or splitting frontend/backend as you suggested.
> > > For most of them the cost vs benefit is more or less not completely
> > > clear, nor is it clear how much effort was invested to look into them.
> > >
> > > One thing I missed from this explanation is how drm_scheduler per
> engine
> > > class interferes with the high level concepts. And I did not manage to
> > > pick up on what exactly is the TDR problem in that case. Maybe the two
> > > are one and the same.
> > >
> > > Bottom line is I still have the concern that conversion to kworkers has
> > > an opportunity to regress. Possibly more opportunity for some Xe use
> > > cases than to affect other vendors, since they would still be using per
> > > physical engine / queue scheduler instances.
> > >
> > > And to put my money where my mouth is I will try to put testing Xe
> > > inside the full blown ChromeOS environment into my team's plans. It would
> > > probably also be beneficial if Xe team could take a look at real world
> > > behaviour of the extreme transcode use cases too. If the stack is ready
> > > for that and all. It would be better to know earlier rather than later
> > > if there is a fundamental issue.
> > >
> > > For the patch at hand, and the cover letter, it certainly feels they
> > > would benefit from recording the past design discussion had with AMD
> > > folks, from explicitly copying other drivers, and from recording the
> > > theoretical pros and cons of threads vs unbound workers as I have
> > > tried to highlight them.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
>

[-- Attachment #2: Type: text/html, Size: 48571 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 21:47                     ` Daniel Vetter
@ 2023-01-12  9:10                       ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12  9:10 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

Hi Daniel,

On Wed, 11 Jan 2023 22:47:02 +0100
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > Hi Daniel,
> >
> > On Mon, 9 Jan 2023 21:40:21 +0100
> > Daniel Vetter <daniel@ffwll.ch> wrote:
> >  
> > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > Hi Jason,
> > > >
> > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > >  
> > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > wrote:
> > > > >  
> > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > >  
> > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > >  
> > > > > > > > > Hello Matthew,
> > > > > > > > >
> > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > >  
> > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to
> > > > > > > > > > have a 1 to 1 mapping between a drm_gpu_scheduler and
> > > > > > > > > > drm_sched_entity. At first this seems a bit odd but let us
> > > > > > > > > > explain the reasoning below.
> > > > > > > > > >
> > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity
> > > > > > > > > > is not guaranteed to match completion order even if targeting
> > > > > > > > > > the same hardware engine. This is because in XE we have a
> > > > > > > > > > firmware scheduler, the GuC, which is allowed to reorder,
> > > > > > > > > > timeslice, and preempt submissions. If using a shared
> > > > > > > > > > drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> > > > > > > > > > falls apart as the TDR expects submission order == completion
> > > > > > > > > > order. Using a dedicated drm_gpu_scheduler per
> > > > > > > > > > drm_sched_entity solves this problem.
> > > > > > > > >
> > > > > > > > > Oh, that's interesting. I've been trying to solve the same
> > > > > > > > > sort of issues to support Arm's new Mali GPU which is relying
> > > > > > > > > on a FW-assisted scheduling scheme (you give the FW N streams
> > > > > > > > > to execute, and it does the scheduling between those N command
> > > > > > > > > streams, the kernel driver does timeslice scheduling to update
> > > > > > > > > the command streams passed to the FW). I must admit I gave up
> > > > > > > > > on using drm_sched at some point, mostly because the
> > > > > > > > > integration with drm_sched was painful, but also because I
> > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > timeslice-oriented scheduling model wasn't really future
> > > > > > > > > proof. Giving drm_sched_entity exclusive access to a
> > > > > > > > > drm_gpu_scheduler might help for a few things (didn't think it
> > > > > > > > > through yet), but I feel it's coming short on other aspects we
> > > > > > > > > have to deal with on Arm GPUs.
> > > > > > > >
> > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use
> > > > > > > > the drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > > >
> > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > dependency tracking and releasing jobs to be scheduled when
> > > > > > dependencies are met, and lastly as a tracking mechanism for
> > > > > > in-flight jobs that need to be cleaned up if an error occurs. It
> > > > > > doesn't actually do any scheduling aside from the most basic level
> > > > > > of not overflowing the submission ring buffer. In this sense, a 1
> > > > > > to 1 relationship between entity and scheduler fits quite well.
> > > > > >  
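To make the 1:1 arrangement concrete, here is a minimal init-time sketch.
Everything named my_* or MY_* below is invented, and the drm_sched_init()/
drm_sched_entity_init() signatures are only approximately those of kernels
from this period, so treat it as an illustration rather than actual Xe code:

#include <drm/gpu_scheduler.h>

/* Hooks (run_job, timedout_job, free_job) not shown. */
static const struct drm_sched_backend_ops my_sched_ops;

#define MY_RING_SLOTS 256	/* made-up submission ring depth */

struct my_engine {
	struct drm_gpu_scheduler sched;	/* dedicated to one entity */
	struct drm_sched_entity entity;
};

static int my_engine_create(struct my_engine *e, struct device *dev)
{
	struct drm_gpu_scheduler *sched_list[] = { &e->sched };
	int ret;

	/* hw_submission == ring depth: the only "scheduling" drm_sched
	 * does here is flow control on the submission ring.
	 */
	ret = drm_sched_init(&e->sched, &my_sched_ops, MY_RING_SLOTS,
			     0, MAX_SCHEDULE_TIMEOUT, NULL, NULL,
			     "my-engine", dev);
	if (ret)
		return ret;

	/* This entity is the scheduler's only client: 1:1 mapping. */
	return drm_sched_entity_init(&e->entity, DRM_SCHED_PRIORITY_NORMAL,
				     sched_list, 1, NULL);
}

With this shape, everything drm_sched normally does across entities (run
queues, priority sorting) degenerates to a no-op, which is exactly the
point being made above.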
> > > > >
> > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > Mali. Roughly, there's about three categories here:
> > > > >
> > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > categorization.)
> > > > >
> > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > >
> > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > windows.
> > > > >
> > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > here and will require different designs. For Xe, we really are just using
> > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > >
> > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > eventually want to break that off into a separate discussion and not litter
> > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > >
> > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > (they can even be fixed-size if that helps) and then have a kthread
> > > > > which juggles those and feeds the limited FW queues.  In the case
> > > > > where you have few enough active contexts to fit them all in FW, I do
> > > > > think it's best to have them all active in FW and let it schedule. But
> > > > > with only 31, you need to be able to juggle if you run out.
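A rough data-structure sketch of that juggling arrangement may help; every
name here is invented, with no claim that it matches any real driver:

#include <linux/list.h>
#include <linux/spinlock.h>

#define NUM_FW_SLOTS 31	/* example: the low slot count discussed here */

struct ctx_queue {
	struct list_head runnable_link;	/* on the juggler's runnable list */
	struct list_head pending_jobs;	/* released by drm_sched, not in FW yet */
	int fw_slot;			/* slot index, or -1 if not resident */
};

struct juggler {
	spinlock_t lock;
	struct list_head runnable;	/* queues waiting for a FW slot */
	struct ctx_queue *slots[NUM_FW_SLOTS];	/* currently resident queues */
};

The drm_sched frontends push into pending_jobs; the kthread (or workqueue,
as described below) owns the slots[] table and the rotation policy.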
> > > >
> > > > That's more or less what I do right now, except I don't use the
> > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > ticks being handled with a delayed-work (tick happening every X
> > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > generalize that part, but the dep-tracking logic implemented by
> > > > drm_sched looked like something I could easily re-use, hence my
> > > > interest in Xe's approach.  
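For reference, the ordered-workqueue plus delayed-work pattern described
above looks roughly like this; all helper state and names are invented,
not the actual pancsf code:

#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/atomic.h>

#define TICK_PERIOD_MS 10	/* made-up timeslice period */

static struct workqueue_struct *sched_wq;	/* ordered: ticks never overlap */
static struct delayed_work sched_tick;
static atomic_t nr_waiting;	/* queues waiting for a FW slot */

static void sched_tick_fn(struct work_struct *work)
{
	/* rotate runnable queues onto FW slots here (the HW/FW-specific
	 * part; see the juggler sketches elsewhere in this thread) */

	/* re-arm only while someone is still waiting for a slot */
	if (atomic_read(&nr_waiting))
		mod_delayed_work(sched_wq, &sched_tick,
				 msecs_to_jiffies(TICK_PERIOD_MS));
}

static int sched_init(void)
{
	sched_wq = alloc_ordered_workqueue("fw-sched", 0);
	if (!sched_wq)
		return -ENOMEM;
	INIT_DELAYED_WORK(&sched_tick, sched_tick_fn);
	return 0;
}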
> > >
> > > So another option for these few fw queue slots schedulers would be to
> > > treat them as vram and enlist ttm.
> > >
> > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > handle idr (or xarray or whatever you want) and then help you with all the
> > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > then also preferentially "evict" low-priority queues you pretty much have
> > > the perfect thing.
> > >
> > > Note that GuC with SR-IOV splits up the id space, and together with
> > > some restrictions due to multi-engine contexts, media needs might also
> > > require all of this.
> > >
> > > If you're balking at the idea of enlisting ttm just for fw queue
> > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > index allocation. Might be worth it to instead lift that into some sched
> > > helper code.  
> >
> > Would you mind pointing me to the amdgpu code you're mentioning here?
> > Still have a hard time seeing what TTM has to do with scheduling, but I
> > also don't know much about TTM, so I'll keep digging.  
> 
> ttm is about moving stuff in&out of a limited space and gives you some
> nice tooling for pipelining it all. It doesn't care whether that space
> is vram or some limited id space. vmwgfx used ttm as an id manager
> iirc.
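For what it's worth, the bare id-allocation half of this, without the
pipelining and eviction TTM would add, is tiny with the kernel's ida; a
sketch with an arbitrary slot count:

#include <linux/idr.h>

#define NUM_FW_SLOTS 8	/* arbitrary; matches the low end discussed below */

static DEFINE_IDA(fw_slot_ida);

/* Returns a free slot index in [0, NUM_FW_SLOTS), or -ENOSPC when all
 * slots are taken, at which point the caller has to evict something.
 */
static int fw_slot_get(void)
{
	return ida_alloc_max(&fw_slot_ida, NUM_FW_SLOTS - 1, GFP_KERNEL);
}

static void fw_slot_put(int slot)
{
	ida_free(&fw_slot_ida, slot);
}

What ida cannot give you is the pipelining and preferential eviction; that
is the part where TTM (or the amdgpu vmid code) comes in.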

Ok.

> 
> > > Either way there's two imo rather solid approaches available to sort this
> > > out. And once you have that, then there shouldn't be any big difference in
> > > driver design between fw with defacto unlimited queue ids, and those with
> > > severe restrictions in number of queues.  
> >
> > Honestly, I don't think there's much difference between those two cases
> > already. There's just a bunch of additional code to schedule queues on
> > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > driver specific. The job queuing front-end pretty much achieves what
> > drm_sched does already: queuing jobs to entities, checking deps,
> > submitting jobs to HW (in our case, writing to the command stream ring
> > buffer). Things start to differ after that point: once a scheduling
> > entity has pending jobs, we add it to one of the runnable queues (one
> > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > re-evaluate, if needed.
> >
> > I'm all for using generic code when it makes sense, even if that means
> > adding this common code when it doesn't exist, but I don't want to be
> > dragged into some major refactoring that might take years to land.
> > Especially if pancsf is the first
> > FW-assisted-scheduler-with-few-FW-slot driver.  
> 
> I don't see where there's a major refactoring that you're getting dragged into?

Oh, no, I'm not saying this is the case just yet, just wanted to make
sure we're on the same page :-).

> 
> Yes there's a huge sprawling discussion right now, but I think that's
> just largely people getting confused.

I definitely am :-).

> 
> Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> of code. See the amdgpu_vmid_grab stuff for the simplest starting
> point.

Ok, thanks for the pointers. I'll have a look and see how I could use
that. I guess that's about getting access to the FW slots with some
sort of priority+FIFO ordering guarantees given by TTM. If that's the
case, I'll have to think about it, because that's a major shift from
what we're doing now, and I'm afraid this could lead to starving
non-resident entities if all resident entities keep receiving new jobs
to execute. Unless we put some sort of barrier when giving access to a
slot, so we evict the entity when it's done executing the stuff it had
when it was given access to this slot. But then, again, there are other
constraints to take into account for the Arm Mali CSF case:

- it's more efficient to update all FW slots at once, because each
  update of a slot might require updating priorities of the other slots
  (FW mandates unique slot priorities, and those priorities depend on
  the entity priority/queue-ordering)
- context/FW slot switches have a non-negligible cost (the FW needs to
  suspend the context and save the state every time there is such a
  switch), so limiting the number of FW slot updates might prove
  important (see the batched-update sketch below)
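To illustrate the batching point, a sketch of such a batched update,
reusing the invented juggler types from earlier plus a hypothetical
slot_prio field and fw_commit_slot_table() helper, and assuming slots[] is
kept sorted by entity priority:

/* Hypothetical: derive unique, descending slot priorities in one pass,
 * then push the whole table to the FW in a single update.
 */
static void update_all_fw_slots(struct juggler *j)
{
	int prio = NUM_FW_SLOTS - 1;
	int i;

	for (i = 0; i < NUM_FW_SLOTS; i++) {
		struct ctx_queue *q = j->slots[i];	/* sorted by entity prio */

		if (!q)
			continue;
		q->slot_prio = prio--;	/* FW mandates unique priorities */
	}

	/* One FW transaction for the whole table: each per-slot update
	 * can force a costly context suspend, so batching matters.
	 */
	fw_commit_slot_table(j);
}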

> 
> And also yes a scheduler frontend for dependency sorting shouldn't
> really be that big a thing, so there's not going to be huge amounts of
> code sharing in the end.

Agreed.

> It's the conceptual sharing, and sharing
> stuff like drm_sched_entity to eventually build some cross-driver gpu
> context stuff on top that really is going to matter.

And I agree with that too.

> 
> Also like I mentioned, at least in some cases i915-guc might also have
> a need for fw scheduler slot allocation for a bunch of running things.

Ok.

> 
> Finally I'm a bit confused why you're building a time sharing
> scheduler in the kernel if you have one in fw already. Or do I get
> that part wrong?

It's here to overcome the low number of FW slots (which is as low as 8
on the HW I'm testing on). If you don't do time sharing scheduling
kernel-side, you have no guarantee of fairness, since one could keep
queuing jobs to an entity/queue, making it permanently resident,
without giving a chance to non-resident entities/queues to ever run. To
sum up, the scheduler is not entirely handled by the FW, it's a mixed
design, where part of it is in the FW (scheduling between currently
active entities passed to the FW), and the other part is in the kernel
driver (rotating runnable entities on the limited number of FW slots we
have). But overall, it shouldn't make a difference compared to Xe. The
fact that some of the scheduling happens kernel-side is completely
opaque to the drm_sched_entity frontend if we go the Xe way (one
drm_gpu_scheduler per drm_sched_entity, real scheduling is handled by
some black box, either entirely in the FW, or with shared
responsibility between FW and kernel).
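The kernel-side half of that mixed design, the rotation step, reduces to
something like the sketch below, continuing the invented juggler types;
pick_victim(), fw_slot_suspend() and fw_slot_resume() are placeholders for
the real, expensive FW interactions, and a real driver would not make
those FW calls under the spinlock:

/* Hypothetical: evict one resident entity and install the first
 * runnable one, so no queue can monopolize a FW slot forever.
 */
static void juggler_rotate(struct juggler *j)
{
	struct ctx_queue *in, *out;

	spin_lock(&j->lock);
	in = list_first_entry_or_null(&j->runnable,
				      struct ctx_queue, runnable_link);
	out = in ? pick_victim(j) : NULL;	/* e.g. longest-resident */
	if (in && out) {
		fw_slot_suspend(out);		/* costly: saves context state */
		in->fw_slot = out->fw_slot;
		out->fw_slot = -1;
		j->slots[in->fw_slot] = in;
		list_del_init(&in->runnable_link);
		fw_slot_resume(in);		/* make it schedulable by FW */
	}
	spin_unlock(&j->lock);
}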

Regards,

Boris


* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12  9:10                       ` Boris Brezillon
@ 2023-01-12  9:32                         ` Daniel Vetter
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-01-12  9:32 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> Hi Daniel,
> 
> On Wed, 11 Jan 2023 22:47:02 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > <boris.brezillon@collabora.com> wrote:
> > >
> > > Hi Daniel,
> > >
> > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > >  
> > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > > Hi Jason,
> > > > >
> > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > >  
> > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > wrote:
> > > > > >  
> > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > >  
> > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > >  
> > > > > > > > > > Hello Matthew,
> > > > > > > > > >
> > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > >  
> > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to
> > > > > > > > > > > have a 1 to 1 mapping between a drm_gpu_scheduler and
> > > > > > > > > > > drm_sched_entity. At first this seems a bit odd but let us
> > > > > > > > > > > explain the reasoning below.
> > > > > > > > > > >
> > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity
> > > > > > > > > > > is not guaranteed to match completion order even if targeting
> > > > > > > > > > > the same hardware engine. This is because in XE we have a
> > > > > > > > > > > firmware scheduler, the GuC, which is allowed to reorder,
> > > > > > > > > > > timeslice, and preempt submissions. If using a shared
> > > > > > > > > > > drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> > > > > > > > > > > falls apart as the TDR expects submission order == completion
> > > > > > > > > > > order. Using a dedicated drm_gpu_scheduler per
> > > > > > > > > > > drm_sched_entity solves this problem.
> > > > > > > > > >
> > > > > > > > > > Oh, that's interesting. I've been trying to solve the same
> > > > > > > > > > sort of issues to support Arm's new Mali GPU which is relying
> > > > > > > > > > on a FW-assisted scheduling scheme (you give the FW N streams
> > > > > > > > > > to execute, and it does the scheduling between those N command
> > > > > > > > > > streams, the kernel driver does timeslice scheduling to update
> > > > > > > > > > the command streams passed to the FW). I must admit I gave up
> > > > > > > > > > on using drm_sched at some point, mostly because the
> > > > > > > > > > integration with drm_sched was painful, but also because I
> > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > timeslice-oriented scheduling model wasn't really future
> > > > > > > > > > proof. Giving drm_sched_entity exclusive access to a
> > > > > > > > > > drm_gpu_scheduler might help for a few things (didn't think it
> > > > > > > > > > through yet), but I feel it's coming short on other aspects we
> > > > > > > > > > have to deal with on Arm GPUs.
> > > > > > > > >
> > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use
> > > > > > > > > the drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > > > >
> > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > dependency tracking and releasing jobs to be scheduled when
> > > > > > > dependencies are met, and lastly as a tracking mechanism for
> > > > > > > in-flight jobs that need to be cleaned up if an error occurs. It
> > > > > > > doesn't actually do any scheduling aside from the most basic level
> > > > > > > of not overflowing the submission ring buffer. In this sense, a 1
> > > > > > > to 1 relationship between entity and scheduler fits quite well.
> > > > > > >  
> > > > > >
> > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > Mali. Roughly, there's about three categories here:
> > > > > >
> > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > categorization.)
> > > > > >
> > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > >
> > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > windows.
> > > > > >
> > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > >
> > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > >
> > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > (they can even be fixed-size if that helps) and then have a kthread
> > > > > > which juggles those and feeds the limited FW queues.  In the case
> > > > > > where you have few enough active contexts to fit them all in FW, I do
> > > > > > think it's best to have them all active in FW and let it schedule. But
> > > > > > with only 31, you need to be able to juggle if you run out.
> > > > >
> > > > > That's more or less what I do right now, except I don't use the
> > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > interest in Xe's approach.  
> > > >
> > > > So another option for these few fw queue slots schedulers would be to
> > > > treat them as vram and enlist ttm.
> > > >
> > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > then also preferentially "evict" low-priority queues you pretty much have
> > > > the perfect thing.
> > > >
> > > > Note that GuC with SR-IOV splits up the id space, and together with
> > > > some restrictions due to multi-engine contexts, media needs might also
> > > > require all of this.
> > > >
> > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > index allocation. Might be worth it to instead lift that into some sched
> > > > helper code.  
> > >
> > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > also don't know much about TTM, so I'll keep digging.  
> > 
> > ttm is about moving stuff in&out of a limited space and gives you some
> > nice tooling for pipelining it all. It doesn't care whether that space
> > is vram or some limited id space. vmwgfx used ttm as an id manager
> > iirc.
> 
> Ok.
> 
> > 
> > > > Either way there's two imo rather solid approaches available to sort this
> > > > out. And once you have that, then there shouldn't be any big difference in
> > > > driver design between fw with defacto unlimited queue ids, and those with
> > > > severe restrictions in number of queues.  
> > >
> > > Honestly, I don't think there's much difference between those two cases
> > > already. There's just a bunch of additional code to schedule queues on
> > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > driver specific. The job queuing front-end pretty much achieves what
> > > drm_sched does already: queuing jobs to entities, checking deps,
> > > submitting jobs to HW (in our case, writing to the command stream ring
> > > buffer). Things start to differ after that point: once a scheduling
> > > entity has pending jobs, we add it to one of the runnable queues (one
> > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > re-evaluate, if needed.
> > >
> > > I'm all for using generic code when it makes sense, even if that means
> > > adding this common code when it doesn't exist, but I don't want to be
> > > dragged into some major refactoring that might take years to land.
> > > Especially if pancsf is the first
> > > FW-assisted-scheduler-with-few-FW-slot driver.  
> > 
> > I don't see where there's a major refactoring that you're getting dragged into?
> 
> Oh, no, I'm not saying this is the case just yet, just wanted to make
> sure we're on the same page :-).
> 
> > 
> > Yes there's a huge sprawling discussion right now, but I think that's
> > just largely people getting confused.
> 
> I definitely am :-).
> 
> > 
> > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > point.
> 
> Ok, thanks for the pointers. I'll have a look and see how I could use
> that. I guess that's about getting access to the FW slots with some
> sort of priority+FIFO ordering guarantees given by TTM. If that's the
> case, I'll have to think about it, because that's a major shift from
> what we're doing now, and I'm afraid this could lead to starving
> non-resident entities if all resident entities keep receiving new jobs
> to execute. Unless we put some sort of barrier when giving access to a
> slot, so we evict the entity when it's done executing the stuff it had
> when it was given access to this slot. But then, again, there are other
> constraints to take into account for the Arm Mali CSF case:
> 
> - it's more efficient to update all FW slots at once, because each
>   update of a slot might require updating priorities of the other slots
>   (FW mandates unique slot priorities, and those priorities depend on
>   the entity priority/queue-ordering)
> - context/FW slot switches have a non-negligible cost (the FW needs to
>   suspend the context and save the state every time there is such a
>   switch), so limiting the number of FW slot updates might prove
>   important

I frankly think you're overworrying. When you have 31+ contexts running at
the same time, you have bigger problems. At that point there's two
use-cases:
1. system is overloaded, the user will reach for reset button anyway
2. temporary situation, all you have to do is be roughly fair enough to get
   through it before case 1 happens.
 
Trying to write a perfect scheduler for this before we have actual
benchmarks that justify the effort seems like pretty serious overkill.
That's why I think the simplest solution is the one we should have:

- drm/sched frontend. If you get into slot exhaustion that alone will
  ensure enough fairness

- LRU list of slots, with dma_fence so you can pipeline/batch up changes
  as needed (but I honestly wouldn't worry about the batching before
  you've shown an actual need for this in some benchmark/workload, even
  piglit shouldn't have this many things running concurrently I think, you
  don't have that many cpu cores). Between drm/sched and the lru you will
  have an emergent scheduler that cycles through all runnable gpu jobs.

- If you want to go fancy, have eviction tricks like skipping currently
  still active gpu context with higher priority than the one that you need
  to find a slot for.

- You don't need time slicing in this, not even for compute. Compute is
  done with preempt context fences; if you give them a minimum scheduling
  quantum you'll have a very basic round-robin scheduler as an emergent
  thing.

Any workload where it matters will be scheduled by the fw directly, with
drm/sched only being the dma_fence dependency sorter. My take is that if
you spend more than a hundred or so lines on slot allocation logic
(excluding the hw code to load/unload a slot) you're probably doing some
serious overengineering.
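Concretely, the LRU-of-slots core fits well within that hundred-line
budget; a sketch with invented names (including fw_unload_context()), no
dma_fence pipelining, and the simplification, fine for a sketch only, of
doing the unload under the lock:

#include <linux/list.h>
#include <linux/spinlock.h>

struct fw_slot {
	struct list_head lru_link;	/* least recently used at the head */
	void *owner;			/* context currently loaded, or NULL */
};

static LIST_HEAD(slot_lru);		/* pre-populated with all slots at init */
static DEFINE_SPINLOCK(slot_lock);

/* Grab a slot for @ctx: free slots sit at the head, otherwise the least
 * recently used owner gets evicted. Because drm/sched feeds this in
 * submission order, cycling the LRU gives an emergent round-robin across
 * runnable contexts.
 */
static struct fw_slot *slot_grab(void *ctx)
{
	struct fw_slot *slot;

	spin_lock(&slot_lock);
	slot = list_first_entry(&slot_lru, struct fw_slot, lru_link);
	if (slot->owner && slot->owner != ctx)
		fw_unload_context(slot);	/* invented: suspend + save */
	slot->owner = ctx;
	list_move_tail(&slot->lru_link, &slot_lru);
	spin_unlock(&slot_lock);
	return slot;
}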

> > And also yes a scheduler frontend for dependency sorting shouldn't
> > really be that big a thing, so there's not going to be huge amounts of
> > code sharing in the end.
> 
> Agreed.
> 
> > It's the conceptual sharing, and sharing
> > stuff like drm_sched_entity to eventually build some cross-driver gpu
> > context stuff on top that really is going to matter.
> 
> And I agree with that too.
> 
> > 
> > Also like I mentioned, at least in some cases i915-guc might also have
> > a need for fw scheduler slot allocation for a bunch of running things.
> 
> Ok.
> 
> > 
> > Finally I'm a bit confused why you're building a time sharing
> > scheduler in the kernel if you have one in fw already. Or do I get
> > that part wrong?
> 
> It's here to overcome the low number of FW slots (which is as low as 8
> on the HW I'm testing on). If you don't do time sharing scheduling
> kernel-side, you have no guarantee of fairness, since one could keep
> queuing jobs to an entity/queue, making it permanently resident,
> without giving a chance to non-resident entities/queues to ever run. To
> sum up, the scheduler is not entirely handled by the FW, it's a mixed
> design, where part of it is in the FW (scheduling between currently
> active entities passed to the FW), and the other part is in the kernel
> driver (rotating runnable entities on the limited number of FW slots we
> have). But overall, it shouldn't make a difference compared to Xe. The
> fact that some of the scheduling happens kernel-side is completely
> opaque to the drm_sched_entity frontend if we go the Xe way (one
> drm_gpu_scheduler per drm_sched_entity, real scheduling is handled by
> some black box, either entirely in the FW, or with shared
> responsibility between FW and kernel).

See above. I don't think you need three schedulers (dma_fence sorting
frontend, kernel round robin, fw round robin). I'm pretty sure you do
not _want_ 3 schedulers. And if you just take the 3 pieces above, you will
have a scheduler that's Fair Enough (tm) even when you have more than 31
contexts.

I would frankly not even be surprised if you can get away with full
stalls, so not even the dma_fence pipelining is needed. Even if you stall
out a handful of contexts, there should still be 20+ available for the fw to
schedule and keep the gpu busy. After all, this is still a gpu, there's
only 2 things you need:
- fair enough to avoid completely stalling out some app and the user
  reaching the reset button
- throughput. As long as you can keep enough runnable slots for the fw to
  schedule, it really shouldn't matter how shoddily you push in new stuff.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-12  9:32                         ` Daniel Vetter
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-01-12  9:32 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: intel-gfx, dri-devel, Daniel Vetter

On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> Hi Daniel,
> 
> On Wed, 11 Jan 2023 22:47:02 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > <boris.brezillon@collabora.com> wrote:
> > >
> > > Hi Daniel,
> > >
> > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > >  
> > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > > Hi Jason,
> > > > >
> > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > >  
> > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > wrote:
> > > > > >  
> > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > >  
> > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > >  
> > > > > > > > > > Hello Matthew,
> > > > > > > > > >
> > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > >  
> > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first  
> > > > > > > this  
> > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > >
> > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > > guaranteed to be the same completion even if targeting the same  
> > > > > > > hardware  
> > > > > > > > > > > engine. This is because in XE we have a firmware scheduler, the  
> > > > > > > GuC,  
> > > > > > > > > > > which allowed to reorder, timeslice, and preempt submissions. If a  
> > > > > > > using  
> > > > > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR  
> > > > > > > falls  
> > > > > > > > > > > apart as the TDR expects submission order == completion order.  
> > > > > > > Using a  
> > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this  
> > > > > > > problem.  
> > > > > > > > > >
> > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a  
> > > > > > > FW-assisted  
> > > > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > > > because the integration with drm_sched was painful, but also because  
> > > > > > > I  
> > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > timeslice-oriented scheduling model wasn't really future proof.  
> > > > > > > Giving  
> > > > > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably  
> > > > > > > might  
> > > > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.  
> > > > > > > > >
> > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue  
> > > > > > >
> > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > are met, and lastly a tracking mechanism of inflights jobs that need to
> > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > scheduler fits quite well.
> > > > > > >  
> > > > > >
> > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > Mali. Roughly, there's about three categories here:
> > > > > >
> > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > categorization.)
> > > > > >
> > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > >
> > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > windows.
> > > > > >
> > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > >
> > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > >
> > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > juggles those feeds the limited FW queues.  In the case where you have few
> > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > able to juggle if you run out.  
> > > > >
> > > > > That's more or less what I do right now, except I don't use the
> > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > interest in Xe's approach.  
> > > >
> > > > So another option for these few fw queue slots schedulers would be to
> > > > treat them as vram and enlist ttm.
> > > >
> > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > then also preferentially "evict" low-priority queus you pretty much have
> > > > the perfect thing.
> > > >
> > > > Note that GuC with sriov splits up the id space and together with some
> > > > restrictions due to multi-engine contexts media needs might also need this
> > > > all.
> > > >
> > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > index allocation. Might be worth it to instead lift that into some sched
> > > > helper code.  
> > >
> > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > also don't know much about TTM, so I'll keep digging.  
> > 
> > ttm is about moving stuff in&out of a limited space and gives you some
> > nice tooling for pipelining it all. It doesn't care whether that space
> > is vram or some limited id space. vmwgfx used ttm as an id manager
> > iirc.
> 
> Ok.
> 
> > 
> > > > Either way there's two imo rather solid approaches available to sort this
> > > > out. And once you have that, then there shouldn't be any big difference in
> > > > driver design between fw with defacto unlimited queue ids, and those with
> > > > severe restrictions in number of queues.  
> > >
> > > Honestly, I don't think there's much difference between those two cases
> > > already. There's just a bunch of additional code to schedule queues on
> > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > driver specific. The job queuing front-end pretty much achieves what
> > > drm_sched does already: queuing job to entities, checking deps,
> > > submitting job to HW (in our case, writing to the command stream ring
> > > buffer). Things start to differ after that point: once a scheduling
> > > entity has pending jobs, we add it to one of the runnable queues (one
> > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > re-evaluate, if needed.
> > >
> > > I'm all for using generic code when it makes sense, even if that means
> > > adding this common code when it doesn't exist, but I don't want to be
> > > dragged into some major refactoring that might take years to land.
> > > Especially if pancsf is the first
> > > FW-assisted-scheduler-with-few-FW-slot driver.  
> > 
> > I don't see where there's a major refactoring that you're getting dragged into?
> 
> Oh, no, I'm not saying this is the case just yet, just wanted to make
> sure we're on the same page :-).
> 
> > 
> > Yes there's a huge sprawling discussion right now, but I think that's
> > just largely people getting confused.
> 
> I definitely am :-).
> 
> > 
> > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > point.
> 
> Ok, thanks for the pointers. I'll have a look and see how I could use
> that. I guess that's about getting access to the FW slots with some
> sort of priority+FIFO ordering guarantees given by TTM. If that's the
> case, I'll have to think about it, because that's a major shift from
> what we're doing now, and I'm afraid this could lead to starving
> non-resident entities if all resident entities keep receiving new jobs
> to execute. Unless we put some sort of barrier when giving access to a
> slot, so we evict the entity when it's done executing the stuff it had
> when it was given access to this slot. But then, again, there are other
> constraints to take into account for the Arm Mali CSF case:
> 
> - it's more efficient to update all FW slots at once, because each
>   update of a slot might require updating priorities of the other slots
>   (FW mandates unique slot priorities, and those priorities depend on
>   the entity priority/queue-ordering)
> - context/FW slot switches have a non-negligible cost (FW needs to
>   suspend the context and save the state every time there is such a
>   switch), so limiting the number of FW slot updates might prove
>   important

I frankly think you're overworrying. When you have 31+ contexts running at
the same time, you have bigger problems. At that point there's two
use-cases:
1. system is overloaded, the user will reach for the reset button anyway
2. temporary situation, all you have to do is be roughly fair enough to get
   through it before case 1 happens.
 
Trying to write a perfect scheduler for this before we have actual
benchmarks that justify the effort seems like pretty serious overkill.
That's why I think the simplest solution is the one we should have:

- drm/sched frontend. If you get into slot exhaustion, that alone will
  ensure enough fairness.

- LRU list of slots, with dma_fence so you can pipeline/batch up changes
  as needed (but I honestly wouldn't worry about the batching before
  you've shown an actual need for this in some benchmark/workload; even
  piglit shouldn't have this many things running concurrently, I think,
  since you don't have that many cpu cores). Between drm/sched and the
  lru you will have an emergent scheduler that cycles through all
  runnable gpu jobs.

- If you want to go fancy, have eviction tricks like skipping currently
  still active gpu contexts with higher priority than the one that you need
  to find a slot for.

- You don't need time slicing in this, not even for compute. Compute is
  done with preempt context fences; if you give them a minimum scheduling
  quantum you'll have a very basic round robin scheduler as an emergent
  thing.

Any workload where it matters will be scheduled by the fw directly, with
drm/sched only being the dma_fence dependency sorter. My take is that if
you spend more than a hundred or so lines on slot allocation logic
(excluding the hw code to load/unload a slot) you're probably doing some
serious overengineering.
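
Concretely, something like the below untested sketch is about all the
slot logic I'd expect (all names made up, locking simplified, fence
refcounting and error handling elided, nothing driver specific):

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define NUM_FW_SLOTS 31

struct my_entity;

/* HW/FW specific bits, made up for the sketch. */
void my_hw_load_slot(struct fw_slot *slot, struct my_entity *e);
void my_hw_unload_slot(struct fw_slot *slot);

struct fw_slot {
	struct my_entity *owner;	/* NULL when the slot is free */
	struct dma_fence *last_fence;	/* last job queued on this slot,
					 * non-NULL whenever owner is set */
	struct list_head lru_link;	/* on slot_mgr.lru, LRU first */
	int idx;
};

struct slot_mgr {
	struct fw_slot slots[NUM_FW_SLOTS];
	struct list_head lru;		/* free/idle slots bubble to the head */
	spinlock_t lock;
};

/*
 * Find a slot for @e: take a free one, else evict the least recently
 * used slot whose last job already completed.
 */
static struct fw_slot *slot_mgr_grab(struct slot_mgr *mgr,
				     struct my_entity *e)
{
	struct fw_slot *slot, *victim = NULL;

	spin_lock(&mgr->lock);
	list_for_each_entry(slot, &mgr->lru, lru_link) {
		if (!slot->owner ||
		    dma_fence_is_signaled(slot->last_fence)) {
			victim = slot;
			break;
		}
	}
	if (victim) {
		if (victim->owner)
			my_hw_unload_slot(victim);
		victim->owner = e;
		my_hw_load_slot(victim, e);
		/* Most recently used slots live at the tail. */
		list_move_tail(&victim->lru_link, &mgr->lru);
	}
	spin_unlock(&mgr->lock);

	/* NULL means all slots busy, the job stays pending in drm/sched. */
	return victim;
}

Anything fancier (batching the slot updates, priority-aware eviction)
can be bolted on later if a benchmark ever shows the need.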

> > And also yes a scheduler frontend for dependency sorting shouldn't
> > really be a that big thing, so there's not going to be huge amounts of
> > code sharing in the end.
> 
> Agreed.
> 
> > It's the conceptual sharing, and sharing
> > stuff like drm_sched_entity to eventual build some cross driver gpu
> > context stuff on top that really is going to matter.
> 
> And I agree with that too.
> 
> > 
> > Also like I mentioned, at least in some cases i915-guc might also have
> > a need for fw scheduler slot allocation for a bunch of running things.
> 
> Ok.
> 
> > 
> > Finally I'm a bit confused why you're building a time sharing
> > scheduler in the kernel if you have one in fw already. Or do I get
> > that part wrong?
> 
> It's here to overcome the low number of FW-slot (which is as low as 8
> on the HW I'm testing on). If you don't do time sharing scheduling
> kernel-side, you have no guarantee of fairness, since one could keep
> queuing jobs to an entity/queue, making it permanently resident,
> without giving a chance to non-resident entities/queues to ever run. To
> sum-up, the scheduler is not entirely handled by the FW, it's a mixed
> design, where part of it is in the FW (scheduling between currently
> active entities passed to the FW), and the other part in the kernel
> driver (rotating runnable entities on the limited amount of FW slots we
> have). But overall, it shouldn't make a difference compared to Xe. The
> fact some of the scheduling happens kernel-side is completely opaque to
> the drm_sched_entity frontend if we go the Xe way (one
> drm_gpu_scheduler per drm_sched_entity, real scheduling is handled by
> some black box, either entirely in the FW, or with shared
> responsibility between FW and kernel).

See above. I don't think you need three schedulers (dma_fence sorting
frontend, kernel round robin, fw round robin) here. I'm pretty sure you
do not _want_ 3 schedulers. And if you just take the 3 pieces above, you
will have a scheduler that's Fair Enough (tm) even when you have more
than 31 contexts.

I would frankly not even be surprised if you can get away with full
stalls, so not even the dma_fence pipelining is needed. Even if you stall
out a handful of contexts, there should still be 20+ available for the fw
to schedule and keep the gpu busy. After all, this is still a gpu; there
are only 2 things you need:
- fair enough to avoid completely stalling out some app and the user
  reaching the reset button
- throughput. As long as you can keep enough runnable slots for the fw to
  schedule, it really shouldn't matter how shoddily you push in new stuff.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-05 21:27   ` Matthew Brost
@ 2023-01-12  9:54       ` Lucas De Marchi
  0 siblings, 0 replies; 161+ messages in thread
From: Lucas De Marchi @ 2023-01-12  9:54 UTC (permalink / raw)
  To: Matthew Brost; +Cc: Tvrtko Ursulin, intel-gfx, dri-devel

On Thu, Jan 05, 2023 at 09:27:57PM +0000, Matthew Brost wrote:
>On Tue, Jan 03, 2023 at 12:21:08PM +0000, Tvrtko Ursulin wrote:
>>
>> On 22/12/2022 22:21, Matthew Brost wrote:
>> > Hello,
>> >
>> > This is a submission for Xe, a new driver for Intel GPUs that supports both
>> > integrated and discrete platforms starting with Tiger Lake (first platform with
>> > Intel Xe Architecture). The intention of this new driver is to have a fresh base
>> > to work from that is unencumbered by older platforms, whilst also taking the
>> > opportunity to rearchitect our driver to increase sharing across the drm
>> > subsystem, both leveraging and allowing us to contribute more towards other
>> > shared components like TTM and drm/scheduler. The memory model is based on VM
>> > bind which is similar to the i915 implementation. Likewise the execbuf
>> > implementation for Xe is very similar to execbuf3 in the i915 [1].
>> >
>> > The code is at a stage where it is already functional and has experimental
>> > support for multiple platforms starting from Tiger Lake, with initial support
>> > implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
>> > as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
>> > implementation will be released publicly early next year. We also have a suite
>> > of IGTs for XE that will appear on the IGT list shortly.
>> >
>> > It has been built with the assumption of supporting multiple architectures from
>> > the get-go, right now with tests running both on X86 and ARM hosts. And we
>> > intend to continue working on it and improving on it as part of the kernel
>> > community upstream.
>> >
>> > The new Xe driver leverages a lot from i915 and work on i915 continues as we
>> > ready Xe for production throughout 2023.
>> >
>> > As for display, the intent is to share the display code with the i915 driver so
>> > that there is maximum reuse there. Currently this is being done by compiling the
>> > display code twice, but alternatives to that are under consideration and we want
>> > to have more discussion on what the best final solution will look like over the
>> > next few months. Right now, work is ongoing in refactoring the display codebase
>> > to remove as much as possible any unnecessary dependencies on i915 specific data
>> > structures there..
>> >
>> > We currently have 2 submission backends, execlists and GuC. The execlist is
>> > meant mostly for testing and is not fully functional while GuC backend is fully
>> > functional. As with the i915 and GuC submission, in Xe the GuC firmware is
>> > required and should be placed in /lib/firmware/xe.
>>
>> What is the plan going forward for the execlists backend? I think it would
>> be preferable to not upstream something semi-functional and so to carry
>> technical debt in the brand new code base, from the very start. If it is for
>> Tigerlake, which is the starting platform for Xe, could it be made GuC-only
>> for Tigerlake, for instance?
>>
>
>A little background here. In the original PoC written by Jason and Dave,
>the execlist backend was the only one present and it was in semi-working
>state. As soon as myself and a few others started working on Xe we went
>full in on the GuC backend. We left the execlist backend basically in
>the state it was in. We left it in place for 2 reasons.
>
>1. Having 2 backends from the start ensured we layered our code
>correctly. The layering was a complete disaster in the i915 so we really
>wanted to avoid that.
>2. The thought was it might be needed for early product bring up one
>day.
>
>As I think about this a bit more, we will likely just delete the execlist
>backend before merging this upstream and perhaps just carry 1 large patch
>internally with this implementation that we can use as needed. Final
>decision TBD though.

but that might regress after some time on the "let's keep 2 backends so
we layer the code correctly" front. Leaving the additional backend behind
CONFIG_BROKEN or XE_EXPERIMENTAL, or something like that, not enabled by
distros but enabled in CI, would be a good idea IMO.
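
Something like this Kconfig sketch is what I have in mind (symbol names
made up; swap the EXPERT dependency for BROKEN if it should be
completely dead outside specially configured builds):

config DRM_XE_EXECLIST
	bool "Xe execlist submission backend (experimental)"
	depends on DRM_XE
	depends on EXPERT
	help
	  Second submission backend, kept to enforce proper layering in
	  the driver and for early platform bring-up. Not functional
	  enough for production; meant to be enabled only in CI builds.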

Carrying a large patch out of tree would make things harder for new
platforms. A perfect backend split would make it possible, but like I
said, we are likely not to have it if we delete the second backend.

Lucas De Marchi

>
>Matt
>
>> Regards,
>>
>> Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12  9:32                         ` Daniel Vetter
@ 2023-01-12 10:11                           ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12 10:11 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, 12 Jan 2023 10:32:18 +0100
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > Hi Daniel,
> > 
> > On Wed, 11 Jan 2023 22:47:02 +0100
> > Daniel Vetter <daniel@ffwll.ch> wrote:
> >   
> > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > <boris.brezillon@collabora.com> wrote:  
> > > >
> > > > Hi Daniel,
> > > >
> > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >    
> > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:    
> > > > > > Hi Jason,
> > > > > >
> > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > >    
> > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > > wrote:
> > > > > > >    
> > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:    
> > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > >    
> > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > >    
> > > > > > > > > > > Hello Matthew,
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > > >    
> > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a
> > > > > > > > > > > > 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity.
> > > > > > > > > > > > At first this seems a bit odd but let us explain the reasoning
> > > > > > > > > > > > below.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is
> > > > > > > > > > > > not guaranteed to be the same as the completion order even if
> > > > > > > > > > > > targeting the same hardware engine. This is because in XE we have
> > > > > > > > > > > > a firmware scheduler, the GuC, which is allowed to reorder,
> > > > > > > > > > > > timeslice, and preempt submissions. If using a shared
> > > > > > > > > > > > drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > > > > > > > > > apart as the TDR expects submission order == completion order.
> > > > > > > > > > > > Using a dedicated drm_gpu_scheduler per drm_sched_entity solves
> > > > > > > > > > > > this problem.
> > > > > > > > > > >
> > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a
> > > > > > > > > > > FW-assisted scheduling scheme (you give the FW N streams to
> > > > > > > > > > > execute, and it does the scheduling between those N command
> > > > > > > > > > > streams, the kernel driver does timeslice scheduling to update the
> > > > > > > > > > > command streams passed to the FW). I must admit I gave up on using
> > > > > > > > > > > drm_sched at some point, mostly because the integration with
> > > > > > > > > > > drm_sched was painful, but also because I felt trying to bend
> > > > > > > > > > > drm_sched to make it interact with a timeslice-oriented scheduling
> > > > > > > > > > > model wasn't really future proof. Giving drm_sched_entity
> > > > > > > > > > > exclusive access to a drm_gpu_scheduler probably might help for a
> > > > > > > > > > > few things (didn't think it through yet), but I feel it's coming
> > > > > > > > > > > short on other aspects we have to deal with on Arm GPUs.
> > > > > > > > > >
> > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue    
> > > > > > > >
> > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > are met, and lastly a tracking mechanism of inflight jobs that need to
> > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > scheduler fits quite well.
> > > > > > > >    
> > > > > > >
> > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > >
> > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > categorization.)
> > > > > > >
> > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > >
> > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > windows.
> > > > > > >
> > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > >
> > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > >
> > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > > juggles those and feeds the limited FW queues.  In the case where you have few
> > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > able to juggle if you run out.    
> > > > > >
> > > > > > That's more or less what I do right now, except I don't use the
> > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > > interest in Xe's approach.    
> > > > >
> > > > > So another option for these few fw queue slots schedulers would be to
> > > > > treat them as vram and enlist ttm.
> > > > >
> > > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > > then also preferentially "evict" low-priority queues you pretty much have
> > > > > the perfect thing.
> > > > >
> > > > > Note that GuC with sriov splits up the id space, and together with some
> > > > > restrictions due to the multi-engine contexts media needs, it might also
> > > > > need all this.
> > > > >
> > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > > index allocation. Might be worth it to instead lift that into some sched
> > > > > helper code.    
> > > >
> > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > also don't know much about TTM, so I'll keep digging.    
> > > 
> > > ttm is about moving stuff in&out of a limited space and gives you some
> > > nice tooling for pipelining it all. It doesn't care whether that space
> > > is vram or some limited id space. vmwgfx used ttm as an id manager
> > > iirc.  
> > 
> > Ok.
> >   
> > >   
> > > > > Either way there's two imo rather solid approaches available to sort this
> > > > > out. And once you have that, then there shouldn't be any big difference in
> > > > > driver design between fw with defacto unlimited queue ids, and those with
> > > > > severe restrictions in number of queues.    
> > > >
> > > > Honestly, I don't think there's much difference between those two cases
> > > > already. There's just a bunch of additional code to schedule queues on
> > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > driver specific. The job queuing front-end pretty much achieves what
> > > > drm_sched does already: queuing job to entities, checking deps,
> > > > submitting job to HW (in our case, writing to the command stream ring
> > > > buffer). Things start to differ after that point: once a scheduling
> > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > re-evaluate, if needed.
> > > >
> > > > I'm all for using generic code when it makes sense, even if that means
> > > > adding this common code when it doesn't exist, but I don't want to be
> > > > dragged into some major refactoring that might take years to land.
> > > > Especially if pancsf is the first
> > > > FW-assisted-scheduler-with-few-FW-slot driver.    
> > > 
> > > I don't see where there's a major refactoring that you're getting dragged into?  
> > 
> > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > sure we're on the same page :-).
> >   
> > > 
> > > Yes there's a huge sprawling discussion right now, but I think that's
> > > just largely people getting confused.  
> > 
> > I definitely am :-).
> >   
> > > 
> > > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > point.  
> > 
> > Ok, thanks for the pointers. I'll have a look and see how I could use
> > that. I guess that's about getting access to the FW slots with some
> > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > case, I'll have to think about it, because that's a major shift from
> > what we're doing now, and I'm afraid this could lead to starving
> > non-resident entities if all resident entities keep receiving new jobs
> > to execute. Unless we put some sort of barrier when giving access to a
> > slot, so we evict the entity when it's done executing the stuff it had
> > when it was given access to this slot. But then, again, there are other
> > constraints to take into account for the Arm Mali CSF case:
> > 
> > - it's more efficient to update all FW slots at once, because each
> >   update of a slot might require updating priorities of the other slots
> >   (FW mandates unique slot priorities, and those priorities depend on
> >   the entity priority/queue-ordering)
> > - context/FW slot switches have a non-negligible cost (FW needs to
> >   suspend the context and save the state every time there is such a
> >   switch), so limiting the number of FW slot updates might prove
> >   important  
> 
> I frankly think you're overworrying. When you have 31+ contexts running at
> the same time, you have bigger problems. At that point there's two
> use-cases:
> 1. system is overloaded, the user will reach for the reset button anyway
> 2. temporary situation, all you have to do is be roughly fair enough to get
>    through it before case 1 happens.
>  
> Trying to write a perfect scheduler for this before we have actual
> benchmarks that justify the effort seems like pretty serious overkill.
> That's why I think the simplest solution is the one we should have:
> 
> - drm/sched frontend. If you get into slot exhaustion, that alone will
>   ensure enough fairness.

We're talking about the CS ring buffer slots here, right?

> 
> - LRU list of slots, with dma_fence so you can pipeline/batch up changes
>   as needed (but I honestly wouldn't worry about the batching before
>   you've shown an actual need for this in some benchmark/workload; even
>   piglit shouldn't have this many things running concurrently, I think,
>   since you don't have that many cpu cores). Between drm/sched and the
>   lru you will have an emergent scheduler that cycles through all
>   runnable gpu jobs.
> 
> - If you want to go fancy, have eviction tricks like skipping currently
>   still active gpu contexts with higher priority than the one that you need
>   to find a slot for.
> 
> - You don't need time slicing in this, not even for compute. Compute is
>   done with preempt context fences; if you give them a minimum scheduling
>   quantum you'll have a very basic round robin scheduler as an emergent
>   thing.
> 
> Any workload where it matters will be scheduled by the fw directly, with
> drm/sched only being the dma_fence dependency sorter. My take is that if
> you spend more than a hundred or so lines on slot allocation logic
> (excluding the hw code to load/unload a slot) you're probably doing some
> serious overengineering.

Let me see if I got this right:

- we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
  where hw_submission_limit == available_slots_in_ring_buf
- when ->run_job() is called, we write the RUN_JOB() instruction
  sequence to the next available ringbuf slot and queue the entity to
  the FW-slot queue
  * if a slot is directly available, we program the slot directly
  * if no slots are available, but some slots are done with the jobs
    they were given (last job fence signaled), we evict the LRU entity
    (possibly taking priority into account) and use this slot for the
    new entity
  * if no slots are available and all currently assigned slots
    contain busy entities, we queue the entity to a pending list
    (possibly one list per prio)

I'll need to make sure this still works with the concept of groups (it's
not a single queue we schedule, it's a group of queues, meaning that we
have N fences to watch to determine if the slot is busy or not, but
that should be okay; see the idle check sketched below).
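
For the group case, "slot is idle" then just becomes "every queue fence
in the group has signaled", along these lines (untested, all struct and
function names made up):

#include <linux/dma-fence.h>

struct my_queue {
	struct dma_fence *last_fence;	/* last job queued, NULL if none */
};

struct my_group {
	struct my_queue *queues;
	unsigned int queue_count;
};

/* A FW slot is re-usable only once every queue in the group is done. */
static bool my_group_is_idle(struct my_group *grp)
{
	unsigned int i;

	for (i = 0; i < grp->queue_count; i++) {
		struct dma_fence *f = grp->queues[i].last_fence;

		if (f && !dma_fence_is_signaled(f))
			return false;
	}

	return true;
}

The eviction logic would call my_group_is_idle() where a single-queue
driver would check one fence.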

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12 10:11                           ` Boris Brezillon
@ 2023-01-12 10:25                             ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12 10:25 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, 12 Jan 2023 11:11:03 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Thu, 12 Jan 2023 10:32:18 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:  
> > > Hi Daniel,
> > > 
> > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > >     
> > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > <boris.brezillon@collabora.com> wrote:    
> > > > >
> > > > > Hi Daniel,
> > > > >
> > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >      
> > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:      
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > >      
> > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > > > wrote:
> > > > > > > >      
> > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:      
> > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > >      
> > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > >      
> > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > > > >      
> > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have
> > > > > > > > > > > > > a 1 to 1 mapping between a drm_gpu_scheduler and
> > > > > > > > > > > > > drm_sched_entity. At first this seems a bit odd but let us
> > > > > > > > > > > > > explain the reasoning below.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is
> > > > > > > > > > > > > not guaranteed to be the same as the completion order even if
> > > > > > > > > > > > > targeting the same hardware engine. This is because in XE we
> > > > > > > > > > > > > have a firmware scheduler, the GuC, which is allowed to reorder,
> > > > > > > > > > > > > timeslice, and preempt submissions. If using a shared
> > > > > > > > > > > > > drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> > > > > > > > > > > > > falls apart as the TDR expects submission order == completion
> > > > > > > > > > > > > order. Using a dedicated drm_gpu_scheduler per drm_sched_entity
> > > > > > > > > > > > > solves this problem.
> > > > > > > > > > > >
> > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort
> > > > > > > > > > > > of issues to support Arm's new Mali GPU which is relying on a
> > > > > > > > > > > > FW-assisted scheduling scheme (you give the FW N streams to
> > > > > > > > > > > > execute, and it does the scheduling between those N command
> > > > > > > > > > > > streams, the kernel driver does timeslice scheduling to update
> > > > > > > > > > > > the command streams passed to the FW). I must admit I gave up on
> > > > > > > > > > > > using drm_sched at some point, mostly because the integration
> > > > > > > > > > > > with drm_sched was painful, but also because I felt trying to
> > > > > > > > > > > > bend drm_sched to make it interact with a timeslice-oriented
> > > > > > > > > > > > scheduling model wasn't really future proof. Giving
> > > > > > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler probably
> > > > > > > > > > > > might help for a few things (didn't think it through yet), but I
> > > > > > > > > > > > feel it's coming short on other aspects we have to deal with on
> > > > > > > > > > > > Arm GPUs.
> > > > > > > > > > >
> > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue      
> > > > > > > > >
> > > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > > are met, and lastly a tracking mechanism of inflight jobs that need to
> > > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > > scheduler fits quite well.
> > > > > > > > >      
> > > > > > > >
> > > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > > >
> > > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > > categorization.)
> > > > > > > >
> > > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > > >
> > > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > > windows.
> > > > > > > >
> > > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > > >
> > > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > > >
> > > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > > juggles those and feeds the limited FW queues.  In the case where you have few
> > > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > > able to juggle if you run out.      
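
A rough sketch of that juggling path, assuming 31 FW slots and a per-device
slot manager; first_runnable_group(), evict_lru_idle_group(), fw_load_group()
and mark_resident() are hypothetical helpers, the point is only the shape of
the loop:

#include <linux/bitmap.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

#define NUM_FW_SLOTS 31

struct fw_group;

struct fw_slot_mgr {
        DECLARE_BITMAP(busy, NUM_FW_SLOTS);
        struct list_head lru;           /* resident groups, LRU order */
        struct list_head runnable;      /* groups waiting for a slot */
        spinlock_t lock;
        struct work_struct juggle_work;
};

/* Hypothetical helpers, named for what they would do: */
static struct fw_group *first_runnable_group(struct fw_slot_mgr *mgr);
static int evict_lru_idle_group(struct fw_slot_mgr *mgr);
static void fw_load_group(struct fw_group *grp, int slot);
static void mark_resident(struct fw_slot_mgr *mgr, struct fw_group *grp);

static void juggle_worker(struct work_struct *work)
{
        struct fw_slot_mgr *mgr = container_of(work, struct fw_slot_mgr,
                                               juggle_work);
        struct fw_group *grp;
        unsigned long flags;
        int slot;

        spin_lock_irqsave(&mgr->lock, flags);
        while ((grp = first_runnable_group(mgr))) {
                slot = find_first_zero_bit(mgr->busy, NUM_FW_SLOTS);
                if (slot == NUM_FW_SLOTS) {
                        /*
                         * All slots taken: evict the least-recently-used
                         * group whose job fences have all signaled.
                         */
                        slot = evict_lru_idle_group(mgr);
                        if (slot < 0)
                                break;  /* everything busy, retry later */
                }
                fw_load_group(grp, slot);       /* program the FW slot */
                __set_bit(slot, mgr->busy);
                mark_resident(mgr, grp);        /* move to LRU tail */
        }
        spin_unlock_irqrestore(&mgr->lock, flags);
}
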
> > > > > > >
> > > > > > > That's more or less what I do right now, except I don't use the
> > > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet).
> > > > > > > The kernel-side timeslice-based scheduler juggling runnable queues
> > > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > > uses a dedicated ordered workqueue instead of a thread, with scheduler
> > > > > > > ticks handled by a delayed work (a tick happens every X milliseconds
> > > > > > > when queues are waiting for a slot). It all seems very HW/FW-specific
> > > > > > > though, and I think it's a bit premature to try to generalize that
> > > > > > > part, but the dep-tracking logic implemented by drm_sched looked like
> > > > > > > something I could easily re-use, hence my interest in Xe's approach.
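
In workqueue terms, such a tick could look like the sketch below (purely
illustrative; reevaluate_residency(), the tick_sched structure and the tick
period are assumptions, not pancsf code):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

#define TICK_PERIOD_MS 10               /* stand-in for the "X milliseconds" */

struct tick_sched {
        struct workqueue_struct *wq;    /* from alloc_ordered_workqueue() */
        struct delayed_work tick_work;
        bool queues_waiting;            /* runnable queues lacking a slot */
};

/* Hypothetical: re-pick which groups should be resident on FW slots. */
static void reevaluate_residency(struct tick_sched *ts);

static void sched_tick(struct work_struct *work)
{
        struct delayed_work *dwork = to_delayed_work(work);
        struct tick_sched *ts = container_of(dwork, struct tick_sched,
                                             tick_work);

        reevaluate_residency(ts);

        /* Keep ticking only while someone is waiting for a FW slot. */
        if (ts->queues_waiting)
                queue_delayed_work(ts->wq, &ts->tick_work,
                                   msecs_to_jiffies(TICK_PERIOD_MS));
}
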
> > > > > >
> > > > > > So another option for these few-FW-queue-slot schedulers would be to
> > > > > > treat the slots as vram and enlist ttm.
> > > > > >
> > > > > > Well, maybe more "enlist ttm" and less "treat them like vram", but ttm
> > > > > > can handle an idr (or xarray or whatever you want) and then help you
> > > > > > with all the pipelining (and drm_sched then with sorting out
> > > > > > dependencies). If you then also preferentially "evict" low-priority
> > > > > > queues you pretty much have the perfect thing.
> > > > > >
> > > > > > Note that GuC with sriov splits up the id space, and together with some
> > > > > > restrictions due to multi-engine contexts, media use-cases might also
> > > > > > need all of this.
> > > > > >
> > > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > > management, amdgpu has a shoddy version of id allocation for their
> > > > > > vm/tlb index allocation. Might be worth it to instead lift that into
> > > > > > some sched helper code.
> > > > >
> > > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > > also don't know much about TTM, so I'll keep digging.      
> > > > 
> > > > ttm is about moving stuff in & out of a limited space and gives you
> > > > some nice tooling for pipelining it all. It doesn't care whether that
> > > > space is vram or some limited id space. vmwgfx used ttm as an id
> > > > manager, iirc.
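
For comparison, the bare id-allocation half of this is indeed tiny, e.g. with
an IDA; a sketch, assuming 31 slots (what enlisting TTM would add on top is
the fence-based pipelining of slot reuse):

#include <linux/idr.h>

static DEFINE_IDA(fw_slot_ida);

/* Grab any free FW slot id in [0, 30]; -ENOSPC means all slots are
 * taken and it's time to evict something. */
static int fw_slot_get(void)
{
        return ida_alloc_max(&fw_slot_ida, 30, GFP_KERNEL);
}

static void fw_slot_put(int slot)
{
        ida_free(&fw_slot_ida, slot);
}
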
> > > 
> > > Ok.
> > >     
> > > >     
> > > > > > Either way there are two imo rather solid approaches available to sort
> > > > > > this out. And once you have that, there shouldn't be any big difference
> > > > > > in driver design between fw with de facto unlimited queue ids and those
> > > > > > with severe restrictions on the number of queues.
> > > > >
> > > > > Honestly, I don't think there's much difference between those two cases
> > > > > already. There's just a bunch of additional code to schedule queues on
> > > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > > driver specific. The job queuing front-end pretty much achieves what
> > > > > drm_sched does already: queuing jobs to entities, checking deps,
> > > > > submitting jobs to HW (in our case, writing to the command stream ring
> > > > > buffer). Things start to differ after that point: once a scheduling
> > > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > > re-evaluate, if needed.
> > > > >
> > > > > I'm all for using generic code when it makes sense, even if that means
> > > > > adding this common code when it doesn't exist, but I don't want to be
> > > > > dragged into some major refactoring that might take years to land.
> > > > > Especially if pancsf is the first
> > > > > FW-assisted-scheduler-with-few-FW-slots driver.
> > > > 
> > > > I don't see where there's a major refactoring that you're getting dragged into?    
> > > 
> > > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > > sure we're on the same page :-).
> > >     
> > > > 
> > > > Yes there's a huge sprawling discussion right now, but I think that's
> > > > just largely people getting confused.    
> > > 
> > > I definitely am :-).
> > >     
> > > > 
> > > > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > > point.
> > > 
> > > Ok, thanks for the pointers. I'll have a look and see how I could use
> > > that. I guess that's about getting access to the FW slots with some
> > > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > > case, I'll have to think about it, because that's a major shift from
> > > what we're doing now, and I'm afraid this could lead to starving
> > > non-resident entities if all resident entities keep receiving new jobs
> > > to execute. Unless we put some sort of barrier when giving access to a
> > > slot, so we evict the entity when it's done executing the stuff it had
> > > when it was given access to this slot. But then, again, there are other
> > > constraints to take into account for the Arm Mali CSF case:
> > > 
> > > - it's more efficient to update all FW slots at once, because each
> > >   update of a slot might require updating priorities of the other slots
> > >   (FW mandates unique slot priorities, and those priorities depend on
> > >   the entity priority/queue-ordering)
> > > - context/FW slot switches have a non-negligible cost (the FW needs to
> > >   suspend the context and save its state every time there is such a
> > >   switch), so limiting the number of FW slot updates might prove
> > >   important
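
To illustrate the first constraint: a batched update presumably boils down to
one pass over the resident groups. A sketch, assuming the resident list is
kept sorted by drm_sched priority first and LRU position second, and that a
hypothetical fw_commit_slot_priorities() pushes the result to the FW in one
go:

#include <linux/list.h>
#include <linux/types.h>

#define NUM_FW_SLOTS 31

struct fw_group {
        struct list_head lru_link;      /* position in the resident list */
        u8 fw_prio;                     /* unique per-slot FW priority */
};

/* Hypothetical: apply all slot priorities in a single FW update. */
static void fw_commit_slot_priorities(struct list_head *resident);

static void fw_update_slot_priorities(struct list_head *resident)
{
        struct fw_group *grp;
        u8 prio = NUM_FW_SLOTS - 1;

        /* One walk hands out the unique priorities the FW mandates. */
        list_for_each_entry(grp, resident, lru_link)
                grp->fw_prio = prio--;

        fw_commit_slot_priorities(resident);
}
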
> > 
> > I frankly think you're over-worrying. When you have 31+ contexts running at
> > the same time, you have bigger problems. At that point there are two
> > use-cases:
> > 1. system is overloaded, the user will reach for the reset button anyway
> > 2. temporary situation, all you have to do is be roughly fair enough to get
> >    through it before case 1 happens.
> >  
> > Trying to write a perfect scheduler for this before we have actual
> > benchmarks that justify the effort seems like pretty serious overkill.
> > That's why I think the simplest solution is the one we should have:
> > 
> > - drm/sched frontend. If you get into slot exhaustion that alone will
> >   ensure enough fairness  
> 
> We're talking about the CS ring buffer slots here, right?
> 
> > 
> > - LRU list of slots, with dma_fence so you can pipeline/batch up changes
> >   as needed (but I honestly wouldn't worry about the batching before
> >   you've shown an actual need for this in some benchmark/workload, even
> >   piglit shouldn't have this many things running concurrently I think, you
> >   don't have that many cpu cores). Between drm/sched and the lru you will
> >   have an emergent scheduler that cycles through all runnable gpu jobs.
> > 
> > - If you want to go fancy, have eviction tricks like skipping currently
> >   still active gpu contexts with a higher priority than the one that you
> >   need to find a slot for.
> > 
> > - You don't need time slicing in this, not even for compute. Compute is
> >   done with preempt context fences; if you give them a minimum scheduling
> >   quantum you'll have a very basic round-robin scheduler as an emergent
> >   thing.
> > 
> > Any workload where it matters will be scheduled by the fw directly, with
> > drm/sched only being the dma_fence dependency sorter. My take is that if
> > you spend more than a hundred or so lines on slot allocation logic
> > (excluding the hw code to load/unload a slot) you're probably doing some
> > serious overengineering.
> 
> Let me see if I got this right:
> 
> - we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
>   where hw_submission_limit == available_slots_in_ring_buf
> - when ->run_job() is called, we write the RUN_JOB() instruction
>   sequence to the next available ringbuf slot and queue the entity to
>   the FW-slot queue
>   * if a slot is directly available, we program the slot directly
>   * if no slots are available, but some slots are done with the jobs
>     they were given (last job fence signaled), we evict the LRU entity
>     (possibly taking priority into account) and use this slot for the
>     new entity
>   * if no slots are available and all currently assigned slots
>     contain busy entities, we queue the entity to a pending list
>     (possibly one list per prio)
> 
> I'll need to make sure this still works with the concept of group (it's
> not a single queue we schedule, it's a group of queues, meaning that we
> have N fences to watch to determine if the slot is busy or not, but
> that should be okay).

Oh, there's one other thing I forgot to mention: the FW scheduler is
not entirely fair, it does take the slot priority (which has to be
unique across all currently assigned slots) into account when
scheduling groups. So, ideally, we'd want to rotate group priorities
when they share the same drm_sched_priority (probably based on the
position in the LRU).
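
Putting those pieces together, a ->run_job() implementing the flow summarized
above might look like this sketch (illustrative only: my_job, fw_group,
ringbuf_emit_run_job() and fw_ring_doorbell() are made-up names, and the slot
manager has the same shape as the juggling sketch earlier in the thread):

#include <drm/gpu_scheduler.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct fw_slot_mgr {
        spinlock_t lock;
        struct list_head lru;           /* resident groups, LRU order */
        struct list_head runnable;      /* groups waiting for a FW slot */
        struct work_struct juggle_work;
};

struct fw_group {
        struct fw_slot_mgr *mgr;
        struct list_head link;          /* on mgr->runnable while waiting */
        struct list_head lru_link;      /* on mgr->lru while resident */
        bool resident;
};

struct my_job {
        struct drm_sched_job base;
        struct fw_group *group;
};

/* Hypothetical: write the RUN_JOB() sequence, return the done fence. */
static struct dma_fence *ringbuf_emit_run_job(struct fw_group *grp,
                                              struct my_job *job);
/* Hypothetical: tell the FW the resident group has new work. */
static void fw_ring_doorbell(struct fw_group *grp);

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);
        struct fw_group *grp = job->group;
        struct dma_fence *done_fence;

        /*
         * hw_submission_limit matches the ring capacity, so a free
         * ringbuf slot is guaranteed when we get here.
         */
        done_fence = ringbuf_emit_run_job(grp, job);

        spin_lock(&grp->mgr->lock);
        if (grp->resident) {
                /* Already on a FW slot: refresh LRU position, kick the FW. */
                list_move_tail(&grp->lru_link, &grp->mgr->lru);
                fw_ring_doorbell(grp);
        } else {
                /* Let the juggler find (or evict) a FW slot for us. */
                list_add_tail(&grp->link, &grp->mgr->runnable);
                queue_work(system_wq, &grp->mgr->juggle_work);
        }
        spin_unlock(&grp->mgr->lock);

        return done_fence;      /* signals when the job completes */
}
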

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12 10:11                           ` Boris Brezillon
@ 2023-01-12 10:30                             ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12 10:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, 12 Jan 2023 11:11:03 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

[...]

> Let me see if I got this right:
> 
> - we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
>   where hw_submission_limit == available_slots_in_ring_buf
> - when ->run_job() is called, we write the RUN_JOB() instruction
>   sequence to the next available ringbuf slot and queue the entity to
>   the FW-slot queue
>   * if a slot is directly available, we program the slot directly
>   * if no slots are available, but some slots are done with the jobs
>     they were given (last job fence signaled), we evict the LRU entity
>     (possibly taking priority into account) and use this slot for the
>     new entity
>   * if no slots are available and all currently assigned slots
>     contain busy entities, we queue the entity to a pending list
>     (possibly one list per prio)

Forgot:

   * if the group is already resident, we just move the slot to the
     LRU list head.

> 
> I'll need to make sure this still works with the concept of group (it's
> not a single queue we schedule, it's a group of queues, meaning that we
> have N fences to watch to determine if the slot is busy or not, but
> that should be okay).


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-12 10:30                             ` Boris Brezillon
  0 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12 10:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel

On Thu, 12 Jan 2023 11:11:03 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Thu, 12 Jan 2023 10:32:18 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:  
> > > Hi Daniel,
> > > 
> > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > >     
> > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > <boris.brezillon@collabora.com> wrote:    
> > > > >
> > > > > Hi Daniel,
> > > > >
> > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >      
> > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:      
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > >      
> > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > > > wrote:
> > > > > > > >      
> > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:      
> > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > >      
> > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > >      
> > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > > > >      
> > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first      
> > > > > > > > > this      
> > > > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > > > > guaranteed to be the same completion even if targeting the same      
> > > > > > > > > hardware      
> > > > > > > > > > > > > engine. This is because in XE we have a firmware scheduler, the      
> > > > > > > > > GuC,      
> > > > > > > > > > > > > which allowed to reorder, timeslice, and preempt submissions. If a      
> > > > > > > > > using      
> > > > > > > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR      
> > > > > > > > > falls      
> > > > > > > > > > > > > apart as the TDR expects submission order == completion order.      
> > > > > > > > > Using a      
> > > > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this      
> > > > > > > > > problem.      
> > > > > > > > > > > >
> > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a      
> > > > > > > > > FW-assisted      
> > > > > > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > > > > > because the integration with drm_sched was painful, but also because      
> > > > > > > > > I      
> > > > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > > > timeslice-oriented scheduling model wasn't really future proof.      
> > > > > > > > > Giving      
> > > > > > > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably      
> > > > > > > > > might      
> > > > > > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.      
> > > > > > > > > > >
> > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue      
> > > > > > > > >
> > > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > > are met, and lastly a tracking mechanism of inflights jobs that need to
> > > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > > scheduler fits quite well.
> > > > > > > > >      
> > > > > > > >
> > > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > > >
> > > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > > categorization.)
> > > > > > > >
> > > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > > >
> > > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > > windows.
> > > > > > > >
> > > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > > >
> > > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > > >
> > > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > > juggles those feeds the limited FW queues.  In the case where you have few
> > > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > > able to juggle if you run out.      
> > > > > > >
> > > > > > > That's more or less what I do right now, except I don't use the
> > > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > > > interest in Xe's approach.      
> > > > > >
> > > > > > So another option for these few fw queue slots schedulers would be to
> > > > > > treat them as vram and enlist ttm.
> > > > > >
> > > > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > > > then also preferentially "evict" low-priority queues you pretty much have
> > > > > > the perfect thing.
> > > > > >
> > > > > > Note that GuC with SR-IOV splits up the id space, and together with some
> > > > > > restrictions due to multi-engine contexts, media needs might also require
> > > > > > all of this.
> > > > > >
> > > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > > > index allocation. Might be worth it to instead lift that into some sched
> > > > > > helper code.      
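For the non-TTM variant, a trivial xarray-based slot-id allocator in the
spirit of lifting amdgpu's id allocation into helper code might look like
this (a hypothetical helper, not an existing drm/sched API):

    #include <linux/xarray.h>

    struct slot_mgr {
        struct xarray slots;
        u32 num_slots;
    };

    static void slot_mgr_init(struct slot_mgr *mgr, u32 num_slots)
    {
        xa_init_flags(&mgr->slots, XA_FLAGS_ALLOC);
        mgr->num_slots = num_slots;
    }

    static int slot_mgr_alloc(struct slot_mgr *mgr, void *owner, u32 *id)
    {
        /*
         * Fails with -EBUSY once all FW slots are handed out; the
         * caller then evicts an LRU/low-priority owner and retries.
         */
        return xa_alloc(&mgr->slots, id, owner,
                        XA_LIMIT(0, mgr->num_slots - 1), GFP_KERNEL);
    }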
> > > > >
> > > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > > also don't know much about TTM, so I'll keep digging.      
> > > > 
> > > > ttm is about moving stuff in and out of a limited space and gives you some
> > > > nice tooling for pipelining it all. It doesn't care whether that space
> > > > is vram or some limited id space. vmwgfx used ttm as an id manager
> > > > iirc.    
> > > 
> > > Ok.
> > >     
> > > >     
> > > > > > Either way there's two imo rather solid approaches available to sort this
> > > > > > out. And once you have that, then there shouldn't be any big difference in
> > > > > > driver design between fw with de facto unlimited queue ids, and those with
> > > > > > severe restrictions in number of queues.      
> > > > >
> > > > > Honestly, I don't think there's much difference between those two cases
> > > > > already. There's just a bunch of additional code to schedule queues on
> > > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > > driver specific. The job queuing front-end pretty much achieves what
> > > > > drm_sched does already: queuing jobs to entities, checking deps,
> > > > > submitting jobs to HW (in our case, writing to the command stream ring
> > > > > buffer). Things start to differ after that point: once a scheduling
> > > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > > re-evaluate, if needed.
> > > > >
> > > > > I'm all for using generic code when it makes sense, even if that means
> > > > > adding this common code when it doesn't exist, but I don't want to be
> > > > > dragged into some major refactoring that might take years to land.
> > > > > Especially if pancsf is the first
> > > > > FW-assisted-scheduler-with-few-FW-slot driver.      
> > > > 
> > > > I don't see where there's a major refactoring that you're getting dragged into?    
> > > 
> > > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > > sure we're on the same page :-).
> > >     
> > > > 
> > > > Yes there's a huge sprawling discussion right now, but I think that's
> > > > just largely people getting confused.    
> > > 
> > > I definitely am :-).
> > >     
> > > > 
> > > > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > > point.    
> > > 
> > > Ok, thanks for the pointers. I'll have a look and see how I could use
> > > that. I guess that's about getting access to the FW slots with some
> > > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > > case, I'll have to think about it, because that's a major shift from
> > > what we're doing now, and I'm afraid this could lead to starving
> > > non-resident entities if all resident entities keep receiving new jobs
> > > to execute. Unless we put some sort of barrier when giving access to a
> > > slot, so we evict the entity when it's done executing the stuff it had
> > > when it was given access to this slot. But then, again, there are other
> > > constraints to take into account for the Arm Mali CSF case:
> > > 
> > > - it's more efficient to update all FW slots at once, because each
> > >   update of a slot might require updating priorities of the other slots
> > >   (FW mandates unique slot priorities, and those priorities depend on
> > >   the entity priority/queue-ordering)
> > > - context/FW slot switches have a non-negligible cost (FW needs to
> > >   suspend the context and save the state every time there is such a
> > >   switch), so, limiting the number of FW slot updates might prove
> > >   important    
> > 
> > I frankly think you're overworrying. When you have 31+ contexts running at
> > the same time, you have bigger problems. At that point there's two
> > use-cases:
> > 1. system is overloaded, the user will reach for the reset button anyway
> > 2. temporary situation, all you have to do is be roughly fair enough to get
> >    through it before case 1 happens.
> >  
> > Trying to write a perfect scheduler for this before we have actual
> > benchmarks that justify the effort seems like pretty serious overkill.
> > That's why I think the simplest solution is the one we should have:
> > 
> > - drm/sched frontend. If you get into slot exhaustion that alone will
> >   ensure enough fairness  
> 
> We're talking about the CS ring buffer slots here, right?
> 
> > 
> > - LRU list of slots, with dma_fence so you can pipeline/batch up changes
> >   as needed (but I honestly wouldn't worry about the batching before
> >   you've shown an actual need for this in some benchmark/workload, even
> >   piglit shouldn't have this many things running concurrently I think, you
> >   don't have that many cpu cores). Between drm/sched and the lru you will
> >   have an emergent scheduler that cycles through all runnable gpu jobs.
> > 
> > - If you want to go fancy, have eviction tricks like skipping currently
> >   still active gpu contexts with higher priority than the one that you need
> >   to find a slot for.
> > 
> > - You don't need time slicing in this, not even for compute. compute is
> >   done with preempt context fences, if you give them a minimum scheduling
> >   quanta you'll have a very basic round robin scheduler as an emergent
> >   thing.
> > 
> > Any workload where it matters will be scheduled by the fw directly, with
> > drm/sched only being the dma_fence dependency sorter. My take is that if you
> > spend more than a hundred or so lines with slot allocation logic
> > (excluding the hw code to load/unload a slot) you're probably doing some
> > serious overengineering.  
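A sketch of the LRU-of-slots idea from the list above (fw_slot and
slot_is_idle() are hypothetical): pick a free slot if there is one,
otherwise steal the least-recently-used idle one:

    #include <linux/list.h>

    struct fw_slot {
        struct list_head lru_node;  /* on the LRU, head == coldest */
        void *owner;                /* NULL if the slot is free */
    };

    static struct fw_slot *slot_get(struct list_head *lru)
    {
        struct fw_slot *slot;

        list_for_each_entry(slot, lru, lru_node) {
            if (!slot->owner || slot_is_idle(slot)) {
                /* Free or evictable: claim it and make it hottest. */
                list_move_tail(&slot->lru_node, lru);
                return slot;
            }
        }
        /* Everything busy: wait on one of the slots' job fences. */
        return NULL;
    }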
> 
> Let me see if I got this right:
> 
> - we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
>   where hw_submission_limit == available_slots_in_ring_buf
> - when ->run_job() is called, we write the RUN_JOB() instruction
>   sequence to the next available ringbuf slot and queue the entity to
>   the FW-slot queue
>   * if a slot is directly available, we program the slot directly
>   * if no slots are available, but some slots are done with the jobs
>     they were given (last job fence signaled), we evict the LRU entity
>     (possibly taking priority into account) and use this slot for the
>     new entity
>   * if no slots are available and all currently assigned slots
>     contain busy entities, we queue the entity to a pending list
>     (possibly one list per prio)

Forgot:

   * if the group is already resident, we just move the slot to the
     LRU list head.

> 
> I'll need to make sure this still works with the concept of group (it's
> not a single queue we schedule, it's a group of queues, meaning that we
> have N fences to watch to determine if the slot is busy or not, but
> that should be okay).
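For the group case, "slot is busy" just becomes "any of the group's queues
still has an unsignalled last fence", e.g. (hypothetical structures):

    static bool group_is_idle(struct my_group *grp)
    {
        int i;

        /* Idle only once every queue's last job fence has signalled. */
        for (i = 0; i < grp->num_queues; i++) {
            if (grp->queues[i].last_fence &&
                !dma_fence_is_signaled(grp->queues[i].last_fence))
                return false;
        }
        return true;
    }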


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12 10:25                             ` Boris Brezillon
@ 2023-01-12 10:42                               ` Daniel Vetter
  -1 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-01-12 10:42 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> On Thu, 12 Jan 2023 11:11:03 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > On Thu, 12 Jan 2023 10:32:18 +0100
> > Daniel Vetter <daniel@ffwll.ch> wrote:
> > 
> > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:  
> > > > Hi Daniel,
> > > > 
> > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >     
> > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > > <boris.brezillon@collabora.com> wrote:    
> > > > > >
> > > > > > Hi Daniel,
> > > > > >
> > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > >      
> > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:      
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > >      
> > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > wrote:
> > > > > > > > >      
> > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:      
> > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > >      
> > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > > >      
> > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > > > > >      
> > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > > > > > guaranteed to match the completion order even if targeting the same
> > > > > > > > > > > > > > hardware engine. This is because in XE we have a firmware scheduler,
> > > > > > > > > > > > > > the GuC, which is allowed to reorder, timeslice, and preempt
> > > > > > > > > > > > > > submissions. If using a shared drm_gpu_scheduler across multiple
> > > > > > > > > > > > > > drm_sched_entity, the TDR falls apart as the TDR expects submission
> > > > > > > > > > > > > > order == completion order. Using a dedicated drm_gpu_scheduler per
> > > > > > > > > > > > > > drm_sched_entity solves this problem.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > > > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > > > > > > because the integration with drm_sched was painful, but also because I
> > > > > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > > > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > > > > > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue      
> > > > > > > > > >
> > > > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > > > are met, and lastly a tracking mechanism of in-flight jobs that need to
> > > > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > > > scheduler fits quite well.
> > > > > > > > > >      
> > > > > > > > >
> > > > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > > > >
> > > > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > > > categorization.)
> > > > > > > > >
> > > > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > > > >
> > > > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > > > windows.
> > > > > > > > >
> > > > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > > > >
> > > > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > > > >
> > > > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > > > juggles those and feeds the limited FW queues.  In the case where you have few
> > > > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > > > able to juggle if you run out.      
> > > > > > > >
> > > > > > > > That's more or less what I do right now, except I don't use the
> > > > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > > > > interest in Xe's approach.      
> > > > > > >
> > > > > > > So another option for these few fw queue slots schedulers would be to
> > > > > > > treat them as vram and enlist ttm.
> > > > > > >
> > > > > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > > > > then also preferentially "evict" low-priority queues you pretty much have
> > > > > > > the perfect thing.
> > > > > > >
> > > > > > > Note that GuC with SR-IOV splits up the id space, and together with some
> > > > > > > restrictions due to multi-engine contexts, media needs might also require
> > > > > > > all of this.
> > > > > > >
> > > > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > > > > index allocation. Might be worth it to instead lift that into some sched
> > > > > > > helper code.      
> > > > > >
> > > > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > > > also don't know much about TTM, so I'll keep digging.      
> > > > > 
> > > > > ttm is about moving stuff in and out of a limited space and gives you some
> > > > > nice tooling for pipelining it all. It doesn't care whether that space
> > > > > is vram or some limited id space. vmwgfx used ttm as an id manager
> > > > > iirc.    
> > > > 
> > > > Ok.
> > > >     
> > > > >     
> > > > > > > Either way there's two imo rather solid approaches available to sort this
> > > > > > > out. And once you have that, then there shouldn't be any big difference in
> > > > > > > driver design between fw with de facto unlimited queue ids, and those with
> > > > > > > severe restrictions in number of queues.      
> > > > > >
> > > > > > Honestly, I don't think there's much difference between those two cases
> > > > > > already. There's just a bunch of additional code to schedule queues on
> > > > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > > > driver specific. The job queuing front-end pretty much achieves what
> > > > > > drm_sched does already: queuing jobs to entities, checking deps,
> > > > > > submitting jobs to HW (in our case, writing to the command stream ring
> > > > > > buffer). Things start to differ after that point: once a scheduling
> > > > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > > > re-evaluate, if needed.
> > > > > >
> > > > > > I'm all for using generic code when it makes sense, even if that means
> > > > > > adding this common code when it doesn't exist, but I don't want to be
> > > > > > dragged into some major refactoring that might take years to land.
> > > > > > Especially if pancsf is the first
> > > > > > FW-assisted-scheduler-with-few-FW-slot driver.      
> > > > > 
> > > > > I don't see where there's a major refactoring that you're getting dragged into?    
> > > > 
> > > > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > > > sure we're on the same page :-).
> > > >     
> > > > > 
> > > > > Yes there's a huge sprawling discussion right now, but I think that's
> > > > > just largely people getting confused.    
> > > > 
> > > > I definitely am :-).
> > > >     
> > > > > 
> > > > > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > > > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > > > point.    
> > > > 
> > > > Ok, thanks for the pointers. I'll have a look and see how I could use
> > > > that. I guess that's about getting access to the FW slots with some
> > > > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > > > case, I'll have to think about it, because that's a major shift from
> > > > what we're doing now, and I'm afraid this could lead to starving
> > > > non-resident entities if all resident entities keep receiving new jobs
> > > > to execute. Unless we put some sort of barrier when giving access to a
> > > > slot, so we evict the entity when it's done executing the stuff it had
> > > > when it was given access to this slot. But then, again, there are other
> > > > constraints to take into account for the Arm Mali CSF case:
> > > > 
> > > > - it's more efficient to update all FW slots at once, because each
> > > >   update of a slot might require updating priorities of the other slots
> > > >   (FW mandates unique slot priorities, and those priorities depend on
> > > >   the entity priority/queue-ordering)
> > > > - context/FW slot switches have a non-negligible cost (FW needs to
> > > >   suspend the context and save the state every time there is such a
> > > >   switch), so, limiting the number of FW slot updates might prove
> > > >   important    
> > > 
> > > I frankly think you're overworrying. When you have 31+ contexts running at
> > > the same time, you have bigger problems. At that point there's two
> > > use-cases:
> > > 1. system is overloaded, the user will reach for the reset button anyway
> > > 2. temporary situation, all you have to do is be roughly fair enough to get
> > >    through it before case 1 happens.
> > >  
> > > Trying to write a perfect scheduler for this before we have actual
> > > benchmarks that justify the effort seems like pretty serious overkill.
> > > That's why I think the simplest solution is the one we should have:
> > > 
> > > - drm/sched frontend. If you get into slot exhaustion that alone will
> > >   ensure enough fairness  
> > 
> > We're talking about the CS ring buffer slots here, right?
> > 
> > > 
> > > - LRU list of slots, with dma_fence so you can pipeline/batch up changes
> > >   as needed (but I honestly wouldn't worry about the batching before
> > >   you've shown an actual need for this in some benchmark/workload, even
> > >   piglit shouldn't have this many things running concurrently I think, you
> > >   don't have that many cpu cores). Between drm/sched and the lru you will
> > >   have an emergent scheduler that cycles through all runnable gpu jobs.
> > > 
> > > - If you want to go fancy, have eviction tricks like skipping currently
> > >   still active gpu contexts with higher priority than the one that you need
> > >   to find a slot for.
> > > 
> > > - You don't need time slicing in this, not even for compute. compute is
> > >   done with preempt context fences, if you give them a minimum scheduling
> > >   quanta you'll have a very basic round robin scheduler as an emergent
> > >   thing.
> > > 
> > > Any workload where it matters will be scheduled by the fw directly, with
> > > drm/sched only being the dma_fence dependency sorter. My take is that if you
> > > spend more than a hundred or so lines with slot allocation logic
> > > (excluding the hw code to load/unload a slot) you're probably doing some
> > > serious overengineering.  
> > 
> > Let me see if I got this right:
> > 
> > - we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
> >   where hw_submission_limit == available_slots_in_ring_buf
> > - when ->run_job() is called, we write the RUN_JOB() instruction
> >   sequence to the next available ringbuf slot and queue the entity to
> >   the FW-slot queue
> >   * if a slot is directly available, we program the slot directly
> >   * if no slots are available, but some slots are done with the jobs
> >     they were given (last job fence signaled), we evict the LRU entity
> >     (possibly taking priority into account) and use this slot for the
> >     new entity
> >   * if no slots are available and all currently assigned slots
> >     contain busy entities, we queue the entity to a pending list
> >     (possibly one list per prio)

You could also handle this in ->prepare_job, which is called after all the
default fences have signalled. That allows you to put the "wait for a
previous job to finish/unload" behind a dma_fence, which is how (I think
at least) you can get the emergent round-robin behaviour: if there's no
idle slot, you just pick all the fences from the currently busy job you
want to steal the slot from (with priority and lru taken into account),
let the scheduler wait for those to finish, and then it'll call your
run_job when the slot is already available.
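A sketch of that ->prepare_job idea (mydrv_* names and the victim-picking
helper are hypothetical; the hook returns a fence drm_sched waits on
before calling ->run_job()):

    static struct dma_fence *
    mydrv_prepare_job(struct drm_sched_job *sched_job,
                      struct drm_sched_entity *s_entity)
    {
        struct mydrv_group *grp = to_mydrv_group(s_entity);
        struct fw_slot *slot;

        if (grp->slot)
            return NULL;    /* already resident, run_job() can proceed */

        /*
         * Pick a victim slot (LRU order, priority taken into account)
         * and hand its "done" fence to drm_sched: ->run_job() will only
         * be called once the victim has finished and the slot is free.
         */
        slot = mydrv_pick_victim_slot(grp);
        return dma_fence_get(slot->last_fence);
    }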

Also, if you do the allocation in ->prepare_job with dma_fence and not
run_job, then I think we can sort out fairness issues (if they do pop up) in
the drm/sched code instead of having to think about this in each driver.
Having few fw sched slots essentially just makes fw scheduling unfairness
more prominent than elsewhere, but I don't think it's fundamentally
something else really.

If every ctx does that and the lru isn't too busted, they should then form
a nice orderly queue and cycle through the fw scheduler, while still being
able to get some work done. It's essentially the exact same thing that
happens with ttm vram eviction, when you have a total working set where
each process fits in vram individually, but in total they're too big and
you need to cycle things through.

> > I'll need to make sure this still works with the concept of group (it's
> > not a single queue we schedule, it's a group of queues, meaning that we
> > have N fences to watch to determine if the slot is busy or not, but
> > that should be okay).
> 
> Oh, there's one other thing I forgot to mention: the FW scheduler is
> not entirely fair, it does take the slot priority (which has to be
> unique across all currently assigned slots) into account when
> scheduling groups. So, ideally, we'd want to rotate group priorities
> when they share the same drm_sched_priority (probably based on the
> position in the LRU).

Hm that will make things a bit more fun I guess, especially with your
constraint to not update this too often. How strict is that priority
difference? If it's a lot, we might need to treat this more like execlists
and less like a real fw scheduler ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
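On the unique-slot-priority point above, rotating priorities by LRU
position could be as simple as this (hypothetical names; fw_slot as in the
earlier sketch plus a fw_prio field, assuming a bigger value means a
higher FW priority):

    static void slots_reassign_prios(struct list_head *lru, u8 num_slots)
    {
        struct fw_slot *slot;
        u8 prio = num_slots;

        /*
         * LRU head is coldest, tail is hottest: walk backwards so the
         * hottest resident group gets the highest unique priority.
         */
        list_for_each_entry_reverse(slot, lru, lru_node)
            slot->fw_prio = --prio;
    }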

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12 10:42                               ` Daniel Vetter
@ 2023-01-12 12:08                                 ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12 12:08 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, 12 Jan 2023 11:42:57 +0100
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> > On Thu, 12 Jan 2023 11:11:03 +0100
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> > > On Thu, 12 Jan 2023 10:32:18 +0100
> > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > >   
> > > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:    
> > > > > Hi Daniel,
> > > > > 
> > > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >       
> > > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > > > <boris.brezillon@collabora.com> wrote:      
> > > > > > >
> > > > > > > Hi Daniel,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > > >        
> > > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:        
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > >        
> > > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > wrote:
> > > > > > > > > >        
> > > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:        
> > > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > > >        
> > > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > > > >        
> > > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > > > > > >        
> > > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > > > > > > guaranteed to match the completion order even if targeting the same
> > > > > > > > > > > > > > > hardware engine. This is because in XE we have a firmware scheduler,
> > > > > > > > > > > > > > > the GuC, which is allowed to reorder, timeslice, and preempt
> > > > > > > > > > > > > > > submissions. If using a shared drm_gpu_scheduler across multiple
> > > > > > > > > > > > > > > drm_sched_entity, the TDR falls apart as the TDR expects submission
> > > > > > > > > > > > > > > order == completion order. Using a dedicated drm_gpu_scheduler per
> > > > > > > > > > > > > > > drm_sched_entity solves this problem.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > > > > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > > > > > > > because the integration with drm_sched was painful, but also because I
> > > > > > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > > > > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > > > > > > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue        
> > > > > > > > > > >
> > > > > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > > > > are met, and lastly a tracking mechanism of in-flight jobs that need to
> > > > > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > > > > scheduler fits quite well.
> > > > > > > > > > >        
> > > > > > > > > >
> > > > > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > > > > >
> > > > > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > > > > categorization.)
> > > > > > > > > >
> > > > > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > > > > >
> > > > > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > > > > windows.
> > > > > > > > > >
> > > > > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > > > > >
> > > > > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > > > > >
> > > > > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > > > > juggles those and feeds the limited FW queues.  In the case where you have few
> > > > > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > > > > able to juggle if you run out.        
> > > > > > > > >
> > > > > > > > > That's more or less what I do right now, except I don't use the
> > > > > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > > > > > interest in Xe's approach.        
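
A bare-bones sketch of that ordered-workqueue plus delayed-work tick
pattern; the my_* names are made up and this is untested:

#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct my_sched {
	struct workqueue_struct *wq;	/* ordered: ticks never overlap */
	struct delayed_work tick;
	unsigned long period_ms;
};

static bool my_sched_has_waiters(struct my_sched *sched); /* hypothetical */

static void my_sched_tick(struct work_struct *work)
{
	struct my_sched *sched = container_of(work, struct my_sched,
					      tick.work);

	/* re-evaluate runnable queues and rotate FW slots here */

	/* keep ticking only while queues are waiting for a slot */
	if (my_sched_has_waiters(sched))
		mod_delayed_work(sched->wq, &sched->tick,
				 msecs_to_jiffies(sched->period_ms));
}

static int my_sched_init(struct my_sched *sched)
{
	sched->wq = alloc_ordered_workqueue("my-sched", 0);
	if (!sched->wq)
		return -ENOMEM;
	INIT_DELAYED_WORK(&sched->tick, my_sched_tick);
	sched->period_ms = 10;	/* the "every X milliseconds" above */
	return 0;
}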
> > > > > > > >
> > > > > > > > So another option for these few fw queue slots schedulers would be to
> > > > > > > > treat them as vram and enlist ttm.
> > > > > > > >
> > > > > > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > > > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > > > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > > > > > then also preferentially "evict" low-priority queues you pretty much have
> > > > > > > > the perfect thing.
> > > > > > > >
> > > > > > > > Note that GuC with SR-IOV splits up the id space, and together with some
> > > > > > > > restrictions due to multi-engine contexts, media use-cases might also need
> > > > > > > > all of this.
> > > > > > > >
> > > > > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > > > > > index allocation. Might be worth it to instead lift that into some sched
> > > > > > > > helper code.        
> > > > > > >
> > > > > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > > > > also don't know much about TTM, so I'll keep digging.        
> > > > > > 
> > > > > > ttm is about moving stuff in&out of a limited space and gives you some
> > > > > > nice tooling for pipelining it all. It doesn't care whether that space
> > > > > > is vram or some limited id space. vmwgfx used ttm as an id manager
> > > > > > iirc.      
> > > > > 
> > > > > Ok.
> > > > >       
> > > > > >       
> > > > > > > > Either way there's two imo rather solid approaches available to sort this
> > > > > > > > out. And once you have that, then there shouldn't be any big difference in
> > > > > > > > driver design between fw with de facto unlimited queue ids, and those with
> > > > > > > > severe restrictions in number of queues.        
> > > > > > >
> > > > > > > Honestly, I don't think there's much difference between those two cases
> > > > > > > already. There's just a bunch of additional code to schedule queues on
> > > > > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > > > > driver specific. The job queuing front-end pretty much achieves what
> > > > > > > drm_sched does already: queuing jobs to entities, checking deps,
> > > > > > > submitting jobs to HW (in our case, writing to the command stream ring
> > > > > > > buffer). Things start to differ after that point: once a scheduling
> > > > > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > > > > re-evaluate, if needed.
> > > > > > >
> > > > > > > I'm all for using generic code when it makes sense, even if that means
> > > > > > > adding this common code when it doesn't exist, but I don't want to be
> > > > > > > dragged into some major refactoring that might take years to land.
> > > > > > > Especially if pancsf is the first
> > > > > > > FW-assisted-scheduler-with-few-FW-slot driver.        
> > > > > > 
> > > > > > I don't see where there's a major refactoring that you're getting dragged into?      
> > > > > 
> > > > > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > > > > sure we're on the same page :-).
> > > > >       
> > > > > > 
> > > > > > Yes there's a huge sprawling discussion right now, but I think that's
> > > > > > just largely people getting confused.      
> > > > > 
> > > > > I definitely am :-).
> > > > >       
> > > > > > 
> > > > > > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > > > > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > > > > point.      
> > > > > 
> > > > > Ok, thanks for the pointers. I'll have a look and see how I could use
> > > > > that. I guess that's about getting access to the FW slots with some
> > > > > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > > > > case, I'll have to think about it, because that's a major shift from
> > > > > what we're doing now, and I'm afraid this could lead to starving
> > > > > non-resident entities if all resident entities keep receiving new jobs
> > > > > to execute. Unless we put some sort of barrier when giving access to a
> > > > > slot, so we evict the entity when it's done executing the stuff it had
> > > > > when it was given access to this slot. But then, again, there are other
> > > > > constraints to take into account for the Arm Mali CSF case:
> > > > > 
> > > > > - it's more efficient to update all FW slots at once, because each
> > > > >   update of a slot might require updating priorities of the other slots
> > > > >   (FW mandates unique slot priorities, and those priorities depend on
> > > > >   the entity priority/queue-ordering)
> > > > > - context/FW slot switches have a non-negligible cost (FW needs to
> > > > >   suspend the context and save the state every time there is such a
> > > > >   switch), so, limiting the number of FW slot updates might prove
> > > > >   important      
> > > > 
> > > > I frankly think you're overworrying. When you have 31+ contexts running at
> > > > the same time, you have bigger problems. At that point there's two
> > > > use-cases:
> > > > 1. system is overloaded, the user will reach for reset button anyway
> > > > 2. temporary situation, all you have to do is be roughly fair enough to get
> > > >    through it before case 1 happens.
> > > >  
> > > > Trying to write a perfect scheduler for this before we have actual
> > > > benchmarks that justify the effort seems like pretty serious overkill.
> > > > That's why I think the simplest solution is the one we should have:
> > > > 
> > > > - drm/sched frontend. If you get into slot exhaustion that alone will
> > > >   ensure enough fairness    
> > > 
> > > We're talking about the CS ring buffer slots here, right?
> > >   
> > > > 
> > > > - LRU list of slots, with dma_fence so you can pipeline/batch up changes
> > > >   as needed (but I honestly wouldn't worry about the batching before
> > > >   you've shown an actual need for this in some benchmark/workload, even
> > > >   piglit shouldn't have this many things running concurrently I think, you
> > > >   don't have that many cpu cores). Between drm/sched and the lru you will
> > > >   have an emergent scheduler that cycles through all runnable gpu jobs.
> > > > 
> > > > - If you want to go fancy, have eviction tricks like skipping currently
> > > >   still active gpu contexts with higher priority than the one that you need
> > > >   to find a slot for.
> > > > 
> > > > - You don't need time slicing in this, not even for compute. Compute is
> > > >   done with preempt context fences, if you give them a minimum scheduling
> > > >   quanta you'll have a very basic round robin scheduler as an emergent
> > > >   thing.
> > > > 
> > > > Any workload where it matters will be scheduled by the fw directly, with
> > > > drm/sched only being the dma_fence dependency sorter. My take is that if you
> > > > spend more than a hundred or so lines with slot allocation logic
> > > > (excluding the hw code to load/unload a slot) you're probably doing some
> > > > serious overengineering.    
> > > 
> > > Let me see if I got this right:
> > > 
> > > - we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
> > >   where hw_submission_limit == available_slots_in_ring_buf
> > > - when ->run_job() is called, we write the RUN_JOB() instruction
> > >   sequence to the next available ringbuf slot and queue the entity to
> > >   the FW-slot queue
> > >   * if a slot is directly available, we program the slot directly
> > >   * if no slots are available, but some slots are done with the jobs
> > >     they were given (last job fence signaled), we evict the LRU entity
> > >     (possibly taking priority into account) and use this slot for the
> > >     new entity
> > >   * if no slots are available and all currently assigned slots
> > >     contain busy entities, we queue the entity to a pending list
> > >     (possibly one list per prio)  
> 
> You could also handle this in ->prepare_job, which is called after all the
> default fences have signalled. That allows you to put the "wait for a
> previous job to finish/unload" behind a dma_fence, which is how (I think
> at least) you can get the round-robin emergent behaviour: If there's no
> idle slot, you just pick all the fences from the currently busy job you
> want to steal the slot from (with priority and lru taken into account),
> let the scheduler wait for that to finish, and then it'll call your
> run_job when the slot is already available.

Ah, nice! It would also avoid queuing new jobs to a resident entity
when others are waiting for a FW slot, even if, in practice, I'm not
sure we should do that: context will be suspended when the group is
evicted anyway, and things could keep running in the meantime.
I'll give it a try, thanks for the suggestion!
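
For reference, a rough sketch of such a ->prepare_job(); my_* types and
helpers are made up, untested. The idea: find an idle slot, or hand the
victim's last-job fence to drm_sched so it does the waiting and calls
us again once the slot is reclaimable:

struct my_slot {
	struct dma_fence *last_job_fence;	/* last queued job's fence */
};

struct my_group {
	struct my_device *dev;
	struct my_slot *slot;			/* NULL if not resident */
};

static struct dma_fence *
my_prepare_job(struct drm_sched_job *sched_job,
	       struct drm_sched_entity *s_entity)
{
	struct my_group *group = my_group_from_entity(s_entity);
	struct my_slot *slot;

	if (group->slot)
		return NULL;		/* already resident on a FW slot */

	slot = my_find_idle_slot(group->dev);
	if (!slot) {
		/*
		 * No idle slot: pick a victim (LRU and priority aware)
		 * and return its last-job fence; drm_sched waits on it
		 * and calls us again once it has signalled.
		 */
		slot = my_pick_victim_slot(group->dev);
		return dma_fence_get(slot->last_job_fence);
	}

	my_assign_slot(slot, group);
	return NULL;			/* no more deps, ->run_job() may run */
}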

> 
> Also if you do the allocation in ->prepare_job with dma_fence and not
> run_job, then I think we can sort out fairness issues (if they do pop up) in
> the drm/sched code instead of having to think about this in each driver.

By allocation, you mean assigning a FW slot ID? If we do this allocation
in ->prepare_job(), couldn't we mess up ordering? Like,
lower-prio/later-queuing entity being scheduled before its peers,
because there's no guarantee on the job completion order (and thus the
queue idleness order). I mean, completion order depends on the kind of
job being executed by the queues, the time the FW actually lets the
queue execute things and probably other factors. You can use metrics
like the position in the LRU list + the amount of jobs currently
queued to a group to guess which one will be idle first, but that's
just a guess. And I'm not sure I see what doing this slot selection in
->prepare_job() would bring us compared to doing it in ->run_job(),
where we can just pick the least recently used slot.
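
Picking the least recently used slot in ->run_job() is essentially just
an LRU list operation, e.g. (made-up structures, untested):

struct my_device {
	spinlock_t slots_lock;
	struct list_head slots_lru;	/* head == least recently used */
};

struct my_slot {
	struct list_head lru_node;
};

static struct my_slot *my_grab_lru_slot(struct my_device *dev)
{
	struct my_slot *slot;

	spin_lock(&dev->slots_lock);
	slot = list_first_entry(&dev->slots_lru, struct my_slot, lru_node);
	/* mark it most recently used */
	list_move_tail(&slot->lru_node, &dev->slots_lru);
	spin_unlock(&dev->slots_lock);

	return slot;
}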

> Few fw sched slots essentially just make fw scheduling unfairness more
> prominent than with many slots, but I don't think it's fundamentally something
> else really.
> 
> If every ctx does that and the lru isn't too busted, they should then form
> a nice orderly queue and cycle through the fw scheduler, while still being
> able to get some work done. It's essentially the exact same thing that
> happens with ttm vram eviction, when you have a total working set where
> each process fits in vram individually, but in total they're too big and
> you need to cycle things through.

I see.

> 
> > > I'll need to make sure this still works with the concept of group (it's
> > > not a single queue we schedule, it's a group of queues, meaning that we
> > > have N fences to watch to determine if the slot is busy or not, but
> > > that should be okay).  
> > 
> > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > not entirely fair, it does take the slot priority (which has to be
> > unique across all currently assigned slots) into account when
> > scheduling groups. So, ideally, we'd want to rotate group priorities
> > when they share the same drm_sched_priority (probably based on the
> > position in the LRU).  
> 
> Hm that will make things a bit more fun I guess, especially with your
> constraint to not update this too often. How strict is that priority
> difference? If it's a lot, we might need to treat this more like execlist
> and less like a real fw scheduler ...

Strict as in, if two groups with the same priority try to request an
overlapping set of resources (cores or tilers), it can deadlock, so
pretty strict I would say :-).
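
One possible shape for that rotation, assuming the driver keeps the
resident groups on a list sorted by drm_sched priority and, within a
priority level, by LRU age (made-up names, untested):

static void my_rotate_slot_prios(struct my_device *dev)
{
	struct my_group *group;
	unsigned int fw_prio = 0;

	/*
	 * dev->resident is sorted lowest drm_sched priority first and,
	 * within a level, most recently used first, so groups sharing a
	 * drm_sched priority trade FW slot priorities over time.
	 */
	list_for_each_entry(group, &dev->resident, resident_node)
		group->fw_prio = fw_prio++;	/* unique per slot */

	/* then push all slot updates to the FW in a single batch */
}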

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12 12:08                                 ` Boris Brezillon
@ 2023-01-12 15:38                                   ` Daniel Vetter
  -1 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-01-12 15:38 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, 12 Jan 2023 at 13:08, Boris Brezillon
<boris.brezillon@collabora.com> wrote:
> On Thu, 12 Jan 2023 11:42:57 +0100
> Daniel Vetter <daniel@ffwll.ch> wrote:
>
> > On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> > > On Thu, 12 Jan 2023 11:11:03 +0100
> > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > >
> > > > On Thu, 12 Jan 2023 10:32:18 +0100
> > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >
> > > > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > > > > > Hi Daniel,
> > > > > >
> > > > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > >
> > > > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > > > > <boris.brezillon@collabora.com> wrote:
> > > > > > > >
> > > > > > > > Hi Daniel,
> > > > > > > >
> > > > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > > > Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > > > >
> > > > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > > > Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > > >
> > > > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> > > > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > > > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > > > Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > > > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > > > > > > > guaranteed to match the completion order, even if targeting the same
> > > > > > > > > > > > > > > > hardware engine. This is because in XE we have a firmware scheduler, the
> > > > > > > > > > > > > > > > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > > > > > > > > > > > > > > > using a shared drm_gpu_scheduler across multiple drm_sched_entity, the
> > > > > > > > > > > > > > > > TDR falls apart, as the TDR expects submission order == completion
> > > > > > > > > > > > > > > > order. Using a dedicated drm_gpu_scheduler per drm_sched_entity solves
> > > > > > > > > > > > > > > > this problem.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > > > > > > > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > > > > > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > > > > > > > > > does timeslice scheduling to update the command streams passed to the
> > > > > > > > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > > > > > > > > > > because the integration with drm_sched was painful, but also because I
> > > > > > > > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > > > > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > > > > > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler probably might
> > > > > > > > > > > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > > > > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > > > > > > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > > > > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > > > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > > > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > > > > > > > > >
> > > > > > > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > > > > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > > > > > > > > > are met, and lastly a tracking mechanism of in-flight jobs that need to
> > > > > > > > > > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > > > > > > > > > aside from the most basic level of not overflowing the submission ring
> > > > > > > > > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > > > > > > > > scheduler fits quite well.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > > > > > > > > > want here and what you need for Arm thanks to the number of FW queues
> > > > > > > > > > > available. I don't remember the exact number of GuC queues but it's at
> > > > > > > > > > > least 1k. This puts it in an entirely different class from what you have on
> > > > > > > > > > > Mali. Roughly, there's about three categories here:
> > > > > > > > > > >
> > > > > > > > > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > > > > > > > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > > > > > > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > > > > > > > > categorization.)
> > > > > > > > > > >
> > > > > > > > > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > > > > > > > > you're going to have to juggle in the kernel in order to run desktop Linux.
> > > > > > > > > > >
> > > > > > > > > > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > > > > > > > > > want the kernel scheduling anything. Just throw it at the firmware and let
> > > > > > > > > > > it go brrrrr.  If we ever run out of queues (unlikely), the kernel can
> > > > > > > > > > > temporarily pause some low-priority contexts and do some juggling or,
> > > > > > > > > > > frankly, just fail userspace queue creation and tell the user to close some
> > > > > > > > > > > windows.
> > > > > > > > > > >
> > > > > > > > > > > The existence of this 2nd class is a bit annoying but it's where we are. I
> > > > > > > > > > > think it's worth recognizing that Xe and panfrost are in different places
> > > > > > > > > > > here and will require different designs. For Xe, we really are just using
> > > > > > > > > > > drm/scheduler as a front-end and the firmware does all the real scheduling.
> > > > > > > > > > >
> > > > > > > > > > > How do we deal with class 2? That's an interesting question.  We may
> > > > > > > > > > > eventually want to break that off into a separate discussion and not litter
> > > > > > > > > > > the Xe thread but let's keep going here for a bit.  I think there are some
> > > > > > > > > > > pretty reasonable solutions but they're going to look a bit different.
> > > > > > > > > > >
> > > > > > > > > > > The way I did this for Xe with execlists was to keep the 1:1:1 mapping
> > > > > > > > > > > between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
> > > > > > > > > > > Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
> > > > > > > > > > > ring and then there was a tiny kernel which operated entirely in IRQ
> > > > > > > > > > > handlers which juggled those execlists by smashing HW registers.  For
> > > > > > > > > > > Panfrost, I think we want something slightly different but can borrow some
> > > > > > > > > > > ideas here.  In particular, have the schedulers feed kernel-side SW queues
> > > > > > > > > > > (they can even be fixed-size if that helps) and then have a kthread which
> > > > > > > > > > > juggles those and feeds the limited FW queues.  In the case where you have few
> > > > > > > > > > > enough active contexts to fit them all in FW, I do think it's best to have
> > > > > > > > > > > them all active in FW and let it schedule. But with only 31, you need to be
> > > > > > > > > > > able to juggle if you run out.
> > > > > > > > > >
> > > > > > > > > > That's more or less what I do right now, except I don't use the
> > > > > > > > > > drm_sched front-end to handle deps or queue jobs (at least not yet). The
> > > > > > > > > > kernel-side timeslice-based scheduler juggling with runnable queues
> > > > > > > > > > (queues with pending jobs that are not yet resident on a FW slot)
> > > > > > > > > > uses a dedicated ordered-workqueue instead of a thread, with scheduler
> > > > > > > > > > ticks being handled with a delayed-work (tick happening every X
> > > > > > > > > > milliseconds when queues are waiting for a slot). It all seems very
> > > > > > > > > > HW/FW-specific though, and I think it's a bit premature to try to
> > > > > > > > > > generalize that part, but the dep-tracking logic implemented by
> > > > > > > > > > drm_sched looked like something I could easily re-use, hence my
> > > > > > > > > > interest in Xe's approach.
> > > > > > > > >
> > > > > > > > > So another option for these few fw queue slots schedulers would be to
> > > > > > > > > treat them as vram and enlist ttm.
> > > > > > > > >
> > > > > > > > > Well maybe more enlist ttm and less treat them like vram, but ttm can
> > > > > > > > > handle idr (or xarray or whatever you want) and then help you with all the
> > > > > > > > > pipelining (and the drm_sched then with sorting out dependencies). If you
> > > > > > > > > then also preferentially "evict" low-priority queues you pretty much have
> > > > > > > > > the perfect thing.
> > > > > > > > >
> > > > > > > > > Note that GuC with SR-IOV splits up the id space, and together with some
> > > > > > > > > restrictions due to multi-engine contexts, media use-cases might also need
> > > > > > > > > all of this.
> > > > > > > > >
> > > > > > > > > If you're balking at the idea of enlisting ttm just for fw queue
> > > > > > > > > management, amdgpu has a shoddy version of id allocation for their vm/tlb
> > > > > > > > > index allocation. Might be worth it to instead lift that into some sched
> > > > > > > > > helper code.
> > > > > > > >
> > > > > > > > Would you mind pointing me to the amdgpu code you're mentioning here?
> > > > > > > > Still have a hard time seeing what TTM has to do with scheduling, but I
> > > > > > > > also don't know much about TTM, so I'll keep digging.
> > > > > > >
> > > > > > > ttm is about moving stuff in&out of a limited space and gives you some
> > > > > > > nice tooling for pipelining it all. It doesn't care whether that space
> > > > > > > is vram or some limited id space. vmwgfx used ttm as an id manager
> > > > > > > iirc.
> > > > > >
> > > > > > Ok.
> > > > > >
> > > > > > >
> > > > > > > > > Either way there's two imo rather solid approaches available to sort this
> > > > > > > > > out. And once you have that, then there shouldn't be any big difference in
> > > > > > > > > driver design between fw with de facto unlimited queue ids, and those with
> > > > > > > > > severe restrictions in number of queues.
> > > > > > > >
> > > > > > > > Honestly, I don't think there's much difference between those two cases
> > > > > > > > already. There's just a bunch of additional code to schedule queues on
> > > > > > > > FW slots for the limited-number-of-FW-slots case, which, right now, is
> > > > > > > > driver specific. The job queuing front-end pretty much achieves what
> > > > > > > > drm_sched does already: queuing job to entities, checking deps,
> > > > > > > > submitting job to HW (in our case, writing to the command stream ring
> > > > > > > > buffer). Things start to differ after that point: once a scheduling
> > > > > > > > entity has pending jobs, we add it to one of the runnable queues (one
> > > > > > > > queue per prio) and kick the kernel-side timeslice-based scheduler to
> > > > > > > > re-evaluate, if needed.
> > > > > > > >
> > > > > > > > I'm all for using generic code when it makes sense, even if that means
> > > > > > > > adding this common code when it doesn't exist, but I don't want to be
> > > > > > > > dragged into some major refactoring that might take years to land.
> > > > > > > > Especially if pancsf is the first
> > > > > > > > FW-assisted-scheduler-with-few-FW-slot driver.
> > > > > > >
> > > > > > > I don't see where there's a major refactoring that you're getting dragged into?
> > > > > >
> > > > > > Oh, no, I'm not saying this is the case just yet, just wanted to make
> > > > > > sure we're on the same page :-).
> > > > > >
> > > > > > >
> > > > > > > Yes there's a huge sprawling discussion right now, but I think that's
> > > > > > > just largely people getting confused.
> > > > > >
> > > > > > I definitely am :-).
> > > > > >
> > > > > > >
> > > > > > > Wrt the actual id assignment stuff, in amdgpu at least it's a few lines
> > > > > > > of code. See the amdgpu_vmid_grab stuff for the simplest starting
> > > > > > > point.
> > > > > >
> > > > > > Ok, thanks for the pointers. I'll have a look and see how I could use
> > > > > > that. I guess that's about getting access to the FW slots with some
> > > > > > sort of priority+FIFO ordering guarantees given by TTM. If that's the
> > > > > > case, I'll have to think about it, because that's a major shift from
> > > > > > what we're doing now, and I'm afraid this could lead to starving
> > > > > > non-resident entities if all resident entities keep receiving new jobs
> > > > > > to execute. Unless we put some sort of barrier when giving access to a
> > > > > > slot, so we evict the entity when it's done executing the stuff it had
> > > > > > when it was given access to this slot. But then, again, there are other
> > > > > > constraints to take into account for the Arm Mali CSF case:
> > > > > >
> > > > > > - it's more efficient to update all FW slots at once, because each
> > > > > >   update of a slot might require updating priorities of the other slots
> > > > > >   (FW mandates unique slot priorities, and those priorities depend on
> > > > > >   the entity priority/queue-ordering)
> > > > > > - context/FW slot switches have a non-negligible cost (FW needs to
> > > > > >   suspend the context and save the state every time there is such a
> > > > > >   switch), so, limiting the number of FW slot updates might prove
> > > > > >   important
> > > > >
> > > > > I frankly think you're overworrying. When you have 31+ contexts running at
> > > > > the same time, you have bigger problems. At that point there's two
> > > > > use-cases:
> > > > > 1. system is overloaded, the user will reach for reset button anyway
> > > > > 2. temporary situation, all you have to do is be roughly fair enough to get
> > > > >    through it before case 1 happens.
> > > > >
> > > > > Trying to write a perfect scheduler for this before we have actual
> > > > > benchmarks that justify the effort seems like pretty serious overkill.
> > > > > That's why I think the simplest solution is the one we should have:
> > > > >
> > > > > - drm/sched frontend. If you get into slot exhaustion that alone will
> > > > >   ensure enough fairness
> > > >
> > > > We're talking about the CS ring buffer slots here, right?
> > > >
> > > > >
> > > > > - LRU list of slots, with dma_fence so you can pipeline/batch up changes
> > > > >   as needed (but I honestly wouldn't worry about the batching before
> > > > >   you've shown an actual need for this in some benchmark/workload, even
> > > > >   piglit shouldn't have this many things running concurrently I think, you
> > > > >   don't have that many cpu cores). Between drm/sched and the lru you will
> > > > >   have an emergent scheduler that cycles through all runnable gpu jobs.
> > > > >
> > > > > - If you want to go fancy, have eviction tricks like skipping currently
> > > > >   still active gpu contexts with higher priority than the one that you need
> > > > >   to find a slot for.
> > > > >
> > > > > - You don't need time slicing in this, not even for compute. Compute is
> > > > >   done with preempt context fences, if you give them a minimum scheduling
> > > > >   quanta you'll have a very basic round robin scheduler as an emergent
> > > > >   thing.
> > > > >
> > > > > Any workload where it matters will be scheduled by the fw directly, with
> > > > > drm/sched only being the dma_fence dependency sorter. My take is that if you
> > > > > spend more than a hundred or so lines with slot allocation logic
> > > > > (excluding the hw code to load/unload a slot) you're probably doing some
> > > > > serious overengineering.
> > > >
> > > > Let me see if I got this right:
> > > >
> > > > - we still keep a 1:1 drm_gpu_scheduler:drm_sched_entity approach,
> > > >   where hw_submission_limit == available_slots_in_ring_buf
> > > > - when ->run_job() is called, we write the RUN_JOB() instruction
> > > >   sequence to the next available ringbuf slot and queue the entity to
> > > >   the FW-slot queue
> > > >   * if a slot is directly available, we program the slot directly
> > > >   * if no slots are available, but some slots are done with the jobs
> > > >     they were given (last job fence signaled), we evict the LRU entity
> > > >     (possibly taking priority into account) and use this slot for the
> > > >     new entity
> > > >   * if no slots are available and all currently assigned slots
> > > >     contain busy entities, we queue the entity to a pending list
> > > >     (possibly one list per prio)
> >
> > You could also handle this in ->prepare_job, which is called after all the
> > default fences have signalled. That allows you to put the "wait for a
> > previous job to finish/unload" behind a dma_fence, which is how (I think
> > at least) you can get the round-robin emergent behaviour: If there's no
> > idle slot, you just pick all the fences from the currently busy job you
> > want to steal the slot from (with priority and lru taken into account),
> > let the scheduler wait for that to finish, and then it'll call your
> > run_job when the slot is already available.
>
> Ah, nice! It would also avoid queuing new jobs to a resident entity
> when others are waiting for a FW slot, even if, in practice, I'm not
> sure we should do that: context will be suspended when the group is
> evicted anyway, and things could keep running in the meantime.
> I'll give it a try, thanks for the suggestion!
>
> >
> > Also if you do the allocation in ->prepare_job with dma_fence and not
> > run_job, then I think we can sort out fairness issues (if they do pop up) in
> > the drm/sched code instead of having to think about this in each driver.
>
> By allocation, you mean assigning a FW slot ID? If we do this allocation
> in ->prepare_job(), couldn't we mess up ordering? Like,
> lower-prio/later-queuing entity being scheduled before its peers,
> because there's no guarantee on the job completion order (and thus the
> queue idleness order). I mean, completion order depends on the kind of
> job being executed by the queues, the time the FW actually lets the
> queue execute things and probably other factors. You can use metrics
> like the position in the LRU list + the amount of jobs currently
> queued to a group to guess which one will be idle first, but that's
> just a guess. And I'm not sure I see what doing this slot selection in
> ->prepare_job() would bring us compared to doing it in ->run_job(),
> where we can just pick the least recently used slot.

In ->prepare_job you can let the scheduler code do the stalling (and
ensure fairness), in ->run_job it's your job. The current RFC doesn't
really bother much with getting this very right, but if the scheduler
code tries to make sure it pushes higher-prio stuff in first before
others, you should get the right outcome.

The more important functional issue is that you must only allocate the
fw slot after all dependencies have signalled. Otherwise you might get
a nice deadlock, where job A is waiting for the fw slot of B to become
free, and B is waiting for A to finish.
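
To make that concrete, here's a minimal sketch of what I have in mind
(the my_* names are hypothetical, this is not code from any driver).
Since drm/sched only calls ->prepare_job after the job's regular
dependency fences have signalled, the slot fence returned here can't
invert against a dependency:

/*
 * Sketch only: ->prepare_job() runs after all dependency fences have
 * signalled, so returning a "fw slot available" fence here cannot
 * create the A<->B deadlock described above.
 */
static struct dma_fence *
my_sched_prepare_job(struct drm_sched_job *sched_job,
		     struct drm_sched_entity *s_entity)
{
	struct my_group *group = to_my_group(s_entity);

	/* Already resident on a fw slot, nothing to wait for. */
	if (group->slot >= 0)
		return NULL;

	/*
	 * Park the group on the slot manager's waiter list. The
	 * returned fence signals once a slot has been reserved for
	 * this group (freed up by eviction or idling), so the next
	 * ->prepare_job call sees group->slot >= 0 and returns NULL.
	 */
	return my_slot_mgr_get_slot_fence(group);
}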

> > Few fw sched slots essentially just make fw scheduling unfairness more
> > prominent than with plenty of slots, but I don't think it's fundamentally something
> > else really.
> >
> > If every ctx does that and the lru isn't too busted, they should then form
> > a nice orderly queue and cycle through the fw scheduler, while still being
> > able to get some work done. It's essentially the exact same thing that
> > happens with ttm vram eviction, when you have a total working set where
> > each process fits in vram individually, but in total they're too big and
> > you need to cycle things through.
>
> I see.
>
> >
> > > > I'll need to make sure this still works with the concept of group (it's
> > > > not a single queue we schedule, it's a group of queues, meaning that we
> > > > have N fences to watch to determine if the slot is busy or not, but
> > > > that should be okay).
> > >
> > > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > > not entirely fair, it does take the slot priority (which has to be
> > > unique across all currently assigned slots) into account when
> > > scheduling groups. So, ideally, we'd want to rotate group priorities
> > > when they share the same drm_sched_priority (probably based on the
> > > position in the LRU).
> >
> > Hm that will make things a bit more fun I guess, especially with your
> > constraint to not update this too often. How strict is that priority
> > difference? If it's a lot, we might need to treat this more like execlist
> > and less like a real fw scheduler ...
>
> Strict as in, if two groups with same priority try to request an
> overlapping set of resources (cores or tilers), it can deadlock, so
> pretty strict I would say :-).

So it first finishes all the higher priority tasks and only then it
runs the next one, so no round-robin? Or am I just confused about what
this is all about? Or is it more that the order in the group determines how
it tries to schedule on the hw, and if the earlier job needs hw that
also the later one needs, then the earlier one has to finish first?
Which would still mean that for these overlapping cases there's just
no round-robin in the fw scheduler at all.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-12 15:38                                   ` Daniel Vetter
@ 2023-01-12 16:48                                     ` Boris Brezillon
  -1 siblings, 0 replies; 161+ messages in thread
From: Boris Brezillon @ 2023-01-12 16:48 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel, Jason Ekstrand

On Thu, 12 Jan 2023 16:38:18 +0100
Daniel Vetter <daniel@ffwll.ch> wrote:

> > >
> > > Also if you do the allocation in ->prepare_job with dma_fence and not
> > > run_job, then I think we can sort out fairness issues (if they do pop up) in
> > > the drm/sched code instead of having to think about this in each driver.  
> >
> > By allocation, you mean assigning a FW slot ID? If we do this allocation
> > in ->prepare_job(), couldn't we mess up ordering? Like,
> > lower-prio/later-queuing entity being scheduled before its peers,
> > because there's no guarantee on the job completion order (and thus the
> > queue idleness order). I mean, completion order depends on the kind of
> > job being executed by the queues, the time the FW actually lets the
> > queue execute things and probably other factors. You can use metrics
> > like the position in the LRU list + the amount of jobs currently
> > queued to a group to guess which one will be idle first, but that's
> > just a guess. And I'm not sure I see what doing this slot selection in  
> > ->prepare_job() would bring us compared to doing it in ->run_job(),  
> > where we can just pick the least recently used slot.  
> 
> In ->prepare_job you can let the scheduler code do the stalling (and
> ensure fairness), in ->run_job it's your job.

Yeah returning a fence in ->prepare_job() to wait for a FW slot to
become idle sounds good. This fence would be signaled when one of the
slots becomes idle. But I'm wondering why we'd want to select the slot
so early. Can't we just do the selection in ->run_job()? After all, if
the fence has been signaled, that means we'll find at least one slot
that's ready when we hit ->run_job(), and we can select it at that
point.
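
Something like this is what I'm picturing for the ->run_job() side (a
rough sketch, all my_* names made up):

/*
 * Sketch: ->prepare_job() only guaranteed that *some* slot would be
 * free; ->run_job() picks the concrete one, e.g. the least recently
 * used idle slot, at submission time.
 */
static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_group *group = to_my_group(sched_job->entity);
	struct my_job *job = to_my_job(sched_job);

	if (group->slot < 0) {
		/* Can't fail: the prepare_job() fence has signalled. */
		struct my_slot *slot = my_slot_lru_pop_idle(group->dev);

		my_fw_load_group(slot, group);
	}

	my_ring_emit_run_job(group, job);
	return dma_fence_get(job->done_fence);
}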

> The current RFC doesn't
> really bother much with getting this very right, but if the scheduler
> code tries to make sure it pushes higher-prio stuff in first before
> others, you should get the right outcome.

Okay, so I'm confused again. We said we had a 1:1
drm_gpu_scheduler:drm_sched_entity mapping, meaning that entities are
isolated from each other. I can see how I could place the dma_fence
returned by ->prepare_job() in a driver-specific per-priority list, so
the driver can pick the highest-prio/first-inserted entry and signal the
associated fence when a slot becomes idle. But I have a hard time
seeing how common code could do that if it doesn't see the other
entities. Right now, drm_gpu_scheduler only selects the best entity
among the registered ones, and there's only one entity per
drm_gpu_scheduler in this case.
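
For reference, the driver-side bookkeeping I'm thinking of would look
something like this (hypothetical sketch, made-up names):

/*
 * Fences returned by ->prepare_job() are parked on per-priority
 * lists; the slot-idle path signals the highest-prio/first-inserted
 * waiter. Common drm_sched code can't do this for us, since each 1:1
 * scheduler only sees its own entity.
 */
struct my_slot_waiter {
	struct dma_fence base;
	struct list_head node;
};

static void my_slot_became_idle(struct my_device *dev)
{
	int prio;

	lockdep_assert_held(&dev->slot_lock);

	for (prio = MY_PRIO_COUNT - 1; prio >= 0; prio--) {
		struct my_slot_waiter *waiter;

		waiter = list_first_entry_or_null(&dev->slot_waiters[prio],
						  struct my_slot_waiter,
						  node);
		if (waiter) {
			list_del(&waiter->node);
			dma_fence_signal(&waiter->base);
			return;
		}
	}
}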

> 
> The more important functional issue is that you must only allocate the
> fw slot after all dependencies have signalled.

Sure, but it doesn't have to be a specific FW slot, it can be any FW
slot, as long as we don't signal more fences than we have slots
available, right?

> Otherwise you might get
> a nice deadlock, where job A is waiting for the fw slot of B to become
> free, and B is waiting for A to finish.

Got that part, and that's ensured by the fact we wait for all
regular deps before returning the FW-slot-available dma_fence in
->prepare_job(). This exact same fence will be signaled when a slot
becomes idle.

> 
> > > Few fw sched slots essentially just make fw scheduling unfairness more
> > > prominent than with others, but I don't think it's fundamentally something
> > > else really.
> > >
> > > If every ctx does that and the lru isn't too busted, they should then form
> > > a nice orderly queue and cycle through the fw scheduler, while still being
> > > able to get some work done. It's essentially the exact same thing that
> > > happens with ttm vram eviction, when you have a total working set where
> > > each process fits in vram individually, but in total they're too big and
> > > you need to cycle things through.  
> >
> > I see.
> >  
> > >  
> > > > > I'll need to make sure this still works with the concept of group (it's
> > > > > not a single queue we schedule, it's a group of queues, meaning that we
> > > > > have N fences to watch to determine if the slot is busy or not, but
> > > > > that should be okay).  
> > > >
> > > > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > > > not entirely fair, it does take the slot priority (which has to be
> > > > unique across all currently assigned slots) into account when
> > > > scheduling groups. So, ideally, we'd want to rotate group priorities
> > > > when they share the same drm_sched_priority (probably based on the
> > > > position in the LRU).  
> > >
> > > Hm that will make things a bit more fun I guess, especially with your
> > > constraint to not update this too often. How strict is that priority
> > > difference? If it's a lot, we might need to treat this more like execlist
> > > and less like a real fw scheduler ...  
> >
> > Strict as in, if two groups with same priority try to request an
> > overlapping set of resources (cores or tilers), it can deadlock, so
> > pretty strict I would say :-).  
> 
> So it first finishes all the higher priority tasks and only then it
> runs the next one, so no round-robin? Or am I just confused what this
> all is about. Or is it more that the order in the group determines how
> it tries to schedule on the hw, and if the earlier job needs hw that
> also the later one needs, then the earlier one has to finish first?
> Which would still mean that for these overlapping cases there's just
> no round-robin in the fw scheduler at all.

Okay, so my understanding is: FW scheduler always takes the highest
priority when selecting between X groups requesting access to a
resource, but if 2 groups want the same resource and have the same
priority, there's no ordering guarantee. The deadlock happens when both
group A and B claim resources X and Y. Group A might get resource X
and group B might get resource Y, both waiting for the other resource
they claimed. If they have different priorities one of them would back
off and let the other run, if they have the same priority, none of them
would, and that's where the deadlock comes from. Note that we don't
control the order resources get acquired from the CS, so there's no way
to avoid this deadlock without assigning different priorities.

And you're right, if you pick different priorities, the only time lower
priority groups get to run is when the highest priority group is
waiting on an asynchronous operation to complete (can be a
compute/frag/tiler job completion, some inter queue synchronization,
waiting for an already acquired resource, ...), or when it's idle. I
suspect queues from different groups can run concurrently if there's
enough command-stream processing slots available, and those groups
request resources that don't overlap, but I'm speculating here. So, no
round-robin if slots are assigned unique priorities. Not even sure
scheduling is time-slice based to be honest, it could be some
cooperative scheduling where groups with the same priorities get to
wait for the currently running group to be blocked to get access to
the HW. In any case, there's no easy way to prevent deadlocks if we
don't assign unique priorities.
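
If we end up having to assign unique priorities, the simplest scheme I
can think of (sketch below, hypothetical helpers) is to re-derive the
slot priorities from the LRU order every time the set of resident
groups changes:

/*
 * Walk the resident-group LRU and assign strictly decreasing fw slot
 * priorities, so no two resident groups ever share a priority (which,
 * per the above, could deadlock on overlapping resource requests).
 */
static void my_rotate_slot_priorities(struct my_device *dev)
{
	struct my_group *group;
	u8 prio = MY_MAX_FW_SLOT_PRIO;

	lockdep_assert_held(&dev->sched_lock);

	list_for_each_entry(group, &dev->resident_lru, lru_node)
		my_fw_set_slot_priority(group->slot, prio--);
}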

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-12  9:54       ` Lucas De Marchi
@ 2023-01-12 17:10         ` Matthew Brost
  -1 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-12 17:10 UTC (permalink / raw)
  To: Lucas De Marchi; +Cc: Tvrtko Ursulin, intel-gfx, dri-devel

On Thu, Jan 12, 2023 at 10:54:25AM +0100, Lucas De Marchi wrote:
> On Thu, Jan 05, 2023 at 09:27:57PM +0000, Matthew Brost wrote:
> > On Tue, Jan 03, 2023 at 12:21:08PM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 22/12/2022 22:21, Matthew Brost wrote:
> > > > Hello,
> > > >
> > > > This is a submission for Xe, a new driver for Intel GPUs that supports both
> > > > integrated and discrete platforms starting with Tiger Lake (first platform with
> > > > Intel Xe Architecture). The intention of this new driver is to have a fresh base
> > > > to work from that is unencumbered by older platforms, whilst also taking the
> > > > opportunity to rearchitect our driver to increase sharing across the drm
> > > > subsystem, both leveraging and allowing us to contribute more towards other
> > > > shared components like TTM and drm/scheduler. The memory model is based on VM
> > > > bind which is similar to the i915 implementation. Likewise the execbuf
> > > > implementation for Xe is very similar to execbuf3 in the i915 [1].
> > > >
> > > > The code is at a stage where it is already functional and has experimental
> > > > support for multiple platforms starting from Tiger Lake, with initial support
> > > > implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> > > > as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> > > > implementation will be released publicly early next year. We also have a suite
> > > > of IGTs for XE that will appear on the IGT list shortly.
> > > >
> > > > It has been built with the assumption of supporting multiple architectures from
> > > > the get-go, right now with tests running both on X86 and ARM hosts. And we
> > > > intend to continue working on it and improving on it as part of the kernel
> > > > community upstream.
> > > >
> > > > The new Xe driver leverages a lot from i915 and work on i915 continues as we
> > > > ready Xe for production throughout 2023.
> > > >
> > > > As for display, the intent is to share the display code with the i915 driver so
> > > > that there is maximum reuse there. Currently this is being done by compiling the
> > > > display code twice, but alternatives to that are under consideration and we want
> > > > to have more discussion on what the best final solution will look like over the
> > > > next few months. Right now, work is ongoing in refactoring the display codebase
> > > > to remove as much as possible any unnecessary dependencies on i915 specific data
> > > > structures there..
> > > >
> > > > We currently have 2 submission backends, execlists and GuC. The execlist is
> > > > meant mostly for testing and is not fully functional while GuC backend is fully
> > > > functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> > > > required and should be placed in /lib/firmware/xe.
> > > 
> > > What is the plan going forward for the execlists backend? I think it would
> > > be preferable to not upstream something semi-functional and so to carry
> > > technical debt in the brand new code base, from the very start. If it is for
> > > Tigerlake, which is the starting platform for Xe, could it be made GuC only
> > > Tigerlake for instance?
> > > 
> > 
> > A little background here. In the original PoC written by Jason and Dave,
> > the execlist backend was the only one present and it was in a semi-working
> > state. As soon as myself and a few others started working on Xe we went
> > all in on the GuC backend. We left the execlist backend basically in
> > the state it was in. We left it in place for 2 reasons.
> > 
> > 1. Having 2 backends from the start ensured we layered our code
> > correctly. The layering was a complete disaster in the i915 so we really
> > wanted to avoid that.
> > 2. The thought was it might be needed for early product bring-up one
> > day.
> > 
> > As I think about this a bit more, we will likely just delete the execlist
> > backend before merging this upstream and perhaps just carry 1 large patch
> > internally with this implementation that we can use as needed. Final
> > decision TBD though.
> 
> but that might regress after some time on "let's keep 2 backends so we
> layer the code correctly". Leaving the additional backend behind
> CONFIG_BROKEN or XE_EXPERIMENTAL, or something like that, not
> enabled by distros, but enabled in CI would be a good idea IMO.
> 
> Carrying a large patch out of tree would make things harder for new
> platforms. A perfect backend split would make it possible, but like I
> said, we are likely not to have it if we delete the second backend.
> 

Good points here Lucas. One thing that we absolutely have wrong is
falling back to execlists if GuC firmware is missing. We definitely
should not be doing that, as it creates confusion.

I kinda like the idea of hiding it behind a config option plus a module
parameter to select the backend, so you really, really have to go out of
your way to use it, and having it in the code keeps us disciplined in our
layering. At some point we will likely add another supported backend, and
at that point we may decide to delete this one.
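
Roughly something like this (just a sketch of the idea, hypothetical
names, not actual Xe code):

/*
 * The execlist backend only exists when a (hypothetical) Kconfig
 * option is enabled, and even then it must be explicitly requested
 * via a module parameter. Missing GuC firmware becomes a hard probe
 * error instead of a silent fallback.
 */
static bool force_execlist;
module_param(force_execlist, bool, 0444);
MODULE_PARM_DESC(force_execlist,
		 "Use the experimental execlist submission backend");

static int xe_submission_init(struct xe_device *xe)
{
	if (IS_ENABLED(CONFIG_DRM_XE_EXPERIMENTAL_EXECLIST) &&
	    force_execlist)
		return xe_execlist_init(xe);

	/* No silent fallback: missing GuC firmware is a hard error. */
	return xe_guc_submission_init(xe);
}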

Matt

> Lucas De Marchi
> 
> > 
> > Matt
> > 
> > > Regards,
> > > 
> > > Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 17:52                             ` Matthew Brost
@ 2023-01-12 18:21                               ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-12 18:21 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, Jason Ekstrand, dri-devel



On 11/01/2023 17:52, Matthew Brost wrote:
> On Wed, Jan 11, 2023 at 09:09:45AM +0000, Tvrtko Ursulin wrote:

[snip]

>> Anyway, since you are not buying any arguments on paper perhaps you are more
>> open towards testing. If you would adapt gem_wsim for Xe you would be able
>> to spawn N simulated transcode sessions on any Gen11+ machine and try it
>> out.
>>
>> For example:
>>
>> gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600
>>
>> That will run you 36 parallel transcoding sessions streams for 600 frames
>> each. No client setup needed whatsoever apart from compiling IGT.
>>
>> In the past that was quite a handy tool to identify scheduling issues, or
>> validate changes against. All workloads with the media prefix have actually
>> been hand crafted by looking at what real media pipelines do with real data.
>> Few years back at least.
>>
> 
> Porting this is non-trivial as this is 2.5k. Also in Xe we are trending
> to use UMD benchmarks to determine if there are performance problems as
> in the i915 we had tons of microbenchmarks / IGT benchmarks that we found
> meant absolutely nothing. Can't say if this benchmark falls into that
> category.

I explained what it does so it was supposed to be obvious it is not a 
micro benchmark.

2.5k what, lines of code? Difficulty of adding Xe support does not scale 
with LOC but with how much it uses the kernel API. You'd essentially 
need to handle context/engine creation and different execbuf.

It's not trivial, no, but it would save you downloading gigabytes of test 
streams, building a bunch of tools and libraries etc, and so overall in 
my experience it *significantly* improves the driver development 
turn-around time.

> We have VK and compute benchmarks running and haven't found any major issues
> yet. The media UMD hasn't been ported because of the VM bind dependency
> so I can't say if there are any issues with the media UMD + Xe.
> 
> What I can do is hack up xe_exec_threads to really hammer Xe - change it to
> 128x xe_engines + 8k execs per thread. Each exec is super simple, it
> just stores a dword. It creates a thread per hardware engine, so on TGL
> this is 5x threads.
> 
> Results below:
> root@DUT025-TGLU:mbrost# xe_exec_threads --r threads-basic
> IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
> Starting subtest: threads-basic
> Subtest threads-basic: SUCCESS (1.215s)
> root@DUT025-TGLU:mbrost# dumptrace | grep job | wc
>    40960  491520 7401728
> root@DUT025-TGLU:mbrost# dumptrace | grep engine | wc
>      645    7095   82457
> 
> So with 640 xe_engines (5x are VM engines) it takes 1.215 seconds test
> time to run 40960 execs. That seems to indicate we do not have a
> scheduling problem.
> 
> This is 8 core (or at least 8 threads) TGL:
> 
> root@DUT025-TGLU:mbrost# cat /proc/cpuinfo
> ...
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 140
> model name      : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
> stepping        : 1
> microcode       : 0x3a
> cpu MHz         : 2344.098
> cache size      : 12288 KB
> physical id     : 0
> siblings        : 8
> core id         : 3
> cpu cores       : 4
> ...
> 
> Enough data to be convinced there is no issue with this design? I can
> also hack up Xe to use fewer GPU schedulers with kthreads but again that
> isn't trivial and doesn't seem necessary based on these results.

Not yet. It's not only about how many somethings per second you can do. 
It is also about what effect it creates on the rest of the system.

Anyway, I think you said in a different sub-thread that you will move 
away from system_wq, so we can close this one. With that plan at least 
I don't have to worry that my mouse will stutter and audio will glitch 
while Xe is churning away.
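
(To spell out what I mean by moving away from system_wq - a sketch 
only, all names made up, not actual Xe code:

  /*
   * A dedicated workqueue for scheduler work instead of piggybacking
   * on the shared system_wq, so a burst of submissions cannot starve
   * unrelated system_wq users.
   */
  struct workqueue_struct *xe_sched_wq =
          alloc_workqueue("xe-sched", WQ_UNBOUND, 0);

  queue_work(xe_sched_wq, &sched->work_run_job);

Whether it should be unbound, bound, or per-device is exactly the sort 
of thing that would need measuring.)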

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 19:40                           ` Matthew Brost
@ 2023-01-12 18:43                             ` Tvrtko Ursulin
  -1 siblings, 0 replies; 161+ messages in thread
From: Tvrtko Ursulin @ 2023-01-12 18:43 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel, Jason Ekstrand



On 11/01/2023 19:40, Matthew Brost wrote:
> On Wed, Jan 11, 2023 at 08:50:37AM +0000, Tvrtko Ursulin wrote:

[snip]

>> This example is where it would hurt on large systems. Imagine only an even
>> wider media transcode card...
>>
>> Second example is only a single engine class used (3d desktop?) but with a
>> bunch of not-runnable jobs queued and waiting on a fence to signal. Implicit
>> or explicit dependencies don't matter. Then the fence signals and
>> callbacks run. N work items get scheduled, but they all submit to the same HW
>> engine. So we end up with:
>>
>>          /-- wi1 --\
>>         / ..     .. \
>>   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
>>         \ ..    ..  /
>>          \-- wiN --/
>>
>>
>> All that we have achieved is waking up N CPUs to contend on the same lock
>> and effectively insert the job into the same single HW queue. I don't see
>> any positives there.
>>
> 
> I've said this before, the CT channel in practice isn't going to be full
> so the section of code protected by the mutex is really, really small.
> The mutex really shouldn't ever have contention. Also, does a mutex spin
> for a small period of time before going to sleep? I seem to recall some
> type of core lock did this; if we can use a lock that spins for a short
> period of time this argument falls apart.

This argument already fell apart when we established it's the system_wq 
and not the unbound one. So this is only a digression - it would not 
have fallen apart because of the CT channel never being congested; there 
would still be the question of what the point is of waking up N CPUs 
when there is a single work channel in the backend.

You would have been able to bypass all that by inserting work items 
directly, not via the scheduler workers. I thought that was what Jason 
was implying when he mentioned that a better frontend/backend drm 
scheduler split was considered at some point.

Because for 1:1:1, where GuC is truly 1, it does seem it would work 
better if that sort of split enabled you to queue directly into the 
backend, bypassing the kthread/worker wait_on/wake_up dance.

Would that work? From drm_sched_entity_push_job directly to the backend 
- not waking up but *calling* the equivalent of drm_sched_main.
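
Something like this is what I have in mind - a sketch only, with 
hypothetical names, not the actual drm_sched API:

  /*
   * Hypothetical 1:1 fast path: instead of queuing a work item or
   * waking a kthread, hand the ready job straight to the backend.
   * Dependency resolution and locking elided.
   */
  static void sched_push_job_direct(struct drm_sched_job *job)
  {
          struct drm_gpu_scheduler *sched = job->sched;
          struct dma_fence *hw_fence;

          /* What drm_sched_main would have done, minus the worker. */
          hw_fence = sched->ops->run_job(job);
          /* Then install completion callbacks on hw_fence as usual. */
          dma_fence_put(hw_fence);
  }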

>> Right, that's all solid I think. My takeaway is that frontend priority
>> sorting and that stuff isn't needed and that is okay. And that there are
>> multiple options to maybe improve drm scheduler, like the aforementioned
>> making it deal with out of order, or split into functional components, or
>> split frontend/backend what you suggested. For most of them cost vs benefit
>> is more or less not completely clear, neither how much effort was invested
>> to look into them.
>>
>> One thing I missed from this explanation is how drm_scheduler per engine
>> class interferes with the high level concepts. And I did not manage to pick
>> up on what exactly is the TDR problem in that case. Maybe the two are one
>> and the same.
>>
>> Bottom line is I still have the concern that conversion to kworkers has an
>> opportunity to regress. Possibly more opportunity for some Xe use cases than
>> to affect other vendors, since they would still be using per physical engine
>> / queue scheduler instances.
>>
> 
> We certainly don't want to affect other vendors but I haven't yet heard
> any push back from other vendors. I don't think speculating about
> potential problems is helpful.

I haven't had any push back on the drm cgroup controller either. :D

>> And to put my money where my mouth is I will try to put testing Xe inside
>> the full blown ChromeOS environment in my team plans. It would probably also
>> be beneficial if Xe team could take a look at real world behaviour of the
>> extreme transcode use cases too. If the stack is ready for that and all. It
>> would be better to know earlier rather than later if there is a fundamental
>> issue.
>>
> 
> We don't have a media UMD yet, so it will be tough to test at this point
> in time. Also not sure when Xe is going to be POR for a Chrome product
> either, so porting Xe into ChromeOS likely isn't a top priority for your
> team. I know from experience that porting things into ChromeOS isn't
> trivial as I've supported several of these efforts. Not saying don't do
> this, just mentioning the realities of what you are suggesting.

I know, I only said I'd put it in the plans, not that it will happen 
tomorrow.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-11 22:56                               ` Jason Ekstrand
  (?)
@ 2023-01-13  0:39                               ` John Harrison
  2023-01-18  3:06                                   ` Matthew Brost
  -1 siblings, 1 reply; 161+ messages in thread
From: John Harrison @ 2023-01-13  0:39 UTC (permalink / raw)
  To: Jason Ekstrand, Matthew Brost; +Cc: intel-gfx, dri-devel

[-- Attachment #1: Type: text/plain, Size: 38093 bytes --]

On 1/11/2023 14:56, Jason Ekstrand wrote:
> On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost 
> <matthew.brost@intel.com> wrote:
>
>     On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
>     > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
>     > tvrtko.ursulin@linux.intel.com> wrote:
>     >
>     > >
>     [snip]
>     > >
>     > > Typically is the key here. But I am not sure it is good
>     enough. Consider
>     > > this example - Intel Flex 170:
>     > >
>     > >   * Delivers up to 36 streams 1080p60 transcode throughput per
>     card.
>     > >   * When scaled to 10 cards in a 4U server configuration, it
>     can support
>     > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
>     > >
>     >
>     > I had a feeling it was going to be media.... 😅
>     >
>
>     Yea, wondering if the media UMD can be rewritten to use fewer
>     xe_engines; it is a massive rewrite for VM bind + no implicit
>     dependencies, so let's just pile on some more work?
>
>
> It could probably use fewer than it does today.  It currently creates 
> and throws away contexts like crazy, or did last I looked at it.  
> However, the nature of media encode is that it often spreads across 
> two or three different types of engines.  There's not much you can do 
> to change that.
And as per Tvrtko's example, you get media servers that transcode huge 
numbers of tiny streams in parallel. Almost no work per frame but 100s 
of independent streams being run concurrently. That means many 100s of 
contexts all trying to run at 30fps. I recall a specific bug about 
thundering herds - hundreds (thousands?) of waiting threads all being 
woken up at once because some request had completed.

>     >
>     > > One transcode stream from my experience typically is 3-4 GPU
>     contexts
>     > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
>     a single
>     > > CPU thread. 4 contexts * 36 streams = 144 active contexts.
>     Multiply by
>     > > 60fps = 8640 jobs submitted and completed per second.
>     > >
>     > > 144 active contexts in the proposed scheme means possibly
>     means 144
>     > > kernel worker threads spawned (driven by 36 transcode CPU
>     threads). (I
>     > > don't think the pools would scale down given all are
>     constantly pinged
>     > > at 60fps.)
>     > >
>     > > And then each of 144 threads goes to grab the single GuC CT
>     mutex. First
>     > > threads are being made schedulable, then put to sleep as mutex
>     > > contention is hit, then woken again as mutexes are getting
>     released,
>     > > rinse, repeat.
>     > >
>     >
>     > Why is every submission grabbing the GuC CT mutex? I've not read
>     the GuC
>     > back-end yet but I was under the impression that most run_job()
>     would be
>     > just shoving another packet into a ring buffer.  If we have to
>     send the GuC
>     > a message on the control ring every single time we submit a job,
>     that's
>     > pretty horrible.
>     >
>
>     Run job writes the ring buffer and moves the tail as the first step
>     (no lock required). Next it needs to tell the GuC the xe_engine LRC
>     tail has moved; this is done through a single Host to GuC channel,
>     which is a circular buffer, with writes to the channel protected by
>     the mutex. There are a few more nuances too, but in practice there
>     is always space in the channel so the time the mutex needs to be
>     held is really, really small (check cached credits, write 3 dwords
>     in payload, write 1 dword to move tail). I also believe mutexes in
>     Linux are hybrid where they spin for a little bit before sleeping,
>     and certainly if there is space in the channel we shouldn't sleep on
>     mutex contention.
>
>
> Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we 
> had some way to batch things up a bit so we only have to poke the GuC 
> channel once for every batch of things rather than once per job.  
> That's maybe something we can look into as a future improvement; not 
> fundamental.
>
> Generally, though, it sounds like contention could be a real problem 
> if we end up ping-ponging that lock between cores.  It's going to 
> depend on how much work it takes to get the next ready thing vs. the 
> cost of that atomic.  But, also, anything we do is going to 
> potentially run into contention problems.  *shrug*  If we were going 
> to go for one-per-HW-engine, we may as well go one-per-device and then 
> we wouldn't need the lock.  Off the top of my head, that doesn't sound 
> great either but IDK.
>
>     As far as this being horrible, well, I didn't design the GuC and
>     this is how it is implemented for KMD based submission. We also have
>     256 doorbells so we wouldn't need a lock, but I think there are
>     other issues with that design too which need to be worked out in the
>     Xe2 / Xe3 timeframe.
>
>
> Yeah, not blaming you.  Just surprised, that's all.  How does it work 
> for userspace submission?  What would it look like if the kernel 
> emulated userspace submission?  Is that even possible?
>
> What are these doorbell things?  How do they play into it?
Basically a bank of MMIO space reserved per 'entity' where a write to 
that MMIO space becomes a named interrupt to GuC. You can assign each 
doorbell to a specific GuC context. So writing to that doorbell address 
is effectively the same as sending a SCHEDULE_CONTEXT H2G message from 
the KMD for that context. But the advantage is you ring the doorbell 
from user land with no call into the kernel at all. Or from within the 
kernel, you can do it without needing any locks at all. Problem is, we 
have 64K contexts in GuC but only 256 doorbells in the hardware. Fewer 
if using SRIOV. So the "per 'entity'" part becomes somewhat questionable 
as to exactly what the 'entity' is. And hence we just haven't bothered 
supporting them in Linux because a) no direct submission from user land 
yet, and b) as Matthew says, the entire chain of IOCTL from UMD to 
kernel to acquiring a lock and sending the H2G has generally been fast 
enough. The latency only becomes an issue for ULLS people but for them, 
even the doorbells from user space are too high a latency because that 
still potentially involves the GuC having to do some scheduling and 
context switch type action.
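
Conceptually it is just this - a sketch with made-up names, not the 
real GuC interface:

  /*
   * Hypothetical shape of a doorbell ring: each doorbell is a mapped
   * MMIO page assigned to one GuC context; a single posted write is
   * enough to tell the GuC "this context has new work", with no lock
   * taken and no H2G message built.
   */
  struct doorbell {
          void __iomem *page;     /* mapped doorbell MMIO page */
  };

  static void ring_doorbell(struct doorbell *db, u32 cookie)
  {
          writel(cookie, db->page);
  }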

John.


>     Also, if you see my follow up response, Xe is ~33k execs per second
>     with the current implementation on an 8 core (or maybe 8 thread) TGL
>     which seems fine to me.
>
>
> 33k exec/sec is about 500/frame which should be fine. 500 is a lot for 
> a single frame.  I typically tell game devs to shoot for dozens per 
> frame.  The important thing is that it stays low even with hundreds of 
> memory objects bound. (Xe should be just fine there.)
>
> --Jason
>
>     Matt
>
>     > --Jason
>     >
>     >
>     > (And yes this backend contention is there regardless of 1:1:1,
>     it would
>     > > require a different re-design to solve that. But it is just a
>     question
>     > > whether there are 144 contending threads, or just 6 with the
>     thread per
>     > > engine class scheme.)
>     > >
>     > > Then multiply all by 10 for a 4U server use case and you get
>     1440 worker
>     > > kthreads, yes 10 more CT locks, but contending on how many CPU
>     cores?
>     > > Just so they can grab a timeslice and maybe content on a mutex
>     as the
>     > > next step.
>     > >
>     > > This example is where it would hurt on large systems. Imagine
>     only an
>     > > even wider media transcode card...
>     > >
>     > > Second example is only a single engine class used (3d
>     desktop?) but with
>     > > a bunch of not-runnable jobs queued and waiting on a fence to
>     signal.
>     > > Implicit or explicit dependencies doesn't matter. Then the
>     fence signals
>     > > and call backs run. N work items get scheduled, but they all
>     submit to
>     > > the same HW engine. So we end up with:
>     > >
>     > >          /-- wi1 --\
>     > >         / ..     .. \
>     > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
>     > >         \ ..    ..  /
>     > >          \-- wiN --/
>     > >
>     > >
>     > > All that we have achieved is waking up N CPUs to contend on
>     the same
>     > > lock and effectively insert the job into the same single HW
>     queue. I
>     > > don't see any positives there.
>     > >
>     > > This example I think can particularly hurt small / low power
>     devices
>     > > because of needless waking up of many cores for no benefit.
>     Granted, I
>     > > don't have a good feel on how common this pattern is in practice.
>     > >
>     > > >
>     > > >     That
>     > > >     is the number which drives the maximum number of
>     not-runnable jobs
>     > > that
>     > > >     can become runnable at once, and hence spawn that many
>     work items,
>     > > and
>     > > >     in turn unbound worker threads.
>     > > >
>     > > >     Several problems there.
>     > > >
>     > > >     It is fundamentally pointless to have potentially that
>     many more
>     > > >     threads
>     > > >     than the number of CPU cores - it simply creates a
>     scheduling storm.
>     > > >
>     > > >     Unbound workers have no CPU / cache locality either and
>     no connection
>     > > >     with the CPU scheduler to optimize scheduling patterns.
>     This may
>     > > matter
>     > > >     either on large systems or on small ones. Whereas the
>     current design
>     > > >     allows for scheduler to notice userspace CPU thread
>     keeps waking up
>     > > the
>     > > >     same drm scheduler kernel thread, and so it can keep
>     them on the same
>     > > >     CPU, the unbound workers lose that ability and so 2nd
>     CPU might be
>     > > >     getting woken up from low sleep for every submission.
>     > > >
>     > > >     Hence, apart from being a bit of a impedance mismatch,
>     the proposal
>     > > has
>     > > >     the potential to change performance and power patterns
>     and both large
>     > > >     and small machines.
>     > > >
>     > > >
>     > > > Ok, thanks for explaining the issue you're seeing in more
>     detail.  Yes,
>     > > > deferred kwork does appear to mismatch somewhat with what
>     the scheduler
>     > > > needs or at least how it's worked in the past.  How much
>     impact will
>     > > > that mismatch have?  Unclear.
>     > > >
>     > > >      >      >>> Secondly, it probably demands separate
>     workers (not
>     > > >     optional),
>     > > >      >     otherwise
>     > > >      >      >>> behaviour of shared workqueues has either
>     the potential
>     > > to
>     > > >      >     explode number
>     > > >      >      >>> kernel threads anyway, or add latency.
>     > > >      >      >>>
>     > > >      >      >>
>     > > >      >      >> Right now the system_unbound_wq is used which
>     does have a
>     > > >     limit
>     > > >      >     on the
>     > > >      >      >> number of threads, right? I do have a FIXME
>     to allow a
>     > > >     worker to be
>     > > >      >      >> passed in similar to TDR.
>     > > >      >      >>
>     > > >      >      >> WRT to latency, the 1:1 ratio could actually
>     have lower
>     > > >     latency
>     > > >      >     as 2 GPU
>     > > >      >      >> schedulers can be pushing jobs into the backend /
>     > > cleaning up
>     > > >      >     jobs in
>     > > >      >      >> parallel.
>     > > >      >      >>
>     > > >      >      >
>     > > >      >      > Thought of one more point here where why in Xe we
>     > > >     absolutely want
>     > > >      >     a 1 to
>     > > >      >      > 1 ratio between entity and scheduler - the way
>     we implement
>     > > >      >     timeslicing
>     > > >      >      > for preempt fences.
>     > > >      >      >
>     > > >      >      > Let me try to explain.
>     > > >      >      >
>     > > >      >      > Preempt fences are implemented via the generic
>     messaging
>     > > >      >     interface [1]
>     > > >      >      > with suspend / resume messages. If a suspend
>     messages is
>     > > >     received to
>     > > >      >      > soon after calling resume (this is per entity)
>     we simply
>     > > >     sleep in the
>     > > >      >      > suspend call thus giving the entity a
>     timeslice. This
>     > > >     completely
>     > > >      >     falls
>     > > >      >      > apart with a many to 1 relationship as now a
>     entity
>     > > >     waiting for a
>     > > >      >      > timeslice blocks the other entities. Could we
>     work aroudn
>     > > >     this,
>     > > >      >     sure but
>     > > >      >      > just another bunch of code we'd have to add in
>     Xe. Being to
>     > > >      >     freely sleep
>     > > >      >      > in backend without affecting other entities is
>     really,
>     > > really
>     > > >      >     nice IMO
>     > > >      >      > and I bet Xe isn't the only driver that is
>     going to feel
>     > > >     this way.
>     > > >      >      >
>     > > >      >      > Last thing I'll say regardless of how anyone
>     feels about
>     > > >     Xe using
>     > > >      >     a 1 to
>     > > >      >      > 1 relationship this patch IMO makes sense as I
>     hope we can
>     > > all
>     > > >      >     agree a
>     > > >      >      > workqueue scales better than kthreads.
>     > > >      >
>     > > >      >     I don't know for sure what will scale better and
>     for what use
>     > > >     case,
>     > > >      >     combination of CPU cores vs number of GPU engines
>     to keep
>     > > >     busy vs other
>     > > >      >     system activity. But I wager someone is bound to
>     ask for some
>     > > >      >     numbers to
>     > > >      >     make sure proposal is not negatively affecting
>     any other
>     > > drivers.
>     > > >      >
>     > > >      >
>     > > >      > Then let them ask.  Waving your hands vaguely in the
>     direction of
>     > > >     the
>     > > >      > rest of DRM and saying "Uh, someone (not me) might
>     object" is
>     > > >     profoundly
>     > > >      > unhelpful.  Sure, someone might. That's why it's on
>     dri-devel.
>     > > >     If you
>     > > >      > think there's someone in particular who might have a
>     useful
>     > > >     opinion on
>     > > >      > this, throw them in the CC so they don't miss the
>     e-mail thread.
>     > > >      >
>     > > >      > Or are you asking for numbers?  If so, what numbers
>     are you
>     > > >     asking for?
>     > > >
>     > > >     It was a heads up to the Xe team in case people weren't
>     appreciating
>     > > >     how
>     > > >     the proposed change has the potential influence power
>     and performance
>     > > >     across the board. And nothing in the follow up
>     discussion made me
>     > > think
>     > > >     it was considered so I don't think it was redundant to
>     raise it.
>     > > >
>     > > >     In my experience it is typical that such core changes
>     come with some
>     > > >     numbers. Which is in case of drm scheduler is tricky and
>     probably
>     > > >     requires explicitly asking everyone to test (rather than
>     count on
>     > > >     "don't
>     > > >     miss the email thread"). Real products can fail to ship
>     due ten mW
>     > > here
>     > > >     or there. Like suddenly an extra core prevented from
>     getting into
>     > > deep
>     > > >     sleep.
>     > > >
>     > > >     If that was "profoundly unhelpful" so be it.
>     > > >
>     > > >
>     > > > With your above explanation, it makes more sense what you're
>     asking.
>     > > > It's still not something Matt is likely to be able to
>     provide on his
>     > > > own.  We need to tag some other folks and ask them to test
>     it out.  We
>     > > > could play around a bit with it on Xe but it's not exactly
>     production
>     > > > grade yet and is going to hit this differently from most. 
>     Likely
>     > > > candidates are probably AMD and Freedreno.
>     > >
>     > > Whoever is setup to check out power and performance would be
>     good to
>     > > give it a spin, yes.
>     > >
>     > > PS. I don't think I was asking Matt to test with other
>     devices. To start
>     > > with I think Xe is a team effort. I was asking for more
>     background on
>     > > the design decision since patch 4/20 does not say anything on that
>     > > angle, nor later in the thread it was IMO sufficiently addressed.
>     > >
>     > > >      > Also, If we're talking about a design that might
>     paint us into an
>     > > >      > Intel-HW-specific hole, that would be one thing.  But
>     we're not.
>     > > >     We're
>     > > >      > talking about switching which kernel threading/task
>     mechanism to
>     > > >     use for
>     > > >      > what's really a very generic problem.  The core Xe
>     design works
>     > > >     without
>     > > >      > this patch (just with more kthreads).  If we land
>     this patch or
>     > > >      > something like it and get it wrong and it causes a
>     performance
>     > > >     problem
>     > > >      > for someone down the line, we can revisit it.
>     > > >
>     > > >     For some definition of "it works" - I really wouldn't
>     suggest
>     > > >     shipping a
>     > > >     kthread per user context at any point.
>     > > >
>     > > >
>     > > > You have yet to elaborate on why. What resources is it
>     consuming that's
>     > > > going to be a problem? Are you anticipating CPU affinity
>     problems? Or
>     > > > does it just seem wasteful?
>     > >
>     > > Well I don't know, commit message says the approach does not
>     scale. :)
>     > >
>     > > > I think I largely agree that it's probably
>     unnecessary/wasteful but
>     > > > reducing the number of kthreads seems like a tractable
>     problem to solve
>     > > > regardless of where we put the gpu_scheduler object.  Is
>     this the right
>     > > > solution?  Maybe not.  It was also proposed at one point
>     that we could
>     > > > split the scheduler into two pieces: A scheduler which owns
>     the kthread,
>     > > > and a back-end which targets some HW ring thing where you
>     can have
>     > > > multiple back-ends per scheduler.  That's certainly more
>     invasive from a
>     > > > DRM scheduler internal API PoV but would solve the kthread
>     problem in a
>     > > > way that's more similar to what we have now.
>     > > >
>     > > >      >     In any case that's a low level question caused by
>     the high
>     > > >     level design
>     > > >      >     decision. So I'd think first focus on the high
>     level - which
>     > > >     is the 1:1
>     > > >      >     mapping of entity to scheduler instance proposal.
>     > > >      >
>     > > >      >     Fundamentally it will be up to the DRM
>     maintainers and the
>     > > >     community to
>     > > >      >     bless your approach. And it is important to
>     stress 1:1 is
>     > > about
>     > > >      >     userspace contexts, so I believe unlike any other
>     current
>     > > >     scheduler
>     > > >      >     user. And also important to stress this
>     effectively does not
>     > > >     make Xe
>     > > >      >     _really_ use the scheduler that much.
>     > > >      >
>     > > >      >
>     > > >      > I don't think this makes Xe nearly as much of a
>     one-off as you
>     > > >     think it
>     > > >      > does.  I've already told the Asahi team working on
>     Apple M1/2
>     > > >     hardware
>     > > >      > to do it this way and it seems to be a pretty good
>     mapping for
>     > > >     them. I
>     > > >      > believe this is roughly the plan for nouveau as
>     well.  It's not
>     > > >     the way
>     > > >      > it currently works for anyone because most other
>     groups aren't
>     > > >     doing FW
>     > > >      > scheduling yet.  In the world of FW scheduling and
>     hardware
>     > > >     designed to
>     > > >      > support userspace direct-to-FW submit, I think the
>     design makes
>     > > >     perfect
>     > > >      > sense (see below) and I expect we'll see more drivers
>     move in this
>     > > >      > direction as those drivers evolve. (AMD is doing some
>     customish
>     > > >     thing
>     > > >      > for how with gpu_scheduler on the front-end somehow.
>     I've not dug
>     > > >     into
>     > > >      > those details.)
>     > > >      >
>     > > >      >     I can only offer my opinion, which is that the
>     two options
>     > > >     mentioned in
>     > > >      >     this thread (either improve drm scheduler to cope
>     with what is
>     > > >      >     required,
>     > > >      >     or split up the code so you can use just the parts of
>     > > >     drm_sched which
>     > > >      >     you want - which is frontend dependency tracking)
>     shouldn't
>     > > be so
>     > > >      >     readily dismissed, given how I think the idea was
>     for the new
>     > > >     driver to
>     > > >      >     work less in a silo and more in the community
>     (not do kludges
>     > > to
>     > > >      >     workaround stuff because it is thought to be too
>     hard to
>     > > >     improve common
>     > > >      >     code), but fundamentally, "goto previous
>     paragraph" for what
>     > > I am
>     > > >      >     concerned.
>     > > >      >
>     > > >      >
>     > > >      > Meta comment:  It appears as if you're falling into
>     the standard
>     > > >     i915
>     > > >      > team trap of having an internal discussion about what the
>     > > community
>     > > >      > discussion might look like instead of actually having the
>     > > community
>     > > >      > discussion.  If you are seriously concerned about
>     interactions
>     > > with
>     > > >      > other drivers or whether or setting common direction,
>     the right
>     > > >     way to
>     > > >      > do that is to break a patch or two out into a
>     separate RFC series
>     > > >     and
>     > > >      > tag a handful of driver maintainers.  Trying to
>     predict the
>     > > >     questions
>     > > >      > other people might ask is pointless. Cc them and
>     asking for their
>     > > >     input
>     > > >      > instead.
>     > > >
>     > > >     I don't follow you here. It's not an internal discussion
>     - I am
>     > > raising
>     > > >     my concerns on the design publicly. I am supposed to
>     write a patch to
>     > > >     show something, but am allowed to comment on a RFC series?
>     > > >
>     > > >
>     > > > I may have misread your tone a bit.  It felt a bit like too many
>     > > > discussions I've had in the past where people are trying to
>     predict what
>     > > > others will say instead of just asking them. Reading it
>     again, I was
>     > > > probably jumping to conclusions a bit.  Sorry about that.
>     > >
>     > > Okay no problem, thanks. In any case we don't have to keep
>     discussing
>     > > it, since I wrote one or two emails ago it is fundamentally on the
>     > > maintainers and community to ack the approach. I only felt
>     like RFC did
>     > > not explain the potential downsides sufficiently so I wanted
>     to probe
>     > > that area a bit.
>     > >
>     > > >     It is "drm/sched: Convert drm scheduler to use a work
>     queue rather
>     > > than
>     > > >     kthread" which should have Cc-ed _everyone_ who use drm
>     scheduler.
>     > > >
>     > > >
>     > > > Yeah, it probably should have.  I think that's mostly what
>     I've been
>     > > > trying to say.
>     > > >
>     > > >      >
>     > > >      >     Regards,
>     > > >      >
>     > > >      >     Tvrtko
>     > > >      >
>     > > >      >     P.S. And as a related side note, there are more
>     areas where
>     > > >     drm_sched
>     > > >      >     could be improved, like for instance priority
>     handling.
>     > > >      >     Take a look at msm_submitqueue_create /
>     > > >     msm_gpu_convert_priority /
>     > > >      >     get_sched_entity to see how msm works around the
>     drm_sched
>     > > >     hardcoded
>     > > >      >     limit of available priority levels, in order to
>     avoid having
>     > > >     to leave a
>     > > >      >     hw capability unused. I suspect msm would be
>     happier if they
>     > > >     could have
>     > > >      >     all priority levels equal in terms of whether
>     they apply only
>     > > >     at the
>     > > >      >     frontend level or completely throughout the pipeline.
>     > > >      >
>     > > >      >      > [1]
>     > > >      >
>     > > >      https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
>     > > >      >      >
>     > > >      >      >>> What would be interesting to learn is
>     whether the option
>     > > of
>     > > >      >     refactoring
>     > > >      >      >>> drm_sched to deal with out of order
>     completion was
>     > > >     considered
>     > > >      >     and what were
>     > > >      >      >>> the conclusions.
>     > > >      >      >>>
>     > > >      >      >>
>     > > >      >      >> I coded this up a while back when trying to
>     convert the
>     > > >     i915 to
>     > > >      >     the DRM
>     > > >      >      >> scheduler it isn't all that hard either. The
>     free flow
>     > > >     control
>     > > >      >     on the
>     > > >      >      >> ring (e.g. set job limit == SIZE OF RING /
>     MAX JOB SIZE)
>     > > is
>     > > >      >     really what
>     > > >      >      >> sold me on the this design.
>     > > >      >
>     > > >      >
>     > > >      > You're not the only one to suggest supporting
>     out-of-order
>     > > >     completion.
>     > > >      > However, it's tricky and breaks a lot of internal
>     assumptions of
>     > > the
>     > > >      > scheduler. It also reduces functionality a bit
>     because it can no
>     > > >     longer
>     > > >      > automatically rate-limit HW/FW queues which are often
>     > > >     fixed-size.  (Ok,
>     > > >      > yes, it probably could but it becomes a substantially
>     harder
>     > > >     problem.)
>     > > >      >
>     > > >      > It also seems like a worse mapping to me.  The goal
>     here is to
>     > > turn
>     > > >      > submissions on a userspace-facing engine/queue into
>     submissions
>     > > >     to a FW
>     > > >      > queue submissions, sorting out any dma_fence
>     dependencies.  Matt's
>     > > >      > description of saying this is a 1:1 mapping between
>     sched/entity
>     > > >     doesn't
>     > > >      > tell the whole story. It's a 1:1:1 mapping between
>     xe_engine,
>     > > >      > gpu_scheduler, and GuC FW engine. Why make it a
>     1:something:1
>     > > >     mapping?
>     > > >      > Why is that better?
>     > > >
>     > > >     As I have stated before, what I think what would fit
>     well for Xe is
>     > > one
>     > > >     drm_scheduler per engine class. In specific terms on our
>     current
>     > > >     hardware, one drm scheduler instance for render,
>     compute, blitter,
>     > > >     video
>     > > >     and video enhance. Userspace contexts remain scheduler
>     entities.
>     > > >
>     > > >
>     > > > And this is where we fairly strongly disagree.  More in a bit.
>     > > >
>     > > >     That way you avoid the whole kthread/kworker story and
>     you have it
>     > > >     actually use the entity picking code in the scheduler,
>     which may be
>     > > >     useful when the backend is congested.
>     > > >
>     > > >
>     > > > What back-end congestion are you referring to here?  Running
>     out of FW
>     > > > queue IDs?  Something else?
>     > >
>     > > CT channel, number of context ids.
>     > >
>     > > >
>     > > >     Yes you have to solve the out of order problem so in my
>     mind that is
>     > > >     something to discuss. What the problem actually is (just
>     TDR?), how
>     > > >     tricky and why etc.
>     > > >
>     > > >     And yes you lose the handy LRCA ring buffer size
>     management so you'd
>     > > >     have to make those entities not runnable in some other way.
>     > > >
>     > > >     Regarding the argument you raise below - would any of
>     that make the
>     > > >     frontend / backend separation worse and why? Do you
>     think it is less
>     > > >     natural? If neither is true then all remains is that it
>     appears extra
>     > > >     work to support out of order completion of entities has been
>     > > discounted
>     > > >     in favour of an easy but IMO inelegant option.
>     > > >
>     > > >
>     > > > Broadly speaking, the kernel needs to stop thinking about
>     GPU scheduling
>     > > > in terms of scheduling jobs and start thinking in terms of
>     scheduling
>     > > > contexts/engines.  There is still some need for scheduling
>     individual
>     > > > jobs but that is only for the purpose of delaying them as
>     needed to
>     > > > resolve dma_fence dependencies.  Once dependencies are
>     resolved, they
>     > > > get shoved onto the context/engine queue and from there the
>     kernel only
>     > > > really manages whole contexts/engines.  This is a major
>     architectural
>     > > > shift, entirely different from the way i915 scheduling
>     works.  It's also
>     > > > different from the historical usage of DRM scheduler which I
>     think is
>     > > > why this all looks a bit funny.
>     > > >
>     > > > To justify this architectural shift, let's look at where
>     we're headed.
>     > > > In the glorious future...
>     > > >
>     > > >   1. Userspace submits directly to firmware queues.  The
>     kernel has no
>     > > > visibility whatsoever into individual jobs. At most it can
>     pause/resume
>     > > > FW contexts as needed to handle eviction and memory management.
>     > > >
>     > > >   2. Because of 1, apart from handing out the FW queue IDs
>     at the
>     > > > beginning, the kernel can't really juggle them that much. 
>     Depending on
>     > > > FW design, it may be able to pause a client, give its IDs to
>     another,
>     > > > and then resume it later when IDs free up. What it's not
>     doing is
>     > > > juggling IDs on a job-by-job basis like i915 currently is.
>     > > >
>     > > >   3. Long-running compute jobs may not complete for days. 
>     This means
>     > > > that memory management needs to happen in terms of
>     pause/resume of
>     > > > entire contexts/engines using the memory rather than based
>     on waiting
>     > > > for individual jobs to complete or pausing individual jobs
>     until the
>     > > > memory is available.
>     > > >
>     > > >   4. Synchronization happens via userspace memory fences
>     (UMF) and the
>     > > > kernel is mostly unaware of most dependencies and when a
>     context/engine
>     > > > is or is not runnable.  Instead, it keeps as many of them
>     minimally
>     > > > active (memory is available, even if it's in system RAM) as
>     possible and
>     > > > lets the FW sort out dependencies.  (There may need to be
>     some facility
>     > > > for sleeping a context until a memory change similar to
>     futex() or
>     > > > poll() for userspace threads.  There are some details TBD.)
>     > > >
>     > > > Are there potential problems that will need to be solved
>     here?  Yes.  Is
>     > > > it a good design?  Well, Microsoft has been living in this
>     future for
>     > > > half a decade or better and it's working quite well for
>     them.  It's also
>     > > > the way all modern game consoles work.  It really is just
>     Linux that's
>     > > > stuck with the same old job model we've had since the
>     monumental shift
>     > > > to DRI2.
>     > > >
>     > > > To that end, one of the core goals of the Xe project was to
>     make the
>     > > > driver internally behave as close to the above model as
>     possible while
>     > > > keeping the old-school job model as a very thin layer on
>     top.  As the
>     > > > broader ecosystem problems (window-system support for UMF,
>     for instance)
>     > > > are solved, that layer can be peeled back. The core driver
>     will already
>     > > > be ready for it.
>     > > >
>     > > > To that end, the point of the DRM scheduler in Xe isn't to
>     schedule
>     > > > jobs.  It's to resolve syncobj and dma-buf implicit sync
>     dependencies
>     > > > and stuff jobs into their respective context/engine queue
>     once they're
>     > > > ready.  All the actual scheduling happens in firmware and
>     any scheduling
>     > > > the kernel does to deal with contention, oversubscriptions,
>     too many
>     > > > contexts, etc. is between contexts/engines, not individual
>     jobs.  Sure,
>     > > > the individual job visibility is nice, but if we design
>     around it, we'll
>     > > > never get to the glorious future.
>     > > >
>     > > > I really need to turn the above (with a bit more detail)
>     into a blog
>     > > > post.... Maybe I'll do that this week.
>     > > >
>     > > > In any case, I hope that provides more insight into why Xe
>     is designed
>     > > > the way it is and why I'm pushing back so hard on trying to
>     make it more
>     > > > of a "classic" driver as far as scheduling is concerned. 
>     Are there
>     > > > potential problems here?  Yes, that's why Xe has been labeled a
>     > > > prototype.  Are such radical changes necessary to get to
>     said glorious
>     > > > future?  Yes, I think they are.  Will it be worth it?  I
>     believe so.
>     > >
>     > > Right, that's all solid I think. My takeaway is that frontend
>     priority
>     > > sorting and that stuff isn't needed and that is okay. And that
>     there are
>     > > multiple options to maybe improve drm scheduler, like the fore
>     mentioned
>     > > making it deal with out of order, or split into functional
>     components,
>     > > or split frontend/backend what you suggested. For most of them
>     cost vs
>     > > benefit is more or less not completely clear, neither how much
>     effort
>     > > was invested to look into them.
>     > >
>     > > One thing I missed from this explanation is how drm_scheduler
>     per engine
>     > > class interferes with the high level concepts. And I did not
>     manage to
>     > > pick up on what exactly is the TDR problem in that case. Maybe
>     the two
>     > > are one and the same.
>     > >
>     > > Bottom line is I still have the concern that conversion to
>     kworkers has
>     > > an opportunity to regress. Possibly more opportunity for some
>     Xe use
>     > > cases than to affect other vendors, since they would still be
>     using per
>     > > physical engine / queue scheduler instances.
>     > >
>     > > And to put my money where my mouth is I will try to put testing Xe
>     > > inside the full blown ChromeOS environment in my team plans.
>     It would
>     > > probably also be beneficial if Xe team could take a look at
>     real world
>     > > behaviour of the extreme transcode use cases too. If the stack
>     is ready
>     > > for that and all. It would be better to know earlier rather
>     than later
>     > > if there is a fundamental issue.
>     > >
>     > > For the patch at hand, and the cover letter, it certainly
>     feels it would
>     > > benefit to record the past design discussion had with AMD
>     folks, to
>     > > explicitly copy other drivers, and to record the theoretical
>     pros and
>     > > cons of threads vs unbound workers as I have tried to
>     highlight them.
>     > >
>     > > Regards,
>     > >
>     > > Tvrtko
>     > >
>

[-- Attachment #2: Type: text/html, Size: 60005 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
                   ` (24 preceding siblings ...)
  (?)
@ 2023-01-17 16:12 ` Jason Ekstrand
  -1 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-17 16:12 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

[-- Attachment #1: Type: text/plain, Size: 11436 bytes --]

On Thu, Dec 22, 2022 at 4:29 PM Matthew Brost <matthew.brost@intel.com>
wrote:

> Hello,
>
> This is a submission for Xe, a new driver for Intel GPUs that supports both
> integrated and discrete platforms starting with Tiger Lake (first platform
> with
> Intel Xe Architecture). The intention of this new driver is to have a
> fresh base
> to work from that is unencumbered by older platforms, whilst also taking
> the
> opportunity to rearchitect our driver to increase sharing across the drm
> subsystem, both leveraging and allowing us to contribute more towards other
> shared components like TTM and drm/scheduler. The memory model is based on
> VM
> bind which is similar to the i915 implementation. Likewise the execbuf
> implementation for Xe is very similar to execbuf3 in the i915 [1].
>
> The code is at a stage where it is already functional and has experimental
> support for multiple platforms starting from Tiger Lake, with initial
> support
> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as
> well
> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> implementation will be released publicly early next year. We also have a
> suite
> of IGTs for XE that will appear on the IGT list shortly.
>
> It has been built with the assumption of supporting multiple architectures
> from
> the get-go, right now with tests running both on X86 and ARM hosts. And we
> intend to continue working on it and improving on it as part of the kernel
> community upstream.
>
> The new Xe driver leverages a lot from i915 and work on i915 continues as
> we
> ready Xe for production throughout 2023.
>
> As for display, the intent is to share the display code with the i915
> driver so
> that there is maximum reuse there. Currently this is being done by
> compiling the
> display code twice, but alternatives to that are under consideration and
> we want
> to have more discussion on what the best final solution will look like
> over the
> next few months. Right now, work is ongoing in refactoring the display
> codebase
> to remove as much as possible any unnecessary dependencies on i915
> specific data
> structures there..
>
> We currently have 2 submission backends, execlists and GuC. The execlist is
> meant mostly for testing and is not fully functional while GuC backend is
> fully
> functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> required and should be placed in /lib/firmware/xe.
>
> The GuC firmware can be found in the below location:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
>
> The easiest way to setup firmware is:
> cp -r /lib/firmware/i915 /lib/firmware/xe
>
> The code has been organized such that we have all patches that touch areas
> outside of drm/xe first for review, and then the actual new driver in a
> separate
> commit. The code which is outside of drm/xe is included in this RFC while
> drm/xe is not due to the size of the commit. The drm/xe is code is
> available in
> a public repo listed below.
>
> Xe driver commit:
>
> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7


Drive-by comment here because I don't see any actual xe patches on the list:

You probably want to drop DRM_XE_SYNC_DMA_BUF from the uAPI.  Now that
we've landed the new dma-buf ioctls for sync_file import/export, there's
really no reason to have it as part of submit.  Dropping it should also
make locking a tiny bit easier.
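
For reference, the new ioctls look roughly like this from userspace 
(error handling mostly omitted):

  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  /* Export the dma-buf's current fences as a sync_file fd. */
  int export_sync_file(int dmabuf_fd)
  {
          struct dma_buf_export_sync_file args = {
                  .flags = DMA_BUF_SYNC_RW,
                  .fd = -1,
          };

          if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                  return -1;
          return args.fd;
  }

  /* Import a sync_file fd into the dma-buf as a write fence. */
  int import_sync_file(int dmabuf_fd, int sync_fd)
  {
          struct dma_buf_import_sync_file args = {
                  .flags = DMA_BUF_SYNC_WRITE,
                  .fd = sync_fd,
          };

          return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
  }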

--Jason



> Xe kernel repo:
> https://cgit.freedesktop.org/drm/drm-xe/
>
> There's a lot of work still to happen on Xe but we're very excited about
> it and
> wanted to share it early and welcome feedback and discussion.
>
> Cheers,
> Matthew Brost
>
> [1] https://patchwork.freedesktop.org/series/105879/
> [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20418
>
> Maarten Lankhorst (12):
>   drm/amd: Convert amdgpu to use suballocation helper.
>   drm/radeon: Use the drm suballocation manager implementation.
>   drm/i915: Remove gem and overlay frontbuffer tracking
>   drm/i915/display: Neuter frontbuffer tracking harder
>   drm/i915/display: Add more macros to remove all direct calls to uncore
>   drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
>   drm/i915: Rename find_section to find_bdb_section
>   drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
>   drm/i915/display: Fix a use-after-free when intel_edp_init_connector
>     fails
>   drm/i915/display: Remaining changes to make xe compile
>   sound/hda: Allow XE as i915 replacement for sound
>   mei/hdcp: Also enable for XE
>
> Matthew Brost (5):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
>
> Thomas Hellström (3):
>   drm/suballoc: Introduce a generic suballocation manager
>   drm: Add a gpu page-table walker helper
>   drm/ttm: Don't print error message if eviction was interrupted
>
>  drivers/gpu/drm/Kconfig                       |   5 +
>  drivers/gpu/drm/Makefile                      |   4 +
>  drivers/gpu/drm/amd/amdgpu/Kconfig            |   1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  26 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  12 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        |   5 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  23 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c        | 320 +-----------------
>  drivers/gpu/drm/drm_pt_walk.c                 | 159 +++++++++
>  drivers/gpu/drm/drm_suballoc.c                | 301 ++++++++++++++++
>  drivers/gpu/drm/i915/Makefile                 |   2 +-
>  drivers/gpu/drm/i915/display/hsw_ips.c        |   7 +-
>  drivers/gpu/drm/i915/display/i9xx_plane.c     |   1 +
>  drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
>  .../gpu/drm/i915/display/intel_atomic_plane.c |  25 +-
>  .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
>  drivers/gpu/drm/i915/display/intel_bios.c     |  71 ++--
>  drivers/gpu/drm/i915/display/intel_bw.c       |  36 +-
>  drivers/gpu/drm/i915/display/intel_cdclk.c    |  68 ++--
>  drivers/gpu/drm/i915/display/intel_color.c    |   1 +
>  drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
>  drivers/gpu/drm/i915/display/intel_cursor.c   |  14 +-
>  drivers/gpu/drm/i915/display/intel_de.h       |  38 +++
>  drivers/gpu/drm/i915/display/intel_display.c  | 155 +++++++--
>  drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
>  .../gpu/drm/i915/display/intel_display_core.h |   5 +-
>  .../drm/i915/display/intel_display_debugfs.c  |   8 +
>  .../drm/i915/display/intel_display_power.c    |  40 ++-
>  .../drm/i915/display/intel_display_power.h    |   6 +
>  .../i915/display/intel_display_power_map.c    |   7 +
>  .../i915/display/intel_display_power_well.c   |  24 +-
>  .../drm/i915/display/intel_display_reg_defs.h |   4 +
>  .../drm/i915/display/intel_display_trace.h    |   6 +
>  .../drm/i915/display/intel_display_types.h    |  32 +-
>  drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
>  drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
>  drivers/gpu/drm/i915/display/intel_dp_aux.c   |   6 +
>  drivers/gpu/drm/i915/display/intel_dpio_phy.c |   9 +-
>  drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 +
>  drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
>  drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
>  drivers/gpu/drm/i915/display/intel_drrs.c     |   1 +
>  drivers/gpu/drm/i915/display/intel_dsb.c      | 124 +++++--
>  drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 +-
>  drivers/gpu/drm/i915/display/intel_fb.c       | 108 ++++--
>  drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
>  drivers/gpu/drm/i915/display/intel_fbc.c      |  49 ++-
>  drivers/gpu/drm/i915/display/intel_fbdev.c    | 108 +++++-
>  .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 +-----
>  .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +---
>  drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
>  drivers/gpu/drm/i915/display/intel_hdcp.c     |   9 +-
>  drivers/gpu/drm/i915/display/intel_hdmi.c     |   1 -
>  .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
>  .../drm/i915/display/intel_modeset_setup.c    |  11 +-
>  drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
>  drivers/gpu/drm/i915/display/intel_overlay.c  |  14 -
>  .../gpu/drm/i915/display/intel_pch_display.h  |  16 +
>  .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
>  drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
>  .../drm/i915/display/intel_plane_initial.c    |   3 +-
>  drivers/gpu/drm/i915/display/intel_psr.c      |   1 +
>  drivers/gpu/drm/i915/display/intel_sprite.c   |  21 ++
>  drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
>  drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
>  drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
>  .../drm/i915/display/skl_universal_plane.c    |  52 ++-
>  drivers/gpu/drm/i915/display/skl_watermark.c  |  25 +-
>  drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   4 -
>  drivers/gpu/drm/i915/gem/i915_gem_domain.c    |   7 -
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 -
>  drivers/gpu/drm/i915/gem/i915_gem_object.c    |  25 --
>  drivers/gpu/drm/i915/gem/i915_gem_object.h    |  22 --
>  drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   4 -
>  drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
>  drivers/gpu/drm/i915/i915_driver.c            |   1 +
>  drivers/gpu/drm/i915/i915_gem.c               |   8 -
>  drivers/gpu/drm/i915/i915_gem_gtt.c           |   1 -
>  drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
>  drivers/gpu/drm/i915/i915_vma.c               |  12 -
>  drivers/gpu/drm/radeon/radeon.h               |  55 +--
>  drivers/gpu/drm/radeon/radeon_ib.c            |  12 +-
>  drivers/gpu/drm/radeon/radeon_object.h        |  25 +-
>  drivers/gpu/drm/radeon/radeon_sa.c            | 314 ++---------------
>  drivers/gpu/drm/radeon/radeon_semaphore.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c        | 182 +++++++---
>  drivers/gpu/drm/ttm/ttm_bo.c                  |   3 +-
>  drivers/misc/mei/hdcp/Kconfig                 |   2 +-
>  drivers/misc/mei/hdcp/mei_hdcp.c              |   3 +-
>  include/drm/drm_pt_walk.h                     | 161 +++++++++
>  include/drm/drm_suballoc.h                    | 112 ++++++
>  include/drm/gpu_scheduler.h                   |  41 ++-
>  sound/hda/hdac_i915.c                         |  17 +-
>  sound/pci/hda/hda_intel.c                     |  56 +--
>  sound/soc/intel/avs/core.c                    |  13 +-
>  sound/soc/sof/intel/hda.c                     |   7 +-
>  98 files changed, 2076 insertions(+), 1325 deletions(-)
>  create mode 100644 drivers/gpu/drm/drm_pt_walk.c
>  create mode 100644 drivers/gpu/drm/drm_suballoc.c
>  create mode 100644 include/drm/drm_pt_walk.h
>  create mode 100644 include/drm/drm_suballoc.h
>
> --
> 2.37.3
>
>

[-- Attachment #2: Type: text/html, Size: 13535 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2023-01-12 17:10         ` Matthew Brost
  (?)
@ 2023-01-17 16:40         ` Jason Ekstrand
  -1 siblings, 0 replies; 161+ messages in thread
From: Jason Ekstrand @ 2023-01-17 16:40 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, Lucas De Marchi, dri-devel

[-- Attachment #1: Type: text/plain, Size: 6735 bytes --]

On Thu, Jan 12, 2023 at 11:17 AM Matthew Brost <matthew.brost@intel.com>
wrote:

> On Thu, Jan 12, 2023 at 10:54:25AM +0100, Lucas De Marchi wrote:
> > On Thu, Jan 05, 2023 at 09:27:57PM +0000, Matthew Brost wrote:
> > > On Tue, Jan 03, 2023 at 12:21:08PM +0000, Tvrtko Ursulin wrote:
> > > >
> > > > On 22/12/2022 22:21, Matthew Brost wrote:
> > > > > Hello,
> > > > >
> > > > > This is a submission for Xe, a new driver for Intel GPUs that
> supports both
> > > > > integrated and discrete platforms starting with Tiger Lake (first
> platform with
> > > > > Intel Xe Architecture). The intention of this new driver is to
> have a fresh base
> > > > > to work from that is unencumbered by older platforms, whilst also
> taking the
> > > > > opportunity to rearchitect our driver to increase sharing across
> the drm
> > > > > subsystem, both leveraging and allowing us to contribute more
> towards other
> > > > > shared components like TTM and drm/scheduler. The memory model is
> based on VM
> > > > > bind which is similar to the i915 implementation. Likewise the
> execbuf
> > > > > implementation for Xe is very similar to execbuf3 in the i915 [1].
> > > > >
> > > > > The code is at a stage where it is already functional and has
> experimental
> > > > > support for multiple platforms starting from Tiger Lake, with
> initial support
> > > > > implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan
> drivers), as well
> > > > > as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2]
> and NEO
> > > > > implementation will be released publicly early next year. We also
> have a suite
> > > > > of IGTs for XE that will appear on the IGT list shortly.
> > > > >
> > > > > It has been built with the assumption of supporting multiple
> architectures from
> > > > > the get-go, right now with tests running both on X86 and ARM
> hosts. And we
> > > > > intend to continue working on it and improving on it as part of
> the kernel
> > > > > community upstream.
> > > > >
> > > > > The new Xe driver leverages a lot from i915 and work on i915
> continues as we
> > > > > ready Xe for production throughout 2023.
> > > > >
> > > > > As for display, the intent is to share the display code with the
> i915 driver so
> > > > > that there is maximum reuse there. Currently this is being done by
> compiling the
> > > > > display code twice, but alternatives to that are under
> consideration and we want
> > > > > to have more discussion on what the best final solution will look
> like over the
> > > > > next few months. Right now, work is ongoing in refactoring the
> display codebase
> > > > > to remove as much as possible any unnecessary dependencies on i915
> specific data
> > > > > structures there..
> > > > >
> > > > > We currently have 2 submission backends, execlists and GuC. The
> execlist is
> > > > > meant mostly for testing and is not fully functional while GuC
> backend is fully
> > > > > functional. As with the i915 and GuC submission, in Xe the GuC
> firmware is
> > > > > required and should be placed in /lib/firmware/xe.
> > > >
> > > > What is the plan going forward for the execlists backend? I think it
> would
> > > > be preferable to not upstream something semi-functional and so to
> carry
> > > > technical debt in the brand new code base, from the very start. If
> it is for
> > > > Tigerlake, which is the starting platform for Xe, could it be made
> GuC only
> > > > on Tigerlake, for instance?
> > > >
> > >
> > > A little background here. In the original PoC written by Jason and
> > > Dave, the execlist backend was the only one present and it was in a
> > > semi-working state. As soon as I and a few others started working on
> > > Xe we went all in on the GuC backend. We left the execlist backend
> > > basically in the state it was in. We left it in place for 2 reasons.
> > >
> > > 1. Having 2 backends from the start ensured we layered our code
> > > correctly. The layering was a complete disaster in the i915 so we really
> > > wanted to avoid that.
> > > 2. The thought was it might be needed for early product bring up one
> > > day.
> > >
> > > As I think about this a bit more, we should likely just delete the
> > > execlist backend before merging this upstream and perhaps just carry
> > > 1 large patch internally with this implementation that we can use as
> > > needed. Final decision TBD though.
> >
> > but that might regress after some time on "let's keep 2 backends so we
> > layer the code correctly". Leaving the additional backend behind
> > CONFIG_BROKEN or XE_EXPERIMENTAL, or something like that, not
> > enabled by distros, but enabled in CI would be a good idea IMO.
> >
> > Carrying a large patch out of tree would make things harder for new
> > platforms. A perfect backend split would make it possible, but like I
> > said, we are likely not to have it if we delete the second backend.
> >
>
> Good points here Lucas. One thing that we absolutely have wrong is
> falling back to execlists if GuC firmware is missing. We def should not
> be doing that as it creates confusion.
>

Yeah, we certainly shouldn't be falling back on it silently. That's a
recipe for disaster. If it stays, it should be behind a config option
that's clearly labeled as broken or not intended for production use. If
someone is a zero-firmware purist and wants to enable it and accept the
brokenness, that's their choice.
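
Something along these lines, say (symbol name and wording hypothetical,
just sketching the shape of what Lucas suggested with CONFIG_BROKEN):

config DRM_XE_EXECLIST
	bool "Xe execlist submission backend (broken, bring-up only)"
	depends on DRM_XE && BROKEN
	help
	  Build the semi-functional execlist submission backend. There is
	  no silent fallback to it when GuC firmware is missing; it has to
	  be requested explicitly (e.g. via a module parameter). Keeping it
	  buildable also keeps the submission layering honest. Not intended
	  for production use.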

I'm not especially attached to the execlist back-end so I'm not going to
insist on anything here RE keeping it.

There is more to me starting with execlists than avoiding GuC, though. One
of the reasons I did it was to prove that the same core Xe scheduling model
[3] doesn't depend on firmware. As long as your hardware has some ability
to juggle independent per-context rings, you can get the same separation
and it makes everything cleaner. If this is the direction things are headed
(and I really think it is; I need to blog about it), being able to do the
Xe model on more primitive hardware which lacks competent firmware-based
submission is important. I wanted to prototype that to show that it could
be done.
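
Concretely, the shape being described is roughly this (xe_ring and the
field names are illustrative, not the actual Xe structs):

#include <drm/gpu_scheduler.h>

struct xe_ring;	/* per-context ring, details elided */

/*
 * The 1:1:1 model: each userspace-facing engine owns its own DRM
 * scheduler (used only for dependency tracking) with a single entity,
 * feeding one per-context ring that the backend - GuC or execlist -
 * juggles on the hardware.
 */
struct xe_engine {
	struct drm_gpu_scheduler sched;	/* one scheduler ... */
	struct drm_sched_entity entity;	/* ... with exactly one entity */
	struct xe_ring *ring;		/* per-context ring (illustrative) */
};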

I also kinda wanted to prove that execlists didn't have to be horrible like
in i915. You know, for funzies....

--Jason

[3]:
https://lists.freedesktop.org/archives/dri-devel/2023-January/386381.html



> I kinda like the idea of hiding it behind a config option + module
> parameter to use the backend, so you really, really have to try in order
> to use it. Having this in the code also keeps us disciplined in our
> layering. At some point we will likely add another supported backend and
> at that point we may decide to delete this backend.
>
> Matt
>
> > Lucas De Marchi
> >
> > >
> > > Matt
> > >
> > > > Regards,
> > > >
> > > > Tvrtko
>

[-- Attachment #2: Type: text/html, Size: 8434 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-01-13  0:39                               ` John Harrison
@ 2023-01-18  3:06                                   ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-18  3:06 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel, Jason Ekstrand

On Thu, Jan 12, 2023 at 04:39:32PM -0800, John Harrison wrote:
> On 1/11/2023 14:56, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.brost@intel.com>
> > wrote:
> > 
> >     On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> >     > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> >     > tvrtko.ursulin@linux.intel.com> wrote:
> >     >
> >     > >
> >     [snip]
> >     > >
> >     > > Typically is the key here. But I am not sure it is good
> >     enough. Consider
> >     > > this example - Intel Flex 170:
> >     > >
> >     > >   * Delivers up to 36 streams 1080p60 transcode throughput per
> >     card.
> >     > >   * When scaled to 10 cards in a 4U server configuration, it
> >     can support
> >     > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> >     > >
> >     >
> >     > I had a feeling it was going to be media.... 😅
> >     >
> > 
> >     Yea, wondering if the media UMD can be rewritten to use fewer
> >     xe_engines - it is already a massive rewrite for VM bind + no
> >     implicit dependencies, so let's just pile on some more work?
> > 
> > 
> > It could probably use fewer than it does today.  It currently creates
> > and throws away contexts like crazy, or did last I looked at it. 
> > However, the nature of media encode is that it often spreads across two
> > or three different types of engines.  There's not much you can do to
> > change that.
> And as per Tvrtko's example, you get media servers that transcode huge
> numbers of tiny streams in parallel. Almost no work per frame but 100s of
> independent streams being run concurrently. That means many 100s of contexts
> all trying to run at 30fps. I recall a specific bug about thundering herds -
> hundreds (thousands?) of waiting threads all being woken up at once because
> some request had completed.
> 
> >     >
> >     > > One transcode stream from my experience typically is 3-4 GPU
> >     contexts
> >     > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
> >     a single
> >     > > CPU thread. 4 contexts * 36 streams = 144 active contexts.
> >     Multiply by
> >     > > 60fps = 8640 jobs submitted and completed per second.
> >     > >
> >     > > 144 active contexts in the proposed scheme means possibly
> >     means 144
> >     > > kernel worker threads spawned (driven by 36 transcode CPU
> >     threads). (I
> >     > > don't think the pools would scale down given all are
> >     constantly pinged
> >     > > at 60fps.)
> >     > >
> >     > > And then each of 144 threads goes to grab the single GuC CT
> >     mutex. First
> >     > > threads are being made schedulable, then put to sleep as mutex
> >     > > contention is hit, then woken again as mutexes are getting
> >     released,
> >     > > rinse, repeat.
> >     > >
> >     >
> >     > Why is every submission grabbing the GuC CT mutex? I've not read
> >     the GuC
> >     > back-end yet but I was under the impression that most run_job()
> >     would be
> >     > just shoving another packet into a ring buffer.  If we have to
> >     send the GuC
> >     > a message on the control ring every single time we submit a job,
> >     that's
> >     > pretty horrible.
> >     >
> > 
> >     Run job writes the ring buffer and moves the tail as the first step
> >     (no lock required). Next it needs to tell the GuC that the xe_engine
> >     LRC tail has moved; this is done over a single Host to GuC channel,
> >     which is a circular buffer, with writes to the channel protected by
> >     the mutex. There are a few more nuances too, but in practice there
> >     is always space in the channel, so the time the mutex needs to be
> >     held is really, really small (check cached credits, write 3 dwords
> >     in payload, write 1 dword to move the tail). I also believe mutexes
> >     in Linux are hybrid where they spin for a little bit before
> >     sleeping, and certainly if there is space in the channel we
> >     shouldn't sleep on mutex contention.
> > 
> > 
> > Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we
> > had some way to batch things up a bit so we only have to poke the GuC
> > channel once for every batch of things rather than once per job.  That's
> > maybe something we can look into as a future improvement; not
> > fundamental.
> > 
> > Generally, though, it sounds like contention could be a real problem if
> > we end up ping-ponging that lock between cores.  It's going to depend on
> > how much work it takes to get the next ready thing vs. the cost of that
> > atomic.  But, also, anything we do is going to potentially run into
> > contention problems.  *shrug*  If we were going to go for
> > one-per-HW-engine, we may as well go one-per-device and then we wouldn't
> > need the lock.  Off the top of my head, that doesn't sound great either
> > but IDK.
> > 
> >     As far as this being horrible, well, I didn't design the GuC and
> >     this is how it is implemented for KMD based submission. We also have
> >     256 doorbells so we wouldn't need a lock, but I think there are
> >     other issues with that design too which need to be worked out in the
> >     Xe2 / Xe3 timeframe.
> > 
> > 
> > Yeah, not blaming you.  Just surprised, that's all.  How does it work
> > for userspace submission?  What would it look like if the kernel
> > emulated userspace submission?  Is that even possible?
> > 
> > What are these doorbell things?  How do they play into it?
> Basically a bank of MMIO space reserved per 'entity' where a write to that
> MMIO space becomes a named interrupt to GuC. You can assign each doorbell
> to a specific GuC context. So writing to that doorbell address is
> effectively the same as sending a SCHEDULE_CONTEXT H2G message from the KMD
> for that context. But the advantage is you ring the doorbell from user land
> with no call into the kernel at all. Or from within the kernel, you can do
> it without needing any locks at all. Problem is, we have 64K contexts in GuC
> but only 256 doorbells in the hardware. Less if using SRIOV. So the "per
> 'entity'" part because somewhat questionable as to exactly what the 'entity'
> is. And hence we just haven't bothered supporting them in Linux because a)
> no direct submission from user land yet, and b) as Matthew says, the entire
> chain of IOCTL from UMD to kernel to acquiring a lock and sending the H2G has
> generally been fast enough. The latency only becomes an issue for ULLS
> people but for them, even the doorbells from user space are too high a
> latency because that still potentially involves the GuC having to do some
> scheduling and context switch type action.
> 
> John.
> 

I talked with Jason on IRC last week about doorbells and after chatting
we came up with the idea of allocating the doorbells with a greedy
algorithm, which results in the first 256 xe_engines each getting their
own doorbell, thus avoiding contention on the CT channel / lock (this is
still KMD submission).
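
Conceptually something like this (a hypothetical sketch, not the
prototype code; the struct layout and names are made up):

#include <linux/gfp.h>
#include <linux/idr.h>

#define XE_NUM_DOORBELLS	256

/* Illustrative engine struct; the real xe_engine has much more state. */
struct xe_engine {
	int doorbell_id;	/* valid id, or -1 when using the CT channel */
};

static DEFINE_IDA(xe_doorbell_ida);

/*
 * Greedy allocation: the first XE_NUM_DOORBELLS engines each hold a
 * doorbell for their lifetime; any engine beyond that falls back to
 * submission over the shared CT channel (and its lock).
 */
static void xe_engine_doorbell_alloc(struct xe_engine *e)
{
	e->doorbell_id = ida_alloc_max(&xe_doorbell_ida,
				       XE_NUM_DOORBELLS - 1, GFP_KERNEL);
	if (e->doorbell_id < 0)		/* -ENOSPC once all are taken */
		e->doorbell_id = -1;	/* use the CT channel instead */
}

static void xe_engine_doorbell_free(struct xe_engine *e)
{
	if (e->doorbell_id >= 0)
		ida_free(&xe_doorbell_ida, e->doorbell_id);
}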

Coded up a prototype for this and initial test results of
xe_exec_threads /w 245 user xe_engines, 5 threads, and 40k total execs
are an average of .824s with doorbells vs. .923s without. Or in other
words, 49714 execs per second with doorbells vs. 44353 without. This
seems to indicate using doorbells can provide a performance improvement.
Also Jason and I reasoned we should be able to use doorbells 99% of the
time aside from maybe some wacky media use cases. I also plan on
following up with the media UMD to see if we can get them to use fewer
xe_engines.

Matt

> 
> >     Also, if you see my follow up response, Xe is ~33k execs per second
> >     with the current implementation on an 8 core (or maybe 8 thread)
> >     TGL, which seems fine to me.
> > 
> > 
> > 33k exec/sec is about 500/frame which should be fine. 500 is a lot for a
> > single frame.  I typically tell game devs to shoot for dozens per
> > frame.  The important thing is that it stays low even with hundreds of
> > memory objects bound. (Xe should be just fine there.)
> > 
> > --Jason
> > 
> >     Matt
> > 
> >     > --Jason
> >     >
> >     >
> >     > (And yes this backend contention is there regardless of 1:1:1,
> >     it would
> >     > > require a different re-design to solve that. But it is just a
> >     question
> >     > > whether there are 144 contending threads, or just 6 with the
> >     thread per
> >     > > engine class scheme.)
> >     > >
> >     > > Then multiply all by 10 for a 4U server use case and you get
> >     1440 worker
> >     > > kthreads, yes 10 more CT locks, but contending on how many CPU
> >     cores?
> >     > > Just so they can grab a timeslice and maybe content on a mutex
> >     as the
> >     > > next step.
> >     > >
> >     > > This example is where it would hurt on large systems. Imagine
> >     only an
> >     > > even wider media transcode card...
> >     > >
> >     > > Second example is only a single engine class used (3d
> >     desktop?) but with
> >     > > a bunch of not-runnable jobs queued and waiting on a fence to
> >     signal.
> >     > > Implicit or explicit dependencies doesn't matter. Then the
> >     fence signals
> >     > > and call backs run. N work items get scheduled, but they all
> >     submit to
> >     > > the same HW engine. So we end up with:
> >     > >
> >     > >          /-- wi1 --\
> >     > >         / ..     .. \
> >     > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> >     > >         \ ..    ..  /
> >     > >          \-- wiN --/
> >     > >
> >     > >
> >     > > All that we have achieved is waking up N CPUs to contend on
> >     the same
> >     > > lock and effectively insert the job into the same single HW
> >     queue. I
> >     > > don't see any positives there.
> >     > >
> >     > > This example I think can particularly hurt small / low power
> >     devices
> >     > > because of needless waking up of many cores for no benefit.
> >     Granted, I
> >     > > don't have a good feel on how common this pattern is in practice.
> >     > >
> >     > > >
> >     > > >     That
> >     > > >     is the number which drives the maximum number of
> >     not-runnable jobs
> >     > > that
> >     > > >     can become runnable at once, and hence spawn that many
> >     work items,
> >     > > and
> >     > > >     in turn unbound worker threads.
> >     > > >
> >     > > >     Several problems there.
> >     > > >
> >     > > >     It is fundamentally pointless to have potentially that
> >     many more
> >     > > >     threads
> >     > > >     than the number of CPU cores - it simply creates a
> >     scheduling storm.
> >     > > >
> >     > > >     Unbound workers have no CPU / cache locality either and
> >     no connection
> >     > > >     with the CPU scheduler to optimize scheduling patterns.
> >     This may
> >     > > matter
> >     > > >     either on large systems or on small ones. Whereas the
> >     current design
> >     > > >     allows for scheduler to notice userspace CPU thread
> >     keeps waking up
> >     > > the
> >     > > >     same drm scheduler kernel thread, and so it can keep
> >     them on the same
> >     > > >     CPU, the unbound workers lose that ability and so 2nd
> >     CPU might be
> >     > > >     getting woken up from low sleep for every submission.
> >     > > >
> >     > > >      >     Hence, apart from being a bit of an impedance mismatch,
> >     the proposal
> >     > > has
> >     > > >     the potential to change performance and power patterns
> >     on both large
> >     > > >     and small machines.
> >     > > >
> >     > > >
> >     > > > Ok, thanks for explaining the issue you're seeing in more
> >     detail.  Yes,
> >     > > > deferred kwork does appear to mismatch somewhat with what
> >     the scheduler
> >     > > > needs or at least how it's worked in the past.  How much
> >     impact will
> >     > > > that mismatch have?  Unclear.
> >     > > >
> >     > > >      >      >>> Secondly, it probably demands separate
> >     workers (not
> >     > > >     optional),
> >     > > >      >     otherwise
> >     > > >      >      >>> behaviour of shared workqueues has either
> >     the potential
> >     > > to
> >     > > >      >     explode number
> >     > > >      >      >>> kernel threads anyway, or add latency.
> >     > > >      >      >>>
> >     > > >      >      >>
> >     > > >      >      >> Right now the system_unbound_wq is used which
> >     does have a
> >     > > >     limit
> >     > > >      >     on the
> >     > > >      >      >> number of threads, right? I do have a FIXME
> >     to allow a
> >     > > >     worker to be
> >     > > >      >      >> passed in similar to TDR.
> >     > > >      >      >>
> >     > > >      >      >> WRT to latency, the 1:1 ratio could actually
> >     have lower
> >     > > >     latency
> >     > > >      >     as 2 GPU
> >     > > >      >      >> schedulers can be pushing jobs into the backend /
> >     > > cleaning up
> >     > > >      >     jobs in
> >     > > >      >      >> parallel.
> >     > > >      >      >>
> >     > > >      >      >
> >     > > >      >      > Thought of one more point here on why in Xe we
> >     > > >     absolutely want
> >     > > >      >     a 1 to
> >     > > >      >      > 1 ratio between entity and scheduler - the way
> >     we implement
> >     > > >      >     timeslicing
> >     > > >      >      > for preempt fences.
> >     > > >      >      >
> >     > > >      >      > Let me try to explain.
> >     > > >      >      >
> >     > > >      >      > Preempt fences are implemented via the generic
> >     messaging
> >     > > >      >     interface [1]
> >     > > >      >      > with suspend / resume messages. If a suspend
> >     messages is
> >     > > >     received to
> >     > > >      >      > soon after calling resume (this is per entity)
> >     we simply
> >     > > >     sleep in the
> >     > > >      >      > suspend call thus giving the entity a
> >     timeslice. This
> >     > > >     completely
> >     > > >      >     falls
> >     > > >      >      > apart with a many to 1 relationship as now an
> >     entity
> >     > > >     waiting for a
> >     > > >      >      > timeslice blocks the other entities. Could we
> >     work around
> >     > > >     this,
> >     > > >      >     sure but
> >     > > >      >      > just another bunch of code we'd have to add in
> >     Xe. Being to
> >     > > >      >     freely sleep
> >     > > >      >      > in backend without affecting other entities is
> >     really,
> >     > > really
> >     > > >      >     nice IMO
> >     > > >      >      > and I bet Xe isn't the only driver that is
> >     going to feel
> >     > > >     this way.
> >     > > >      >      >
> >     > > >      >      > Last thing I'll say regardless of how anyone
> >     feels about
> >     > > >     Xe using
> >     > > >      >     a 1 to
> >     > > >      >      > 1 relationship this patch IMO makes sense as I
> >     hope we can
> >     > > all
> >     > > >      >     agree a
> >     > > >      >      > workqueue scales better than kthreads.
> >     > > >      >
> >     > > >      >     I don't know for sure what will scale better and
> >     for what use
> >     > > >     case,
> >     > > >      >     combination of CPU cores vs number of GPU engines
> >     to keep
> >     > > >     busy vs other
> >     > > >      >     system activity. But I wager someone is bound to
> >     ask for some
> >     > > >      >     numbers to
> >     > > >      >     make sure proposal is not negatively affecting
> >     any other
> >     > > drivers.
> >     > > >      >
> >     > > >      >
> >     > > >      > Then let them ask.  Waving your hands vaguely in the
> >     direction of
> >     > > >     the
> >     > > >      > rest of DRM and saying "Uh, someone (not me) might
> >     object" is
> >     > > >     profoundly
> >     > > >      > unhelpful.  Sure, someone might. That's why it's on
> >     dri-devel.
> >     > > >     If you
> >     > > >      > think there's someone in particular who might have a
> >     useful
> >     > > >     opinion on
> >     > > >      > this, throw them in the CC so they don't miss the
> >     e-mail thread.
> >     > > >      >
> >     > > >      > Or are you asking for numbers?  If so, what numbers
> >     are you
> >     > > >     asking for?
> >     > > >
> >     > > >     It was a heads up to the Xe team in case people weren't
> >     appreciating
> >     > > >     how
> >     > > >     the proposed change has the potential to influence power
> >     and performance
> >     > > >     across the board. And nothing in the follow up
> >     discussion made me
> >     > > think
> >     > > >     it was considered so I don't think it was redundant to
> >     raise it.
> >     > > >
> >     > > >     In my experience it is typical that such core changes
> >     come with some
> >     > > >     numbers. Which is in case of drm scheduler is tricky and
> >     probably
> >     > > >     requires explicitly asking everyone to test (rather than
> >     count on
> >     > > >     "don't
> >     > > >     miss the email thread"). Real products can fail to ship
> >     due to ten mW
> >     > > here
> >     > > >     or there. Like suddenly an extra core prevented from
> >     getting into
> >     > > deep
> >     > > >     sleep.
> >     > > >
> >     > > >     If that was "profoundly unhelpful" so be it.
> >     > > >
> >     > > >
> >     > > > With your above explanation, it makes more sense what you're
> >     asking.
> >     > > > It's still not something Matt is likely to be able to
> >     provide on his
> >     > > > own.  We need to tag some other folks and ask them to test
> >     it out.  We
> >     > > > could play around a bit with it on Xe but it's not exactly
> >     production
> >     > > > grade yet and is going to hit this differently from most. 
> >     Likely
> >     > > > candidates are probably AMD and Freedreno.
> >     > >
> >     > > Whoever is setup to check out power and performance would be
> >     good to
> >     > > give it a spin, yes.
> >     > >
> >     > > PS. I don't think I was asking Matt to test with other
> >     devices. To start
> >     > > with I think Xe is a team effort. I was asking for more
> >     background on
> >     > > the design decision since patch 4/20 does not say anything on that
> >     > > angle, nor was it IMO sufficiently addressed later in the thread.
> >     > >
> >     > > >      > Also, if we're talking about a design that might
> >     paint us into an
> >     > > >      > Intel-HW-specific hole, that would be one thing.  But
> >     we're not.
> >     > > >     We're
> >     > > >      > talking about switching which kernel threading/task
> >     mechanism to
> >     > > >     use for
> >     > > >      > what's really a very generic problem.  The core Xe
> >     design works
> >     > > >     without
> >     > > >      > this patch (just with more kthreads).  If we land
> >     this patch or
> >     > > >      > something like it and get it wrong and it causes a
> >     performance
> >     > > >     problem
> >     > > >      > for someone down the line, we can revisit it.
> >     > > >
> >     > > >     For some definition of "it works" - I really wouldn't
> >     suggest
> >     > > >     shipping a
> >     > > >     kthread per user context at any point.
> >     > > >
> >     > > >
> >     > > > You have yet to elaborate on why. What resources is it
> >     consuming that's
> >     > > > going to be a problem? Are you anticipating CPU affinity
> >     problems? Or
> >     > > > does it just seem wasteful?
> >     > >
> >     > > Well I don't know, commit message says the approach does not
> >     scale. :)
> >     > >
> >     > > > I think I largely agree that it's probably
> >     unnecessary/wasteful but
> >     > > > reducing the number of kthreads seems like a tractable
> >     problem to solve
> >     > > > regardless of where we put the gpu_scheduler object.  Is
> >     this the right
> >     > > > solution?  Maybe not.  It was also proposed at one point
> >     that we could
> >     > > > split the scheduler into two pieces: A scheduler which owns
> >     the kthread,
> >     > > > and a back-end which targets some HW ring thing where you
> >     can have
> >     > > > multiple back-ends per scheduler.  That's certainly more
> >     invasive from a
> >     > > > DRM scheduler internal API PoV but would solve the kthread
> >     problem in a
> >     > > > way that's more similar to what we have now.
> >     > > >
> >     > > >      >     In any case that's a low level question caused by
> >     the high
> >     > > >     level design
> >     > > >      >     decision. So I'd think first focus on the high
> >     level - which
> >     > > >     is the 1:1
> >     > > >      >     mapping of entity to scheduler instance proposal.
> >     > > >      >
> >     > > >      >     Fundamentally it will be up to the DRM
> >     maintainers and the
> >     > > >     community to
> >     > > >      >     bless your approach. And it is important to
> >     stress 1:1 is
> >     > > about
> >     > > >      >     userspace contexts, so I believe unlike any other
> >     current
> >     > > >     scheduler
> >     > > >      >     user. And also important to stress this
> >     effectively does not
> >     > > >     make Xe
> >     > > >      >     _really_ use the scheduler that much.
> >     > > >      >
> >     > > >      >
> >     > > >      > I don't think this makes Xe nearly as much of a
> >     one-off as you
> >     > > >     think it
> >     > > >      > does.  I've already told the Asahi team working on
> >     Apple M1/2
> >     > > >     hardware
> >     > > >      > to do it this way and it seems to be a pretty good
> >     mapping for
> >     > > >     them. I
> >     > > >      > believe this is roughly the plan for nouveau as
> >     well.  It's not
> >     > > >     the way
> >     > > >      > it currently works for anyone because most other
> >     groups aren't
> >     > > >     doing FW
> >     > > >      > scheduling yet.  In the world of FW scheduling and
> >     hardware
> >     > > >     designed to
> >     > > >      > support userspace direct-to-FW submit, I think the
> >     design makes
> >     > > >     perfect
> >     > > >      > sense (see below) and I expect we'll see more drivers
> >     move in this
> >     > > >      > direction as those drivers evolve. (AMD is doing some
> >     customish
> >     > > >     thing
> >     > > >      > for how with gpu_scheduler on the front-end somehow.
> >     I've not dug
> >     > > >     into
> >     > > >      > those details.)
> >     > > >      >
> >     > > >      >     I can only offer my opinion, which is that the
> >     two options
> >     > > >     mentioned in
> >     > > >      >     this thread (either improve drm scheduler to cope
> >     with what is
> >     > > >      >     required,
> >     > > >      >     or split up the code so you can use just the parts of
> >     > > >     drm_sched which
> >     > > >      >     you want - which is frontend dependency tracking)
> >     shouldn't
> >     > > be so
> >     > > >      >     readily dismissed, given how I think the idea was
> >     for the new
> >     > > >     driver to
> >     > > >      >     work less in a silo and more in the community
> >     (not do kludges
> >     > > to
> >     > > >      >     workaround stuff because it is thought to be too
> >     hard to
> >     > > >     improve common
> >     > > >      >     code), but fundamentally, "goto previous
> >     paragraph" for what
> >     > > I am
> >     > > >      >     concerned.
> >     > > >      >
> >     > > >      >
> >     > > >      > Meta comment:  It appears as if you're falling into
> >     the standard
> >     > > >     i915
> >     > > >      > team trap of having an internal discussion about what the
> >     > > community
> >     > > >      > discussion might look like instead of actually having the
> >     > > community
> >     > > >      > discussion.  If you are seriously concerned about
> >     interactions
> >     > > with
> >     > > >      > other drivers or whether or setting common direction,
> >     the right
> >     > > >     way to
> >     > > >      > do that is to break a patch or two out into a
> >     separate RFC series
> >     > > >     and
> >     > > >      > tag a handful of driver maintainers.  Trying to
> >     predict the
> >     > > >     questions
> >     > > >      > other people might ask is pointless. Cc them and
> >     ask for their
> >     > > >     input
> >     > > >      > instead.
> >     > > >
> >     > > >     I don't follow you here. It's not an internal discussion
> >     - I am
> >     > > raising
> >     > > >     my concerns on the design publicly. I am supposed to
> >     write a patch to
> >     > > >     show something, but am allowed to comment on a RFC series?
> >     > > >
> >     > > >
> >     > > > I may have misread your tone a bit.  It felt a bit like too many
> >     > > > discussions I've had in the past where people are trying to
> >     predict what
> >     > > > others will say instead of just asking them. Reading it
> >     again, I was
> >     > > > probably jumping to conclusions a bit.  Sorry about that.
> >     > >
> >     > > Okay no problem, thanks. In any case we don't have to keep
> >     discussing
> >     > > it, since I wrote one or two emails ago it is fundamentally on the
> >     > > maintainers and community to ack the approach. I only felt
> >     like RFC did
> >     > > not explain the potential downsides sufficiently so I wanted
> >     to probe
> >     > > that area a bit.
> >     > >
> >     > > >     It is "drm/sched: Convert drm scheduler to use a work
> >     queue rather
> >     > > than
> >     > > >     kthread" which should have Cc-ed _everyone_ who use drm
> >     scheduler.
> >     > > >
> >     > > >
> >     > > > Yeah, it probably should have.  I think that's mostly what
> >     I've been
> >     > > > trying to say.
> >     > > >
> >     > > >      >
> >     > > >      >     Regards,
> >     > > >      >
> >     > > >      >     Tvrtko
> >     > > >      >
> >     > > >      >     P.S. And as a related side note, there are more
> >     areas where
> >     > > >     drm_sched
> >     > > >      >     could be improved, like for instance priority
> >     handling.
> >     > > >      >     Take a look at msm_submitqueue_create /
> >     > > >     msm_gpu_convert_priority /
> >     > > >      >     get_sched_entity to see how msm works around the
> >     drm_sched
> >     > > >     hardcoded
> >     > > >      >     limit of available priority levels, in order to
> >     avoid having
> >     > > >     to leave a
> >     > > >      >     hw capability unused. I suspect msm would be
> >     happier if they
> >     > > >     could have
> >     > > >      >     all priority levels equal in terms of whether
> >     they apply only
> >     > > >     at the
> >     > > >      >     frontend level or completely throughout the pipeline.
> >     > > >      >
> >     > > >      >      > [1]
> >     > > >      >
> >     > > >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     > > >      >      >
> >     > > >      >      >>> What would be interesting to learn is
> >     whether the option
> >     > > of
> >     > > >      >     refactoring
> >     > > >      >      >>> drm_sched to deal with out of order
> >     completion was
> >     > > >     considered
> >     > > >      >     and what were
> >     > > >      >      >>> the conclusions.
> >     > > >      >      >>>
> >     > > >      >      >>
> >     > > >      >      >> I coded this up a while back when trying to
> >     convert the
> >     > > >     i915 to
> >     > > >      >     the DRM
> >     > > >      >      >> scheduler; it isn't all that hard either. The
> >     free flow
> >     > > >     control
> >     > > >      >     on the
> >     > > >      >      >> ring (e.g. set job limit == SIZE OF RING /
> >     MAX JOB SIZE)
> >     > > is
> >     > > >      >     really what
> >     > > >      >      >> sold me on this design.
> >     > > >      >
> >     > > >      >
> >     > > >      > You're not the only one to suggest supporting
> >     out-of-order
> >     > > >     completion.
> >     > > >      > However, it's tricky and breaks a lot of internal
> >     assumptions of
> >     > > the
> >     > > >      > scheduler. It also reduces functionality a bit
> >     because it can no
> >     > > >     longer
> >     > > >      > automatically rate-limit HW/FW queues which are often
> >     > > >     fixed-size.  (Ok,
> >     > > >      > yes, it probably could but it becomes a substantially
> >     harder
> >     > > >     problem.)
> >     > > >      >
> >     > > >      > It also seems like a worse mapping to me.  The goal
> >     here is to
> >     > > turn
> >     > > >      > submissions on a userspace-facing engine/queue into
> >     submissions
> >     > > >     to a FW
> >     > > >      > queue, sorting out any dma_fence
> >     dependencies.  Matt's
> >     > > >      > description of saying this is a 1:1 mapping between
> >     sched/entity
> >     > > >     doesn't
> >     > > >      > tell the whole story. It's a 1:1:1 mapping between
> >     xe_engine,
> >     > > >      > gpu_scheduler, and GuC FW engine. Why make it a
> >     1:something:1
> >     > > >     mapping?
> >     > > >      > Why is that better?
> >     > > >
> >     > > >     As I have stated before, what I think would fit
> >     well for Xe is
> >     > > one
> >     > > >     drm_scheduler per engine class. In specific terms on our
> >     current
> >     > > >     hardware, one drm scheduler instance for render,
> >     compute, blitter,
> >     > > >     video
> >     > > >     and video enhance. Userspace contexts remain scheduler
> >     entities.
> >     > > >
> >     > > >
> >     > > > And this is where we fairly strongly disagree.  More in a bit.
> >     > > >
> >     > > >     That way you avoid the whole kthread/kworker story and
> >     you have it
> >     > > >     actually use the entity picking code in the scheduler,
> >     which may be
> >     > > >     useful when the backend is congested.
> >     > > >
> >     > > >
> >     > > > What back-end congestion are you referring to here?  Running
> >     out of FW
> >     > > > queue IDs?  Something else?
> >     > >
> >     > > CT channel, number of context ids.
> >     > >
> >     > > >
> >     > > >     Yes you have to solve the out of order problem so in my
> >     mind that is
> >     > > >     something to discuss. What the problem actually is (just
> >     TDR?), how
> >     > > >     tricky and why etc.
> >     > > >
> >     > > >     And yes you lose the handy LRCA ring buffer size
> >     management so you'd
> >     > > >     have to make those entities not runnable in some other way.
> >     > > >
> >     > > >     Regarding the argument you raise below - would any of
> >     that make the
> >     > > >     frontend / backend separation worse and why? Do you
> >     think it is less
> >     > > >     natural? If neither is true then all remains is that it
> >     appears extra
> >     > > >     work to support out of order completion of entities has been
> >     > > discounted
> >     > > >     in favour of an easy but IMO inelegant option.
> >     > > >
> >     > > >
> >     > > > Broadly speaking, the kernel needs to stop thinking about
> >     GPU scheduling
> >     > > > in terms of scheduling jobs and start thinking in terms of
> >     scheduling
> >     > > > contexts/engines.  There is still some need for scheduling
> >     individual
> >     > > > jobs but that is only for the purpose of delaying them as
> >     needed to
> >     > > > resolve dma_fence dependencies.  Once dependencies are
> >     resolved, they
> >     > > > get shoved onto the context/engine queue and from there the
> >     kernel only
> >     > > > really manages whole contexts/engines.  This is a major
> >     architectural
> >     > > > shift, entirely different from the way i915 scheduling
> >     works.  It's also
> >     > > > different from the historical usage of DRM scheduler which I
> >     think is
> >     > > > why this all looks a bit funny.
> >     > > >
> >     > > > To justify this architectural shift, let's look at where
> >     we're headed.
> >     > > > In the glorious future...
> >     > > >
> >     > > >   1. Userspace submits directly to firmware queues.  The
> >     kernel has no
> >     > > > visibility whatsoever into individual jobs. At most it can
> >     pause/resume
> >     > > > FW contexts as needed to handle eviction and memory management.
> >     > > >
> >     > > >   2. Because of 1, apart from handing out the FW queue IDs
> >     at the
> >     > > > beginning, the kernel can't really juggle them that much. 
> >     Depending on
> >     > > > FW design, it may be able to pause a client, give its IDs to
> >     another,
> >     > > > and then resume it later when IDs free up. What it's not
> >     doing is
> >     > > > juggling IDs on a job-by-job basis like i915 currently is.
> >     > > >
> >     > > >   3. Long-running compute jobs may not complete for days. 
> >     This means
> >     > > > that memory management needs to happen in terms of
> >     pause/resume of
> >     > > > entire contexts/engines using the memory rather than based
> >     on waiting
> >     > > > for individual jobs to complete or pausing individual jobs
> >     until the
> >     > > > memory is available.
> >     > > >
> >     > > >   4. Synchronization happens via userspace memory fences
> >     (UMF) and the
> >     > > > kernel is mostly unaware of most dependencies and when a
> >     context/engine
> >     > > > is or is not runnable.  Instead, it keeps as many of them
> >     minimally
> >     > > > active (memory is available, even if it's in system RAM) as
> >     possible and
> >     > > > lets the FW sort out dependencies.  (There may need to be
> >     some facility
> >     > > > for sleeping a context until a memory change similar to
> >     futex() or
> >     > > > poll() for userspace threads.  There are some details TBD.)
> >     > > >
> >     > > > Are there potential problems that will need to be solved
> >     here?  Yes.  Is
> >     > > > it a good design?  Well, Microsoft has been living in this
> >     future for
> >     > > > half a decade or better and it's working quite well for
> >     them.  It's also
> >     > > > the way all modern game consoles work.  It really is just
> >     Linux that's
> >     > > > stuck with the same old job model we've had since the
> >     monumental shift
> >     > > > to DRI2.
> >     > > >
> >     > > > To that end, one of the core goals of the Xe project was to
> >     make the
> >     > > > driver internally behave as close to the above model as
> >     possible while
> >     > > > keeping the old-school job model as a very thin layer on
> >     top.  As the
> >     > > > broader ecosystem problems (window-system support for UMF,
> >     for instance)
> >     > > > are solved, that layer can be peeled back. The core driver
> >     will already
> >     > > > be ready for it.
> >     > > >
> >     > > > To that end, the point of the DRM scheduler in Xe isn't to
> >     schedule
> >     > > > jobs.  It's to resolve syncobj and dma-buf implicit sync
> >     dependencies
> >     > > > and stuff jobs into their respective context/engine queue
> >     once they're
> >     > > > ready.  All the actual scheduling happens in firmware and
> >     any scheduling
> >     > > > the kernel does to deal with contention, oversubscriptions,
> >     too many
> >     > > > contexts, etc. is between contexts/engines, not individual
> >     jobs.  Sure,
> >     > > > the individual job visibility is nice, but if we design
> >     around it, we'll
> >     > > > never get to the glorious future.
> >     > > >
> >     > > > I really need to turn the above (with a bit more detail)
> >     into a blog
> >     > > > post.... Maybe I'll do that this week.
> >     > > >
> >     > > > In any case, I hope that provides more insight into why Xe
> >     is designed
> >     > > > the way it is and why I'm pushing back so hard on trying to
> >     make it more
> >     > > > of a "classic" driver as far as scheduling is concerned. 
> >     Are there
> >     > > > potential problems here?  Yes, that's why Xe has been labeled a
> >     > > > prototype.  Are such radical changes necessary to get to
> >     said glorious
> >     > > > future?  Yes, I think they are.  Will it be worth it?  I
> >     believe so.
> >     > >
> >     > > Right, that's all solid I think. My takeaway is that frontend
> >     priority
> >     > > sorting and that stuff isn't needed and that is okay. And that
> >     there are
> >     > > multiple options to maybe improve drm scheduler, like the
> >     aforementioned
> >     > > making it deal with out of order, or split into functional
> >     components,
> >     > > or split frontend/backend what you suggested. For most of them
> >     cost vs
> >     > > benefit is more or less not completely clear, neither how much
> >     effort
> >     > > was invested to look into them.
> >     > >
> >     > > One thing I missed from this explanation is how drm_scheduler
> >     per engine
> >     > > class interferes with the high level concepts. And I did not
> >     manage to
> >     > > pick up on what exactly is the TDR problem in that case. Maybe
> >     the two
> >     > > are one and the same.
> >     > >
> >     > > Bottom line is I still have the concern that conversion to
> >     kworkers has
> >     > > an opportunity to regress. Possibly more opportunity for some
> >     Xe use
> >     > > cases than to affect other vendors, since they would still be
> >     using per
> >     > > physical engine / queue scheduler instances.
> >     > >
> >     > > And to put my money where my mouth is I will try to put testing Xe
> >     > > inside the full blown ChromeOS environment in my team plans.
> >     It would
> >     > > probably also be beneficial if Xe team could take a look at
> >     real world
> >     > > behaviour of the extreme transcode use cases too. If the stack
> >     is ready
> >     > > for that and all. It would be better to know earlier rather
> >     than later
> >     > > if there is a fundamental issue.
> >     > >
> >     > > For the patch at hand, and the cover letter, it certainly
> >     feels it would
> >     > > benefit to record the past design discussion had with AMD
> >     folks, to
> >     > > explicitly copy other drivers, and to record the theoretical
> >     pros and
> >     > > cons of threads vs unbound workers as I have tried to
> >     highlight them.
> >     > >
> >     > > Regards,
> >     > >
> >     > > Tvrtko
> >     > >
> > 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread
@ 2023-01-18  3:06                                   ` Matthew Brost
  0 siblings, 0 replies; 161+ messages in thread
From: Matthew Brost @ 2023-01-18  3:06 UTC (permalink / raw)
  To: John Harrison; +Cc: intel-gfx, dri-devel

On Thu, Jan 12, 2023 at 04:39:32PM -0800, John Harrison wrote:
> On 1/11/2023 14:56, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.brost@intel.com>
> > wrote:
> > 
> >     On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> >     > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> >     > tvrtko.ursulin@linux.intel.com> wrote:
> >     >
> >     > >
> >     [snip]
> >     > >
> >     > > Typically is the key here. But I am not sure it is good
> >     enough. Consider
> >     > > this example - Intel Flex 170:
> >     > >
> >     > >   * Delivers up to 36 streams 1080p60 transcode throughput per
> >     card.
> >     > >   * When scaled to 10 cards in a 4U server configuration, it
> >     can support
> >     > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> >     > >
> >     >
> >     > I had a feeling it was going to be media.... 😅
> >     >
> > 
> >     Yea wondering the media UMD can be rewritten to use less
> >     xe_engines, it
> >     is massive rewrite for VM bind + no implicit dependencies so let's
> >     just
> >     pile on some more work?
> > 
> > 
> > It could probably use fewer than it does today.  It currently creates
> > and throws away contexts like crazy, or did last I looked at it. 
> > However, the nature of media encode is that it often spreads across two
> > or three different types of engines.  There's not much you can do to
> > change that.
> And as per Tvrtko's example, you get media servers that transcode huge
> numbers of tiny streams in parallel. Almost no work per frame but 100s of
> independent streams being run concurrently. That means many 100s of contexts
> all trying to run at 30fps. I recall a specific bug about thundering herds -
> hundreds (thousands?) of waiting threads all being woken up at once because
> some request had completed.
> 
> >     >
> >     > > One transcode stream from my experience typically is 3-4 GPU
> >     contexts
> >     > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
> >     a single
> >     > > CPU thread. 4 contexts * 36 streams = 144 active contexts.
> >     Multiply by
> >     > > 60fps = 8640 jobs submitted and completed per second.
> >     > >
> >     > > 144 active contexts in the proposed scheme possibly means 144
> >     > > kernel worker threads spawned (driven by 36 transcode CPU
> >     threads). (I
> >     > > don't think the pools would scale down given all are
> >     constantly pinged
> >     > > at 60fps.)
> >     > >
> >     > > And then each of 144 threads goes to grab the single GuC CT
> >     mutex. First
> >     > > threads are being made schedulable, then put to sleep as mutex
> >     > > contention is hit, then woken again as mutexes are getting
> >     released,
> >     > > rinse, repeat.
> >     > >
> >     >
> >     > Why is every submission grabbing the GuC CT mutex? I've not read
> >     the GuC
> >     > back-end yet but I was under the impression that most run_job() calls
> >     would be
> >     > just shoving another packet into a ring buffer.  If we have to
> >     send the GuC
> >     > a message on the control ring every single time we submit a job,
> >     that's
> >     > pretty horrible.
> >     >
> > 
> >     Run job writes the ring buffer and moves the tail as the first step
> >     (no lock required). Next it needs to tell the GuC the xe_engine LRC
> >     tail has moved; this is done over a single Host to GuC channel, which
> >     is a circular buffer, with writes to the channel protected by the
> >     mutex. There are a few more nuances too, but in practice there is
> >     always space in the channel, so the time the mutex needs to be held
> >     is really, really small (check cached credits, write 3 dwords in
> >     payload, write 1 dword to move the tail). I also believe mutexes in
> >     Linux are hybrid where they spin for a little bit before sleeping,
> >     and certainly if there is space in the channel we shouldn't sleep on
> >     mutex contention.
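> > 
> >     Roughly, in pseudo-C (all names illustrative, not the actual Xe code):
> > 
> >         static void guc_engine_run_job(struct xe_sched_job *job)
> >         {
> >                 struct xe_engine *e = job->engine;
> >                 struct guc_ct *ct = &e->guc->ct;
> > 
> >                 /* step 1: lock free, only this scheduler touches this LRC */
> >                 write_ring(e->lrc, job->batch_addr);
> >                 advance_lrc_tail(e->lrc, job->size);
> > 
> >                 /* step 2: tell the GuC the LRC tail moved, via the shared
> >                  * circular CT buffer; the lock is held for ~4 dwords */
> >                 mutex_lock(&ct->lock);
> >                 check_cached_credits(ct);
> >                 write_h2g(ct, e->guc_id, lrc_tail(e->lrc)); /* 3 dwords */
> >                 move_ct_tail(ct);                           /* 1 dword */
> >                 mutex_unlock(&ct->lock);
> >         }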
> > 
> > 
> > Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we
> > had some way to batch things up a bit so we only have to poke the GuC
> > channel once for every batch of things rather than once per job.  That's
> > maybe something we can look into as a future improvement; not
> > fundamental.
> > 
> > Generally, though, it sounds like contention could be a real problem if
> > we end up ping-ponging that lock between cores.  It's going to depend on
> > how much work it takes to get the next ready thing vs. the cost of that
> > atomic.  But, also, anything we do is going to potentially run into
> > contention problems.  *shrug*  If we were going to go for
> > one-per-HW-engine, we may as well go one-per-device and then we wouldn't
> > need the lock.  Off the top of my head, that doesn't sound great either
> > but IDK.
> > 
> >     As far as this being horrible, well, I didn't design the GuC and this
> >     is how it is implemented for KMD based submission. We also have 256
> >     doorbells so we wouldn't need a lock, but I think there are other
> >     issues with that design too which need to be worked out in the
> >     Xe2 / Xe3 timeframe.
> > 
> > 
> > Yeah, not blaming you.  Just surprised, that's all.  How does it work
> > for userspace submission?  What would it look like if the kernel
> > emulated userspace submission?  Is that even possible?
> > 
> > What are these doorbell things?  How do they play into it?
> Basically a bank of MMIO space reserved per 'entity' where a write to that
> MMIO space becomes a named interrupt to GuC. You can assign each doorbell
> to a specific GuC context. So writing to that doorbell address is
> effectively the same as sending a SCHEDULE_CONTEXT H2G message from the KMD
> for that context. But the advantage is you can ring the doorbell from user
> land with no call into the kernel at all. Or, from within the kernel, you
> can do it without needing any locks at all. Problem is, we have 64K
> contexts in GuC but only 256 doorbells in the hardware. Fewer if using
> SRIOV. So the "per 'entity'" part becomes somewhat questionable as to
> exactly what the 'entity' is. And hence we just haven't bothered supporting
> them in Linux because a) there is no direct submission from user land yet,
> and b) as Matthew says, the entire chain of IOCTL from UMD to kernel to
> acquiring a lock and sending the H2G has generally been fast enough. The
> latency only becomes an issue for ULLS people, but for them even the
> doorbells from user space are too high a latency, because that still
> potentially involves the GuC having to do some scheduling and context
> switch type action.
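> 
> A minimal sketch of the idea (hypothetical names; the real doorbell layout
> and GuC programming interface differ):
> 
>     #include <linux/io.h>
> 
>     /* Ring a doorbell: a single MMIO write, no locks, no H2G message.
>      * db_page is the doorbell page assigned to this GuC context. */
>     static void guc_doorbell_ring(void __iomem *db_page)
>     {
>             writel(1, db_page); /* GuC treats it like SCHEDULE_CONTEXT */
>     }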
> 
> John.
> 

I talked with Jason on IRC last week about doorbells, and after chatting
we came up with the idea of allocating the doorbells with a greedy
algorithm, which results in the first 256 xe_engines each getting their
own doorbell, thus avoiding contention on the CT channel / lock (this is
still KMD submission).

Coded up a prototype for this, and initial test results of
xe_exec_threads /w 245 user xe_engines, 5 threads, and 40k total execs
are an average of .824s /w doorbells vs. .923s w/o. Or in other words,
49714 execs per second /w doorbells vs. 44353 without. This seems to
indicate using doorbells can provide a performance improvement. Also,
Jason and I reasoned we should be able to use doorbells 99% of the time,
aside from maybe some wacky media use cases. I also plan on following up
with the media UMD to see if we can get them to use fewer xe_engines.
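
A rough sketch of the greedy allocation idea (hypothetical names, not the
actual prototype code):

    #include <linux/idr.h>

    static DEFINE_IDA(xe_guc_db_ida);

    /* First come, first served: the first 256 xe_engines each get a
     * dedicated doorbell; everything beyond that falls back to the
     * mutex-protected CT channel. */
    static int xe_engine_alloc_doorbell(struct xe_engine *e)
    {
            int id = ida_alloc_max(&xe_guc_db_ida, 255, GFP_KERNEL);

            if (id < 0)
                    return id;      /* -ENOSPC: use the CT path */
            e->doorbell_id = id;
            return 0;
    }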

Matt

> 
> >     Also, if you see my follow up response, Xe is ~33k execs per second
> >     with the current implementation on an 8 core (or maybe 8 thread) TGL,
> >     which seems fine to me.
> > 
> > 
> > 33k exec/sec is about 500/frame which should be fine. 500 is a lot for a
> > single frame.  I typically tell game devs to shoot for dozens per
> > frame.  The important thing is that it stays low even with hundreds of
> > memory objects bound. (Xe should be just fine there.)
> > 
> > --Jason
> > 
> >     Matt
> > 
> >     > --Jason
> >     >
> >     >
> >     > (And yes this backend contention is there regardless of 1:1:1,
> >     it would
> >     > > require a different re-design to solve that. But it is just a
> >     question of
> >     > > whether there are 144 contending threads, or just 6 with the
> >     thread per
> >     > > engine class scheme.)
> >     > >
> >     > > Then multiply all by 10 for a 4U server use case and you get
> >     1440 worker
> >     > > kthreads, yes 10 more CT locks, but contending on how many CPU
> >     cores?
> >     > > Just so they can grab a timeslice and maybe contend on a mutex
> >     as the
> >     > > next step.
> >     > >
> >     > > This example is where it would hurt on large systems. Imagine
> >     only an
> >     > > even wider media transcode card...
> >     > >
> >     > > Second example is with only a single engine class used (3d
> >     desktop?) but with
> >     > > a bunch of not-runnable jobs queued and waiting on a fence to
> >     signal.
> >     > > Implicit or explicit dependencies doesn't matter. Then the
> >     fence signals
> >     > > and call backs run. N work items get scheduled, but they all
> >     submit to
> >     > > the same HW engine. So we end up with:
> >     > >
> >     > >          /-- wi1 --\
> >     > >         / ..     .. \
> >     > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> >     > >         \ ..    ..  /
> >     > >          \-- wiN --/
> >     > >
> >     > >
> >     > > All that we have achieved is waking up N CPUs to contend on
> >     the same
> >     > > lock and effectively inserting the job into the same single HW
> >     queue. I
> >     > > don't see any positives there.
> >     > >
> >     > > This example I think can particularly hurt small / low power
> >     devices
> >     > > because of needless waking up of many cores for no benefit.
> >     Granted, I
> >     > > don't have a good feel on how common this pattern is in practice.
> >     > >
> >     > > >
> >     > > >     That
> >     > > >     is the number which drives the maximum number of
> >     not-runnable jobs
> >     > > that
> >     > > >     can become runnable at once, and hence spawn that many
> >     work items,
> >     > > and
> >     > > >     in turn unbound worker threads.
> >     > > >
> >     > > >     Several problems there.
> >     > > >
> >     > > >     It is fundamentally pointless to have potentially that
> >     many more
> >     > > >     threads
> >     > > >     than the number of CPU cores - it simply creates a
> >     scheduling storm.
> >     > > >
> >     > > >     Unbound workers have no CPU / cache locality either and
> >     no connection
> >     > > >     with the CPU scheduler to optimize scheduling patterns.
> >     This may
> >     > > matter
> >     > > >     either on large systems or on small ones. Whereas the
> >     current design
> >     > > >     allows for the scheduler to notice that a userspace CPU thread
> >     keeps waking up
> >     > > the
> >     > > >     same drm scheduler kernel thread, and so it can keep
> >     them on the same
> >     > > >     CPU, the unbound workers lose that ability and so a 2nd
> >     CPU might be
> >     > > >     getting woken up from low-power sleep for every submission.
> >     > > >
> >     > > >     Hence, apart from being a bit of an impedance mismatch,
> >     the proposal
> >     > > has
> >     > > >     the potential to change performance and power patterns
> >     on both large
> >     > > >     and small machines.
> >     > > >
> >     > > >
> >     > > > Ok, thanks for explaining the issue you're seeing in more
> >     detail.  Yes,
> >     > > > deferred kwork does appear to mismatch somewhat with what
> >     the scheduler
> >     > > > needs or at least how it's worked in the past.  How much
> >     impact will
> >     > > > that mismatch have?  Unclear.
> >     > > >
> >     > > >      >      >>> Secondly, it probably demands separate
> >     workers (not
> >     > > >     optional),
> >     > > >      >     otherwise
> >     > > >      >      >>> behaviour of shared workqueues has either
> >     the potential
> >     > > to
> >     > > >      >     explode the number of
> >     > > >      >      >>> kernel threads anyway, or add latency.
> >     > > >      >      >>>
> >     > > >      >      >>
> >     > > >      >      >> Right now the system_unbound_wq is used which
> >     does have a
> >     > > >     limit
> >     > > >      >     on the
> >     > > >      >      >> number of threads, right? I do have a FIXME
> >     to allow a
> >     > > >     worker to be
> >     > > >      >      >> passed in similar to TDR.
> >     > > >      >      >>
> >     > > >      >      >> WRT to latency, the 1:1 ratio could actually
> >     have lower
> >     > > >     latency
> >     > > >      >     as 2 GPU
> >     > > >      >      >> schedulers can be pushing jobs into the backend /
> >     > > cleaning up
> >     > > >      >     jobs in
> >     > > >      >      >> parallel.
> >     > > >      >      >>
> >     > > >      >      >
> >     > > >      >      > Thought of one more point here on why in Xe we
> >     > > >     absolutely want
> >     > > >      >     a 1 to
> >     > > >      >      > 1 ratio between entity and scheduler - the way
> >     we implement
> >     > > >      >     timeslicing
> >     > > >      >      > for preempt fences.
> >     > > >      >      >
> >     > > >      >      > Let me try to explain.
> >     > > >      >      >
> >     > > >      >      > Preempt fences are implemented via the generic
> >     messaging
> >     > > >      >     interface [1]
> >     > > >      >      > with suspend / resume messages. If a suspend
> >     message is
> >     > > >     received too
> >     > > >      >      > soon after calling resume (this is per entity)
> >     we simply
> >     > > >     sleep in the
> >     > > >      >      > suspend call thus giving the entity a
> >     timeslice. This
> >     > > >     completely
> >     > > >      >     falls
> >     > > >      >      > apart with a many to 1 relationship as now an
> >     entity
> >     > > >     waiting for a
> >     > > >      >      > timeslice blocks the other entities. Could we
> >     work around
> >     > > >     this,
> >     > > >      >     sure, but
> >     > > >      >      > just another bunch of code we'd have to add in
> >     Xe. Being able to
> >     > > >      >     freely sleep
> >     > > >      >      > in the backend without affecting other entities is
> >     really,
> >     > > really
> >     > > >      >     nice IMO
> >     > > >      >      > and I bet Xe isn't the only driver that is
> >     going to feel
> >     > > >     this way.
> >     > > >      >      >
> >     > > >      >      > Last thing I'll say: regardless of how anyone
> >     feels about
> >     > > >     Xe using
> >     > > >      >     a 1 to
> >     > > >      >      > 1 relationship, this patch IMO makes sense as I
> >     hope we can
> >     > > all
> >     > > >      >     agree a
> >     > > >      >      > workqueue scales better than kthreads.
> >     > > >      >
> >     > > >      >     I don't know for sure what will scale better and
> >     for what use
> >     > > >     case,
> >     > > >      >     combination of CPU cores vs number of GPU engines
> >     to keep
> >     > > >     busy vs other
> >     > > >      >     system activity. But I wager someone is bound to
> >     ask for some
> >     > > >      >     numbers to
> >     > > >      >     make sure the proposal is not negatively affecting
> >     any other
> >     > > drivers.
> >     > > >      >
> >     > > >      >
> >     > > >      > Then let them ask.  Waving your hands vaguely in the
> >     direction of
> >     > > >     the
> >     > > >      > rest of DRM and saying "Uh, someone (not me) might
> >     object" is
> >     > > >     profoundly
> >     > > >      > unhelpful.  Sure, someone might. That's why it's on
> >     dri-devel.
> >     > > >     If you
> >     > > >      > think there's someone in particular who might have a
> >     useful
> >     > > >     opinion on
> >     > > >      > this, throw them in the CC so they don't miss the
> >     e-mail thread.
> >     > > >      >
> >     > > >      > Or are you asking for numbers?  If so, what numbers
> >     are you
> >     > > >     asking for?
> >     > > >
> >     > > >     It was a heads up to the Xe team in case people weren't
> >     appreciating
> >     > > >     how
> >     > > >     the proposed change has the potential to influence power
> >     and performance
> >     > > >     across the board. And nothing in the follow up
> >     discussion made me
> >     > > think
> >     > > >     it was considered so I don't think it was redundant to
> >     raise it.
> >     > > >
> >     > > >     In my experience it is typical that such core changes
> >     come with some
> >     > > >     numbers. Which is in case of drm scheduler is tricky and
> >     probably
> >     > > >     requires explicitly asking everyone to test (rather than
> >     count on
> >     > > >     "don't
> >     > > >     miss the email thread"). Real products can fail to ship
> >     due to ten mW
> >     > > here
> >     > > >     or there. Like suddenly an extra core prevented from
> >     getting into
> >     > > deep
> >     > > >     sleep.
> >     > > >
> >     > > >     If that was "profoundly unhelpful" so be it.
> >     > > >
> >     > > >
> >     > > > With your above explanation, it makes more sense what you're
> >     asking.
> >     > > > It's still not something Matt is likely to be able to
> >     provide on his
> >     > > > own.  We need to tag some other folks and ask them to test
> >     it out.  We
> >     > > > could play around a bit with it on Xe but it's not exactly
> >     production
> >     > > > grade yet and is going to hit this differently from most. 
> >     Likely
> >     > > > candidates are probably AMD and Freedreno.
> >     > >
> >     > > Whoever is setup to check out power and performance would be
> >     good to
> >     > > give it a spin, yes.
> >     > >
> >     > > PS. I don't think I was asking Matt to test with other
> >     devices. To start
> >     > > with I think Xe is a team effort. I was asking for more
> >     background on
> >     > > the design decision since patch 4/20 does not say anything on that
> >     > > angle, nor was it IMO sufficiently addressed later in the thread.
> >     > >
> >     > > >      > Also, If we're talking about a design that might
> >     paint us into an
> >     > > >      > Intel-HW-specific hole, that would be one thing.  But
> >     we're not.
> >     > > >     We're
> >     > > >      > talking about switching which kernel threading/task
> >     mechanism to
> >     > > >     use for
> >     > > >      > what's really a very generic problem.  The core Xe
> >     design works
> >     > > >     without
> >     > > >      > this patch (just with more kthreads).  If we land
> >     this patch or
> >     > > >      > something like it and get it wrong and it causes a
> >     performance
> >     > > >     problem
> >     > > >      > for someone down the line, we can revisit it.
> >     > > >
> >     > > >     For some definition of "it works" - I really wouldn't
> >     suggest
> >     > > >     shipping a
> >     > > >     kthread per user context at any point.
> >     > > >
> >     > > >
> >     > > > You have yet to elaborate on why. What resources is it
> >     consuming that's
> >     > > > going to be a problem? Are you anticipating CPU affinity
> >     problems? Or
> >     > > > does it just seem wasteful?
> >     > >
> >     > > Well I don't know, commit message says the approach does not
> >     scale. :)
> >     > >
> >     > > > I think I largely agree that it's probably
> >     unnecessary/wasteful but
> >     > > > reducing the number of kthreads seems like a tractable
> >     problem to solve
> >     > > > regardless of where we put the gpu_scheduler object.  Is
> >     this the right
> >     > > > solution?  Maybe not.  It was also proposed at one point
> >     that we could
> >     > > > split the scheduler into two pieces: A scheduler which owns
> >     the kthread,
> >     > > > and a back-end which targets some HW ring thing where you
> >     can have
> >     > > > multiple back-ends per scheduler.  That's certainly more
> >     invasive from a
> >     > > > DRM scheduler internal API PoV but would solve the kthread
> >     problem in a
> >     > > > way that's more similar to what we have now.
> >     > > >
> >     > > >      >     In any case that's a low level question caused by
> >     the high
> >     > > >     level design
> >     > > >      >     decision. So I'd think first focus on the high
> >     level - which
> >     > > >     is the 1:1
> >     > > >      >     mapping of entity to scheduler instance proposal.
> >     > > >      >
> >     > > >      >     Fundamentally it will be up to the DRM
> >     maintainers and the
> >     > > >     community to
> >     > > >      >     bless your approach. And it is important to
> >     stress 1:1 is
> >     > > about
> >     > > >      >     userspace contexts, so I believe unlike any other
> >     current
> >     > > >     scheduler
> >     > > >      >     user. And also important to stress this
> >     effectively does not
> >     > > >     make Xe
> >     > > >      >     _really_ use the scheduler that much.
> >     > > >      >
> >     > > >      >
> >     > > >      > I don't think this makes Xe nearly as much of a
> >     one-off as you
> >     > > >     think it
> >     > > >      > does.  I've already told the Asahi team working on
> >     Apple M1/2
> >     > > >     hardware
> >     > > >      > to do it this way and it seems to be a pretty good
> >     mapping for
> >     > > >     them. I
> >     > > >      > believe this is roughly the plan for nouveau as
> >     well.  It's not
> >     > > >     the way
> >     > > >      > it currently works for anyone because most other
> >     groups aren't
> >     > > >     doing FW
> >     > > >      > scheduling yet.  In the world of FW scheduling and
> >     hardware
> >     > > >     designed to
> >     > > >      > support userspace direct-to-FW submit, I think the
> >     design makes
> >     > > >     perfect
> >     > > >      > sense (see below) and I expect we'll see more drivers
> >     move in this
> >     > > >      > direction as those drivers evolve. (AMD is doing some
> >     customish
> >     > > >     thing
> >     > > >      > for how with gpu_scheduler on the front-end somehow.
> >     I've not dug
> >     > > >     into
> >     > > >      > those details.)
> >     > > >      >
> >     > > >      >     I can only offer my opinion, which is that the
> >     two options
> >     > > >     mentioned in
> >     > > >      >     this thread (either improve drm scheduler to cope
> >     with what is
> >     > > >      >     required,
> >     > > >      >     or split up the code so you can use just the parts of
> >     > > >     drm_sched which
> >     > > >      >     you want - which is frontend dependency tracking)
> >     shouldn't
> >     > > be so
> >     > > >      >     readily dismissed, given how I think the idea was
> >     for the new
> >     > > >     driver to
> >     > > >      >     work less in a silo and more in the community
> >     (not do kludges
> >     > > to
> >     > > >      >     workaround stuff because it is thought to be too
> >     hard to
> >     > > >     improve common
> >     > > >      >     code), but fundamentally, "goto previous
> >     paragraph" as far as
> >     > > I am
> >     > > >      >     concerned.
> >     > > >      >
> >     > > >      >
> >     > > >      > Meta comment:  It appears as if you're falling into
> >     the standard
> >     > > >     i915
> >     > > >      > team trap of having an internal discussion about what the
> >     > > community
> >     > > >      > discussion might look like instead of actually having the
> >     > > community
> >     > > >      > discussion.  If you are seriously concerned about
> >     interactions
> >     > > with
> >     > > >      > other drivers or with setting common direction,
> >     the right
> >     > > >     way to
> >     > > >      > do that is to break a patch or two out into a
> >     separate RFC series
> >     > > >     and
> >     > > >      > tag a handful of driver maintainers.  Trying to
> >     predict the
> >     > > >     questions
> >     > > >      > other people might ask is pointless. Cc them and
> >     ask for their
> >     > > >     input
> >     > > >      > instead.
> >     > > >
> >     > > >     I don't follow you here. It's not an internal discussion
> >     - I am
> >     > > raising
> >     > > >     my concerns on the design publicly. I am supposed to
> >     write a patch to
> >     > > >     show something, but am allowed to comment on a RFC series?
> >     > > >
> >     > > >
> >     > > > I may have misread your tone a bit.  It felt a bit like too many
> >     > > > discussions I've had in the past where people are trying to
> >     predict what
> >     > > > others will say instead of just asking them. Reading it
> >     again, I was
> >     > > > probably jumping to conclusions a bit.  Sorry about that.
> >     > >
> >     > > Okay no problem, thanks. In any case we don't have to keep
> >     discussing
> >     > > it, since as I wrote one or two emails ago it is fundamentally on the
> >     > > maintainers and community to ack the approach. I only felt
> >     like RFC did
> >     > > not explain the potential downsides sufficiently so I wanted
> >     to probe
> >     > > that area a bit.
> >     > >
> >     > > >     It is "drm/sched: Convert drm scheduler to use a work
> >     queue rather
> >     > > than
> >     > > >     kthread" which should have Cc-ed _everyone_ who uses drm
> >     scheduler.
> >     > > >
> >     > > >
> >     > > > Yeah, it probably should have.  I think that's mostly what
> >     I've been
> >     > > > trying to say.
> >     > > >
> >     > > >      >
> >     > > >      >     Regards,
> >     > > >      >
> >     > > >      >     Tvrtko
> >     > > >      >
> >     > > >      >     P.S. And as a related side note, there are more
> >     areas where
> >     > > >     drm_sched
> >     > > >      >     could be improved, like for instance priority
> >     handling.
> >     > > >      >     Take a look at msm_submitqueue_create /
> >     > > >     msm_gpu_convert_priority /
> >     > > >      >     get_sched_entity to see how msm works around the
> >     drm_sched
> >     > > >     hardcoded
> >     > > >      >     limit of available priority levels, in order to
> >     avoid having
> >     > > >     to leave a
> >     > > >      >     hw capability unused. I suspect msm would be
> >     happier if they
> >     > > >     could have
> >     > > >      >     all priority levels equal in terms of whether
> >     they apply only
> >     > > >     at the
> >     > > >      >     frontend level or completely throughout the pipeline.
> >     > > >      >
> >     > > >      >      > [1]
> >     > > >      >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     > > >      >      >
> >     > > >      >      >>> What would be interesting to learn is
> >     whether the option
> >     > > of
> >     > > >      >     refactoring
> >     > > >      >      >>> drm_sched to deal with out of order
> >     completion was
> >     > > >     considered
> >     > > >      >     and what were
> >     > > >      >      >>> the conclusions.
> >     > > >      >      >>>
> >     > > >      >      >>
> >     > > >      >      >> I coded this up a while back when trying to
> >     convert the
> >     > > >     i915 to
> >     > > >      >     the DRM
> >     > > >      >      >> scheduler it isn't all that hard either. The
> >     free flow
> >     > > >     control
> >     > > >      >     on the
> >     > > >      >      >> ring (e.g. set job limit == SIZE OF RING /
> >     MAX JOB SIZE)
> >     > > is
> >     > > >      >     really what
> >     > > >      >      >> sold me on the this design.
> >     > > >      >
> >     > > >      >
> >     > > >      > You're not the only one to suggest supporting
> >     out-of-order
> >     > > >     completion.
> >     > > >      > However, it's tricky and breaks a lot of internal
> >     assumptions of
> >     > > the
> >     > > >      > scheduler. It also reduces functionality a bit
> >     because it can no
> >     > > >     longer
> >     > > >      > automatically rate-limit HW/FW queues which are often
> >     > > >     fixed-size.  (Ok,
> >     > > >      > yes, it probably could but it becomes a substantially
> >     harder
> >     > > >     problem.)
> >     > > >      >
> >     > > >      > It also seems like a worse mapping to me.  The goal
> >     here is to
> >     > > turn
> >     > > >      > submissions on a userspace-facing engine/queue into
> >     submissions
> >     > > >     to a FW
> >     > > >      > queue submissions, sorting out any dma_fence
> >     dependencies.  Matt's
> >     > > >      > description of saying this is a 1:1 mapping between
> >     sched/entity
> >     > > >     doesn't
> >     > > >      > tell the whole story. It's a 1:1:1 mapping between
> >     xe_engine,
> >     > > >      > gpu_scheduler, and GuC FW engine. Why make it a
> >     1:something:1
> >     > > >     mapping?
> >     > > >      > Why is that better?
> >     > > >
> >     > > >     As I have stated before, what I think would fit
> >     well for Xe is
> >     > > one
> >     > > >     drm_scheduler per engine class. In specific terms on our
> >     current
> >     > > >     hardware, one drm scheduler instance for render,
> >     compute, blitter,
> >     > > >     video
> >     > > >     and video enhance. Userspace contexts remain scheduler
> >     entities.
> >     > > >
> >     > > >
> >     > > > And this is where we fairly strongly disagree.  More in a bit.
> >     > > >
> >     > > >     That way you avoid the whole kthread/kworker story and
> >     you have it
> >     > > >     actually use the entity picking code in the scheduler,
> >     which may be
> >     > > >     useful when the backend is congested.
> >     > > >
> >     > > >
> >     > > > What back-end congestion are you referring to here?  Running
> >     out of FW
> >     > > > queue IDs?  Something else?
> >     > >
> >     > > CT channel, number of context ids.
> >     > >
> >     > > >
> >     > > >     Yes you have to solve the out of order problem so in my
> >     mind that is
> >     > > >     something to discuss. What the problem actually is (just
> >     TDR?), how
> >     > > >     tricky and why etc.
> >     > > >
> >     > > >     And yes you lose the handy LRCA ring buffer size
> >     management so you'd
> >     > > >     have to make those entities not runnable in some other way.
> >     > > >
> >     > > >     Regarding the argument you raise below - would any of
> >     that make the
> >     > > >     frontend / backend separation worse and why? Do you
> >     think it is less
> >     > > >     natural? If neither is true then all that remains is that it
> >     appears the extra
> >     > > >     work to support out of order completion of entities has been
> >     > > discounted
> >     > > >     in favour of an easy but IMO inelegant option.
> >     > > >
> >     > > >
> >     > > > Broadly speaking, the kernel needs to stop thinking about
> >     GPU scheduling
> >     > > > in terms of scheduling jobs and start thinking in terms of
> >     scheduling
> >     > > > contexts/engines.  There is still some need for scheduling
> >     individual
> >     > > > jobs but that is only for the purpose of delaying them as
> >     needed to
> >     > > > resolve dma_fence dependencies.  Once dependencies are
> >     resolved, they
> >     > > > get shoved onto the context/engine queue and from there the
> >     kernel only
> >     > > > really manages whole contexts/engines.  This is a major
> >     architectural
> >     > > > shift, entirely different from the way i915 scheduling
> >     works.  It's also
> >     > > > different from the historical usage of DRM scheduler which I
> >     think is
> >     > > > why this all looks a bit funny.
> >     > > >
> >     > > > To justify this architectural shift, let's look at where
> >     we're headed.
> >     > > > In the glorious future...
> >     > > >
> >     > > >   1. Userspace submits directly to firmware queues.  The
> >     kernel has no
> >     > > > visibility whatsoever into individual jobs. At most it can
> >     pause/resume
> >     > > > FW contexts as needed to handle eviction and memory management.
> >     > > >
> >     > > >   2. Because of 1, apart from handing out the FW queue IDs
> >     at the
> >     > > > beginning, the kernel can't really juggle them that much. 
> >     Depending on
> >     > > > FW design, it may be able to pause a client, give its IDs to
> >     another,
> >     > > > and then resume it later when IDs free up. What it's not
> >     doing is
> >     > > > juggling IDs on a job-by-job basis like i915 currently is.
> >     > > >
> >     > > >   3. Long-running compute jobs may not complete for days. 
> >     This means
> >     > > > that memory management needs to happen in terms of
> >     pause/resume of
> >     > > > entire contexts/engines using the memory rather than based
> >     on waiting
> >     > > > for individual jobs to complete or pausing individual jobs
> >     until the
> >     > > > memory is available.
> >     > > >
> >     > > >   4. Synchronization happens via userspace memory fences
> >     (UMF) and the
> >     > > > kernel is mostly unaware of most dependencies and when a
> >     context/engine
> >     > > > is or is not runnable.  Instead, it keeps as many of them
> >     minimally
> >     > > > active (memory is available, even if it's in system RAM) as
> >     possible and
> >     > > > lets the FW sort out dependencies.  (There may need to be
> >     some facility
> >     > > > for sleeping a context until a memory change similar to
> >     futex() or
> >     > > > poll() for userspace threads.  There are some details TBD.)
> >     > > >
> >     > > > Are there potential problems that will need to be solved
> >     here?  Yes.  Is
> >     > > > it a good design?  Well, Microsoft has been living in this
> >     future for
> >     > > > half a decade or better and it's working quite well for
> >     them.  It's also
> >     > > > the way all modern game consoles work.  It really is just
> >     Linux that's
> >     > > > stuck with the same old job model we've had since the
> >     monumental shift
> >     > > > to DRI2.
> >     > > >
> >     > > > To that end, one of the core goals of the Xe project was to
> >     make the
> >     > > > driver internally behave as close to the above model as
> >     possible while
> >     > > > keeping the old-school job model as a very thin layer on
> >     top.  As the
> >     > > > broader ecosystem problems (window-system support for UMF,
> >     for instance)
> >     > > > are solved, that layer can be peeled back. The core driver
> >     will already
> >     > > > be ready for it.
> >     > > >
> >     > > > To that end, the point of the DRM scheduler in Xe isn't to
> >     schedule
> >     > > > jobs.  It's to resolve syncobj and dma-buf implicit sync
> >     dependencies
> >     > > > and stuff jobs into their respective context/engine queue
> >     once they're
> >     > > > ready.  All the actual scheduling happens in firmware and
> >     any scheduling
> >     > > > the kernel does to deal with contention, oversubscriptions,
> >     too many
> >     > > > contexts, etc. is between contexts/engines, not individual
> >     jobs.  Sure,
> >     > > > the individual job visibility is nice, but if we design
> >     around it, we'll
> >     > > > never get to the glorious future.
> >     > > >
> >     > > > I really need to turn the above (with a bit more detail)
> >     into a blog
> >     > > > post.... Maybe I'll do that this week.
> >     > > >
> >     > > > In any case, I hope that provides more insight into why Xe
> >     is designed
> >     > > > the way it is and why I'm pushing back so hard on trying to
> >     make it more
> >     > > > of a "classic" driver as far as scheduling is concerned. 
> >     Are there
> >     > > > potential problems here?  Yes, that's why Xe has been labeled a
> >     > > > prototype.  Are such radical changes necessary to get to
> >     said glorious
> >     > > > future?  Yes, I think they are.  Will it be worth it?  I
> >     believe so.
> >     > >
> >     > > Right, that's all solid I think. My takeaway is that frontend
> >     priority
> >     > > sorting and that stuff isn't needed and that is okay. And that
> >     there are
> >     > > multiple options to maybe improve drm scheduler, like the afore-
> >     mentioned
> >     > > making it deal with out of order completion, or splitting it into
> >     functional components,
> >     > > or the frontend/backend split you suggested. For most of them
> >     cost vs
> >     > > benefit is more or less not completely clear, nor how much
> >     effort
> >     > > was invested to look into them.
> >     > >
> >     > > One thing I missed from this explanation is how drm_scheduler
> >     per engine
> >     > > class interferes with the high level concepts. And I did not
> >     manage to
> >     > > pick up on what exactly is the TDR problem in that case. Maybe
> >     the two
> >     > > are one and the same.
> >     > >
> >     > > Bottom line is I still have the concern that conversion to
> >     kworkers has
> >     > > an opportunity to regress. Possibly more opportunity for some
> >     Xe use
> >     > > cases than to affect other vendors, since they would still be
> >     using per
> >     > > physical engine / queue scheduler instances.
> >     > >
> >     > > And to put my money where my mouth is, I will try to put testing Xe
> >     > > inside the full-blown ChromeOS environment into my team's plans.
> >     It would
> >     > > probably also be beneficial if the Xe team could take a look at
> >     real world
> >     > > behaviour of the extreme transcode use cases too. If the stack
> >     is ready
> >     > > for that and all. It would be better to know earlier rather
> >     than later
> >     > > if there is a fundamental issue.
> >     > >
> >     > > For the patch at hand, and the cover letter, it certainly
> >     feels like it would
> >     > > benefit from recording the past design discussions had with AMD
> >     folks, to
> >     > > explicitly copy other drivers, and to record the theoretical
> >     pros and
> >     > > cons of threads vs unbound workers as I have tried to
> >     highlight them.
> >     > >
> >     > > Regards,
> >     > >
> >     > > Tvrtko
> >     > >
> > 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2022-12-22 22:21 ` [Intel-gfx] " Matthew Brost
@ 2023-02-17 20:51   ` Daniel Vetter
  -1 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-02-17 20:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

Hi all,

[I thought I'd sent this out earlier this week, but alas it got stuck; kinda
bad timing now since I'm out next week, but oh well]

So xe is a quite substantial thing, and I think we need a clear plan for how to
land this or it will take forever, and managers will panic. Also I'm not a big
fan of "Dave/me reviews everything"; we de facto had that for amd's dc/dal and
it was not fun. The idea here is how to get everything reviewed without having
two people end up somewhat arbitrarily as deciders.

I've compiled a bunch of topics on what I think the important areas are. First,
code that should be consistent across new-style render drivers that are aimed at
vk/compute userspace as the primary feature:

- figure out consensus solution for fw scheduler and drm/sched frontend among
  interested driver parties (probably xe, amdgpu, nouveau, new panfrost)

- for the interface itself it might be good to have the drm_gpu_scheduler as the
  single per-hw-engine driver api object (but internally a new structure), while
  renaming the current drm_gpu_scheduler to drm_gpu_sched_internal. That way I
  think we can address the main critique of the current xe scheduler plan
  - keep the drm_gpu_sched_internal : drm_sched_entity 1:1 relationship for fw
    scheduler
  - keep the driver api relationship of drm_gpu_scheduler : drm_sched_entity
    1:n, the api functions simply iterate over a mutex-protected list of
    internal schedulers. This should also help drivers with locking mistakes
    around setup/teardown and gpu reset (rough structural sketch below).
  - drivers select with a flag or something between the current mode (where the
    drm_gpu_sched_internal is attached to the drm_gpu_scheduler api object) or
    the new fw scheduler mode (where drm_gpu_sched_internal is attached to the
    drm_sched_entity)
  - overall still no fundamental changes (like the current patches) to drm/sched
    data structures and algorithms. But unlike the current patches we keep the
    possibility open for eventual refactoring without having to again refactor
    all the drivers. Even better, we can delay such refactoring until we have a
    handful of real-word drivers test-driving this all so we know we actually do
    the right thing. This should allow us to address all the
    fairness/efficiency/whatever concerns that have been floating around without
    having to fix them all up upfront, before we actually know what needs to be
    fixed.
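
A rough structural sketch of the above (hypothetical names and fields, just to
illustrate the 1:1 vs 1:n wiring, not a worked-out proposal):

    /* driver-facing api object, one per hw engine */
    struct drm_gpu_scheduler {
            struct mutex            lock;           /* protects sched_list */
            struct list_head        sched_list;     /* drm_gpu_sched_internal */
            bool                    fw_sched_mode;  /* flag chosen by driver */
    };

    /* today's drm_gpu_scheduler, renamed; owns the actual scheduling */
    struct drm_gpu_sched_internal {
            struct list_head        link;           /* on sched_list above */
            /* ... current scheduler internals ... */
    };

    struct drm_sched_entity {
            /* fw sched mode: 1:1, the entity owns its internal scheduler */
            struct drm_gpu_sched_internal *internal;
    };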

- the generic scheduler code should also include the handling of endless
  compute contexts, with the minimal scaffolding for preempt-ctx fences
  (probably on the drm_sched_entity) and making sure drm/sched can cope with
  the lack of a job completion fence (minimal sketch below). This is a very
  small amount of code, but it helps a lot for cross-driver review if this
  works the same (with the same locking and all that) for everyone. Ideally
  this gets extracted from amdkfd, but as long as it's going to be used by all
  drivers supporting endless/compute contexts going forward it's good enough.
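
  A minimal sketch of such scaffolding (hypothetical; assumes a dma_fence that
  only signals once the context is fully preempted off the hw):

      struct drm_sched_preempt_fence {
              struct dma_fence        base;
              spinlock_t              lock;     /* protects base */
              struct drm_sched_entity *entity;  /* context being preempted */
      };

      /* the driver calls this once the fw confirms the context is idle */
      static void drm_sched_preempt_fence_done(struct drm_sched_preempt_fence *f)
      {
              dma_fence_signal(&f->base);
      }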

- I'm assuming this also means Matt Brost will include a patch to add himself as
  drm/sched reviewer in MAINTAINERS, or at least something like that

- adopt the gem_exec/vma helpers. again we probably want consensus here among
  the same driver projects. I don't care whether these helpers specify the ioctl
  structs or not, but they absolutely need to enforce the overall locking scheme
  for all major structs and lists (so vm and vma); rough sketch below.
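
  Hypothetical shape of such an enforced locking scheme (illustrative names,
  not an actual helper api):

      struct drm_gem_vm {
              struct mutex     lock;      /* protects vma_list and all vmas */
              struct list_head vma_list;
      };

      struct drm_gem_vma {
              struct list_head link;      /* on drm_gem_vm.vma_list */
      };

      /* the helper takes the vm lock itself, so drivers can't get it wrong */
      static void drm_gem_vm_add_vma(struct drm_gem_vm *vm,
                                     struct drm_gem_vma *vma)
      {
              mutex_lock(&vm->lock);
              list_add_tail(&vma->link, &vm->vma_list);
              mutex_unlock(&vm->lock);
      }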

- we also should have cross-driver consensus on async vm_bind support. I think
  everyone added in-syncobj support; the real fun is probably more the in/out
  userspace memory fences (and personally I'm still not sure that's a good idea,
  but ... *eh*). I think cross-driver consensus on how this should work (ideally
  with helper support so people don't get it wrong in all the possible ways)
  would be best; an illustrative sketch below.
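
  Purely illustrative shape of an async vm_bind ioctl with in/out syncs
  (hypothetical struct, not the actual Xe uapi):

      struct drm_vm_bind {
              __u64 addr;          /* GPU VA start of the (un)bind */
              __u64 range;         /* size of the mapping */
              __u64 obj_offset;    /* offset into the object */
              __u64 in_syncs;      /* userptr to array of syncobj handles */
              __u32 obj_handle;    /* GEM handle to map, 0 for unbind */
              __u32 num_in_syncs;  /* number of in-syncobjs to wait on */
              __u32 out_sync;      /* syncobj signaled when the bind completes */
              __u32 flags;         /* e.g. async vs. immediate */
      };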

- this also means some userptr integration and some consensus on how userptr
  should work for vm_bind across drivers. I don't think allowing drivers to reinvent
  that wheel is a bright idea, there's just a bit too much to get wrong here.

- for some of these the consensus might land on more/less shared code than what
  I sketched out above; the important part really is that we have consensus on
  these. Kinda similar to how the atomic kms infrastructure moved a _lot_ more of
  the code back into drivers, because they really just needed the flexibility to
  program the hw correctly. Right now we definitely don't have enough shared
  code, for sure with i915-gem, but we also need to make sure we're not
  overcorrecting too badly (a bit of overcorrecting generally doesn't hurt).

All the above will make sure that the driver overall is aligned in concepts and
design with the overall community direction, but I think it'd still be good if
someone outside of the intel gpu group reviews the driver code itself. Last time
we had a huge driver submission (amd's DC/DAL) this fell on Dave&me, but this
time around I think we have a perfect candidate with Oded:

- Oded needs/wants to spend some time on ramping up on how drm render drivers
  work anyway, and xe is probably the best example of a driver that's both
  supposed to be full-featured, but also doesn't contain an entire display
  driver on the side.

- Oded is in Habana, which is legally part of Intel. Bean counter budget
  shuffling to make this happen should be possible.

- Habana is still a fairly distinct entity within Intel, so that is probably the
  best approach for some independent review, without making the xe team
  beholden to some non-Intel people.

The above should yield a pretty clear road towards landing xe, without any
big review fights with Dave/me like we had with amd's DC/DAL, which took a
rather long time to land unfortunately :-(

These are just my thoughts, let the bikeshed commence!

Ideally we put them all into a TODO like we've done for DC/DAL, once we have
some consensus.

Cheers, Daniel

On Thu, Dec 22, 2022 at 02:21:07PM -0800, Matthew Brost wrote:
> Hello,
> 
> This is a submission for Xe, a new driver for Intel GPUs that supports both
> integrated and discrete platforms starting with Tiger Lake (first platform with
> Intel Xe Architecture). The intention of this new driver is to have a fresh base
> to work from that is unencumbered by older platforms, whilst also taking the
> opportunity to rearchitect our driver to increase sharing across the drm
> subsystem, both leveraging and allowing us to contribute more towards other
> shared components like TTM and drm/scheduler. The memory model is based on VM
> bind which is similar to the i915 implementation. Likewise the execbuf
> implementation for Xe is very similar to execbuf3 in the i915 [1].
> 
> The code is at a stage where it is already functional and has experimental
> support for multiple platforms starting from Tiger Lake, with initial support
> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> implementation will be released publicly early next year. We also have a suite
> of IGTs for XE that will appear on the IGT list shortly.
> 
> It has been built with the assumption of supporting multiple architectures from
> the get-go, right now with tests running both on X86 and ARM hosts. And we
> intend to continue working on it and improving on it as part of the kernel
> community upstream.
> 
> The new Xe driver leverages a lot from i915 and work on i915 continues as we
> ready Xe for production throughout 2023.
> 
> As for display, the intent is to share the display code with the i915 driver so
> that there is maximum reuse there. Currently this is being done by compiling the
> display code twice, but alternatives to that are under consideration and we want
> to have more discussion on what the best final solution will look like over the
> next few months. Right now, work is ongoing in refactoring the display codebase
> to remove as much as possible any unnecessary dependencies on i915 specific data
> structures there..
> 
> We currently have 2 submission backends, execlists and GuC. The execlist is
> meant mostly for testing and is not fully functional while GuC backend is fully
> functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> required and should be placed in /lib/firmware/xe.
> 
> The GuC firmware can be found in the below location:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
> 
> The easiest way to setup firmware is:
> cp -r /lib/firmware/i915 /lib/firmware/xe
> 
> The code has been organized such that we have all patches that touch areas
> outside of drm/xe first for review, and then the actual new driver in a separate
> commit. The code which is outside of drm/xe is included in this RFC while
> drm/xe is not due to the size of the commit. The drm/xe is code is available in
> a public repo listed below.
> 
> Xe driver commit:
> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
> 
> Xe kernel repo:
> https://cgit.freedesktop.org/drm/drm-xe/
> 
> There's a lot of work still to happen on Xe but we're very excited about it and
> wanted to share it early and welcome feedback and discussion.
> 
> Cheers,
> Matthew Brost
> 
> [1] https://patchwork.freedesktop.org/series/105879/
> [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20418
> 
> Maarten Lankhorst (12):
>   drm/amd: Convert amdgpu to use suballocation helper.
>   drm/radeon: Use the drm suballocation manager implementation.
>   drm/i915: Remove gem and overlay frontbuffer tracking
>   drm/i915/display: Neuter frontbuffer tracking harder
>   drm/i915/display: Add more macros to remove all direct calls to uncore
>   drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
>   drm/i915: Rename find_section to find_bdb_section
>   drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
>   drm/i915/display: Fix a use-after-free when intel_edp_init_connector
>     fails
>   drm/i915/display: Remaining changes to make xe compile
>   sound/hda: Allow XE as i915 replacement for sound
>   mei/hdcp: Also enable for XE
> 
> Matthew Brost (5):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
> 
> Thomas Hellström (3):
>   drm/suballoc: Introduce a generic suballocation manager
>   drm: Add a gpu page-table walker helper
>   drm/ttm: Don't print error message if eviction was interrupted
> 
>  drivers/gpu/drm/Kconfig                       |   5 +
>  drivers/gpu/drm/Makefile                      |   4 +
>  drivers/gpu/drm/amd/amdgpu/Kconfig            |   1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  26 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  12 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        |   5 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  23 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c        | 320 +-----------------
>  drivers/gpu/drm/drm_pt_walk.c                 | 159 +++++++++
>  drivers/gpu/drm/drm_suballoc.c                | 301 ++++++++++++++++
>  drivers/gpu/drm/i915/Makefile                 |   2 +-
>  drivers/gpu/drm/i915/display/hsw_ips.c        |   7 +-
>  drivers/gpu/drm/i915/display/i9xx_plane.c     |   1 +
>  drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
>  .../gpu/drm/i915/display/intel_atomic_plane.c |  25 +-
>  .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
>  drivers/gpu/drm/i915/display/intel_bios.c     |  71 ++--
>  drivers/gpu/drm/i915/display/intel_bw.c       |  36 +-
>  drivers/gpu/drm/i915/display/intel_cdclk.c    |  68 ++--
>  drivers/gpu/drm/i915/display/intel_color.c    |   1 +
>  drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
>  drivers/gpu/drm/i915/display/intel_cursor.c   |  14 +-
>  drivers/gpu/drm/i915/display/intel_de.h       |  38 +++
>  drivers/gpu/drm/i915/display/intel_display.c  | 155 +++++++--
>  drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
>  .../gpu/drm/i915/display/intel_display_core.h |   5 +-
>  .../drm/i915/display/intel_display_debugfs.c  |   8 +
>  .../drm/i915/display/intel_display_power.c    |  40 ++-
>  .../drm/i915/display/intel_display_power.h    |   6 +
>  .../i915/display/intel_display_power_map.c    |   7 +
>  .../i915/display/intel_display_power_well.c   |  24 +-
>  .../drm/i915/display/intel_display_reg_defs.h |   4 +
>  .../drm/i915/display/intel_display_trace.h    |   6 +
>  .../drm/i915/display/intel_display_types.h    |  32 +-
>  drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
>  drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
>  drivers/gpu/drm/i915/display/intel_dp_aux.c   |   6 +
>  drivers/gpu/drm/i915/display/intel_dpio_phy.c |   9 +-
>  drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 +
>  drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
>  drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
>  drivers/gpu/drm/i915/display/intel_drrs.c     |   1 +
>  drivers/gpu/drm/i915/display/intel_dsb.c      | 124 +++++--
>  drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 +-
>  drivers/gpu/drm/i915/display/intel_fb.c       | 108 ++++--
>  drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
>  drivers/gpu/drm/i915/display/intel_fbc.c      |  49 ++-
>  drivers/gpu/drm/i915/display/intel_fbdev.c    | 108 +++++-
>  .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 +-----
>  .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +---
>  drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
>  drivers/gpu/drm/i915/display/intel_hdcp.c     |   9 +-
>  drivers/gpu/drm/i915/display/intel_hdmi.c     |   1 -
>  .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
>  .../drm/i915/display/intel_modeset_setup.c    |  11 +-
>  drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
>  drivers/gpu/drm/i915/display/intel_overlay.c  |  14 -
>  .../gpu/drm/i915/display/intel_pch_display.h  |  16 +
>  .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
>  drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
>  .../drm/i915/display/intel_plane_initial.c    |   3 +-
>  drivers/gpu/drm/i915/display/intel_psr.c      |   1 +
>  drivers/gpu/drm/i915/display/intel_sprite.c   |  21 ++
>  drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
>  drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
>  drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
>  .../drm/i915/display/skl_universal_plane.c    |  52 ++-
>  drivers/gpu/drm/i915/display/skl_watermark.c  |  25 +-
>  drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   4 -
>  drivers/gpu/drm/i915/gem/i915_gem_domain.c    |   7 -
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 -
>  drivers/gpu/drm/i915/gem/i915_gem_object.c    |  25 --
>  drivers/gpu/drm/i915/gem/i915_gem_object.h    |  22 --
>  drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   4 -
>  drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
>  drivers/gpu/drm/i915/i915_driver.c            |   1 +
>  drivers/gpu/drm/i915/i915_gem.c               |   8 -
>  drivers/gpu/drm/i915/i915_gem_gtt.c           |   1 -
>  drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
>  drivers/gpu/drm/i915/i915_vma.c               |  12 -
>  drivers/gpu/drm/radeon/radeon.h               |  55 +--
>  drivers/gpu/drm/radeon/radeon_ib.c            |  12 +-
>  drivers/gpu/drm/radeon/radeon_object.h        |  25 +-
>  drivers/gpu/drm/radeon/radeon_sa.c            | 314 ++---------------
>  drivers/gpu/drm/radeon/radeon_semaphore.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c        | 182 +++++++---
>  drivers/gpu/drm/ttm/ttm_bo.c                  |   3 +-
>  drivers/misc/mei/hdcp/Kconfig                 |   2 +-
>  drivers/misc/mei/hdcp/mei_hdcp.c              |   3 +-
>  include/drm/drm_pt_walk.h                     | 161 +++++++++
>  include/drm/drm_suballoc.h                    | 112 ++++++
>  include/drm/gpu_scheduler.h                   |  41 ++-
>  sound/hda/hdac_i915.c                         |  17 +-
>  sound/pci/hda/hda_intel.c                     |  56 +--
>  sound/soc/intel/avs/core.c                    |  13 +-
>  sound/soc/sof/intel/hda.c                     |   7 +-
>  98 files changed, 2076 insertions(+), 1325 deletions(-)
>  create mode 100644 drivers/gpu/drm/drm_pt_walk.c
>  create mode 100644 drivers/gpu/drm/drm_suballoc.c
>  create mode 100644 include/drm/drm_pt_walk.h
>  create mode 100644 include/drm/drm_suballoc.h
> 
> -- 
> 2.37.3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
@ 2023-02-17 20:51   ` Daniel Vetter
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-02-17 20:51 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-gfx, dri-devel

Hi all,

[I thought I'd sent this out earlier this week, but alas it got stuck; kinda
bad timing now since I'm out next week, but oh well]

So xe is a quite substantial thing, and I think we need a clear plan for how to
land this or it will take forever, and managers will panic. Also I'm not a big
fan of "Dave/me reviews everything"; we de facto had that for amd's dc/dal and
it was not fun. The idea here is how to get everything reviewed without having
two people end up somewhat arbitrarily as deciders.

I've compiled a bunch of topics on what I think the important areas are. First,
code that should be consistent across new-style render drivers that are aimed at
vk/compute userspace as the primary feature:

- figure out consensus solution for fw scheduler and drm/sched frontend among
  interested driver parties (probably xe, amdgpu, nouveau, new panfrost)

- for the interface itself it might be good to have the drm_gpu_scheduler as the
  single per-hw-engine driver api object (but internally a new structure), while
  renaming the current drm_gpu_scheduler to drm_gpu_sched_internal. That way I
  think we can address the main critique of the current xe scheduler plan
  - keep the drm_gpu_sched_internal : drm_sched_entity 1:1 relationship for fw
    scheduler
  - keep the driver api relationship of drm_gpu_scheduler : drm_sched_entity
    1:n, the api functions simply iterate over a mutex-protected list of
    internal schedulers. This should also help drivers with locking mistakes
    around setup/teardown and gpu reset.
  - drivers select with a flag or something between the current mode (where the
    drm_gpu_sched_internal is attached to the drm_gpu_scheduler api object) or
    the new fw scheduler mode (where drm_gpu_sched_internal is attached to the
    drm_sched_entity)
  - overall still no fundamental changes (like the current patches) to drm/sched
    data structures and algorithms. But unlike the current patches we keep the
    possibility open for eventual refactoring without having to again refactor
    all the drivers. Even better, we can delay such refactoring until we have a
    handful of real-word drivers test-driving this all so we know we actually do
    the right thing. This should allow us to address all the
    fairness/efficiency/whatever concerns that have been floating around without
    having to fix them all up upfront, before we actually know what needs to be
    fixed.

- the generic scheduler code should also including the handling of endless
  compute contexts, with the minimal scaffolding for preempt-ctx fences
  (probably on the drm_sched_entity) and making sure drm/sched can cope with the
  lack of job completion fence. This is very minimal amounts of code, but it
  helps a lot for cross-driver review if this works the same (with the same
  locking and all that) for everyone. Ideally this gets extracted from amdkfd,
  but as long as it's going to be used by all drivers supporting
  endless/compute context going forward it's good enough.

- I'm assuming this also means Matt Brost will include a patch to add himself as
  drm/sched reviewer in MAINTAINERS, or at least something like that

- adopt the gem_exec/vma helpers. again we probably want consensus here among
  the same driver projects. I don't care whether these helpers specify the ioctl
  structs or not, but they absolutely need to enforce the overall locking scheme
  for all major structs and list (so vm and vma).

- we also should have cross-driver consensus on async vm_bind support. I think
  everyone added in-syncobj support, the real fun is probably more in/out
  userspace memory fences (and personally I'm still not sure that's a good idea
  but ... *eh*). I think cross driver consensus on how this should work (ideally
  with helper support so people don't get it wrong in all the possible ways)
  would be best.

- this also means some userptr integration and some consensus how userptr should
  work for vm_bind across drivers. I don't think allowing drivers to reinvent
  that wheel is a bright idea, there's just a bit too much to get wrong here.

- for some of these the consensus might land on more/less shared code than what
  I sketched out above, the important part really is that we have consensus on
  these. Kinda similar to how the atomic kms infrastructure move a _lot_ more of
  the code back into drivers, because they really just needed the flexibility to
  program the hw correctly. Right now we definitely don't have enough shared
  code, for sure with i915-gem, but we also need to make sure we're not
  overcorrecting too badly (a bit of overcorrecting generally doesn't hurt).

All the above will make sure that the driver overall is in concepts and design
aligned with the overall community direction, but I think it'd still be good if
someone outside of the intel gpu group reviews the driver code itself. Last time
we had a huge driver submission (amd's DC/DAL) this fell on Dave&me, but this
time around I think we have a perfect candidate with Oded:

- Oded needs/wants to spend some time on ramping up on how drm render drivers
  work anyway, and xe is probably the best example of a driver that's both
  supposed to be full-featured, but also doesn't contain an entire display
  driver on the side.

- Oded is in Habana, which is legally part of Intel. Bean counter budget
  shuffling to make this happen should be possible.

- Habana is still fairly distinct entity within Intel, so that is probably the
  best approach for some independent review, without making the xe team
  beholden to some non-Intel people.

The above should yield some pretty clear road towards landing xe, without any
big review fights with Dave/me like we had with amd's DC/DAL, which took a
rather long time to land unfortunately :-(

These are just my thoughts, let the bikeshed commence!

Ideally we put them all into a TODO like we've done for DC/DAL, once we have
some consensus.

Cheers, Daniel

On Thu, Dec 22, 2022 at 02:21:07PM -0800, Matthew Brost wrote:
> Hello,
> 
> This is a submission for Xe, a new driver for Intel GPUs that supports both
> integrated and discrete platforms starting with Tiger Lake (first platform with
> Intel Xe Architecture). The intention of this new driver is to have a fresh base
> to work from that is unencumbered by older platforms, whilst also taking the
> opportunity to rearchitect our driver to increase sharing across the drm
> subsystem, both leveraging and allowing us to contribute more towards other
> shared components like TTM and drm/scheduler. The memory model is based on VM
> bind which is similar to the i915 implementation. Likewise the execbuf
> implementation for Xe is very similar to execbuf3 in the i915 [1].
> 
> The code is at a stage where it is already functional and has experimental
> support for multiple platforms starting from Tiger Lake, with initial support
> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> implementation will be released publicly early next year. We also have a suite
> of IGTs for XE that will appear on the IGT list shortly.
> 
> It has been built with the assumption of supporting multiple architectures from
> the get-go, right now with tests running both on X86 and ARM hosts. And we
> intend to continue working on it and improving on it as part of the kernel
> community upstream.
> 
> The new Xe driver leverages a lot from i915 and work on i915 continues as we
> ready Xe for production throughout 2023.
> 
> As for display, the intent is to share the display code with the i915 driver so
> that there is maximum reuse there. Currently this is being done by compiling the
> display code twice, but alternatives to that are under consideration and we want
> to have more discussion on what the best final solution will look like over the
> next few months. Right now, work is ongoing in refactoring the display codebase
> to remove as much as possible any unnecessary dependencies on i915 specific data
> structures there..
> 
> We currently have 2 submission backends, execlists and GuC. The execlist is
> meant mostly for testing and is not fully functional while GuC backend is fully
> functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> required and should be placed in /lib/firmware/xe.
> 
> The GuC firmware can be found in the below location:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
> 
> The easiest way to setup firmware is:
> cp -r /lib/firmware/i915 /lib/firmware/xe
> 
> The code has been organized such that we have all patches that touch areas
> outside of drm/xe first for review, and then the actual new driver in a separate
> commit. The code which is outside of drm/xe is included in this RFC while
> drm/xe is not due to the size of the commit. The drm/xe is code is available in
> a public repo listed below.
> 
> Xe driver commit:
> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
> 
> Xe kernel repo:
> https://cgit.freedesktop.org/drm/drm-xe/
> 
> There's a lot of work still to happen on Xe but we're very excited about it and
> wanted to share it early and welcome feedback and discussion.
> 
> Cheers,
> Matthew Brost
> 
> [1] https://patchwork.freedesktop.org/series/105879/
> [2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20418
> 
> Maarten Lankhorst (12):
>   drm/amd: Convert amdgpu to use suballocation helper.
>   drm/radeon: Use the drm suballocation manager implementation.
>   drm/i915: Remove gem and overlay frontbuffer tracking
>   drm/i915/display: Neuter frontbuffer tracking harder
>   drm/i915/display: Add more macros to remove all direct calls to uncore
>   drm/i915/display: Remove all uncore mmio accesses in favor of intel_de
>   drm/i915: Rename find_section to find_bdb_section
>   drm/i915/regs: Set DISPLAY_MMIO_BASE to 0 for xe
>   drm/i915/display: Fix a use-after-free when intel_edp_init_connector
>     fails
>   drm/i915/display: Remaining changes to make xe compile
>   sound/hda: Allow XE as i915 replacement for sound
>   mei/hdcp: Also enable for XE
> 
> Matthew Brost (5):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
> 
> Thomas Hellström (3):
>   drm/suballoc: Introduce a generic suballocation manager
>   drm: Add a gpu page-table walker helper
>   drm/ttm: Don't print error message if eviction was interrupted
> 
>  drivers/gpu/drm/Kconfig                       |   5 +
>  drivers/gpu/drm/Makefile                      |   4 +
>  drivers/gpu/drm/amd/amdgpu/Kconfig            |   1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  26 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  12 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c        |   5 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  23 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c        | 320 +-----------------
>  drivers/gpu/drm/drm_pt_walk.c                 | 159 +++++++++
>  drivers/gpu/drm/drm_suballoc.c                | 301 ++++++++++++++++
>  drivers/gpu/drm/i915/Makefile                 |   2 +-
>  drivers/gpu/drm/i915/display/hsw_ips.c        |   7 +-
>  drivers/gpu/drm/i915/display/i9xx_plane.c     |   1 +
>  drivers/gpu/drm/i915/display/intel_atomic.c   |   2 +
>  .../gpu/drm/i915/display/intel_atomic_plane.c |  25 +-
>  .../gpu/drm/i915/display/intel_backlight.c    |   2 +-
>  drivers/gpu/drm/i915/display/intel_bios.c     |  71 ++--
>  drivers/gpu/drm/i915/display/intel_bw.c       |  36 +-
>  drivers/gpu/drm/i915/display/intel_cdclk.c    |  68 ++--
>  drivers/gpu/drm/i915/display/intel_color.c    |   1 +
>  drivers/gpu/drm/i915/display/intel_crtc.c     |  14 +-
>  drivers/gpu/drm/i915/display/intel_cursor.c   |  14 +-
>  drivers/gpu/drm/i915/display/intel_de.h       |  38 +++
>  drivers/gpu/drm/i915/display/intel_display.c  | 155 +++++++--
>  drivers/gpu/drm/i915/display/intel_display.h  |   9 +-
>  .../gpu/drm/i915/display/intel_display_core.h |   5 +-
>  .../drm/i915/display/intel_display_debugfs.c  |   8 +
>  .../drm/i915/display/intel_display_power.c    |  40 ++-
>  .../drm/i915/display/intel_display_power.h    |   6 +
>  .../i915/display/intel_display_power_map.c    |   7 +
>  .../i915/display/intel_display_power_well.c   |  24 +-
>  .../drm/i915/display/intel_display_reg_defs.h |   4 +
>  .../drm/i915/display/intel_display_trace.h    |   6 +
>  .../drm/i915/display/intel_display_types.h    |  32 +-
>  drivers/gpu/drm/i915/display/intel_dmc.c      |  17 +-
>  drivers/gpu/drm/i915/display/intel_dp.c       |  11 +-
>  drivers/gpu/drm/i915/display/intel_dp_aux.c   |   6 +
>  drivers/gpu/drm/i915/display/intel_dpio_phy.c |   9 +-
>  drivers/gpu/drm/i915/display/intel_dpio_phy.h |  15 +
>  drivers/gpu/drm/i915/display/intel_dpll.c     |   8 +-
>  drivers/gpu/drm/i915/display/intel_dpll_mgr.c |   4 +
>  drivers/gpu/drm/i915/display/intel_drrs.c     |   1 +
>  drivers/gpu/drm/i915/display/intel_dsb.c      | 124 +++++--
>  drivers/gpu/drm/i915/display/intel_dsi_vbt.c  |  26 +-
>  drivers/gpu/drm/i915/display/intel_fb.c       | 108 ++++--
>  drivers/gpu/drm/i915/display/intel_fb_pin.c   |   6 -
>  drivers/gpu/drm/i915/display/intel_fbc.c      |  49 ++-
>  drivers/gpu/drm/i915/display/intel_fbdev.c    | 108 +++++-
>  .../gpu/drm/i915/display/intel_frontbuffer.c  | 103 +-----
>  .../gpu/drm/i915/display/intel_frontbuffer.h  |  67 +---
>  drivers/gpu/drm/i915/display/intel_gmbus.c    |   2 +-
>  drivers/gpu/drm/i915/display/intel_hdcp.c     |   9 +-
>  drivers/gpu/drm/i915/display/intel_hdmi.c     |   1 -
>  .../gpu/drm/i915/display/intel_lpe_audio.h    |   8 +
>  .../drm/i915/display/intel_modeset_setup.c    |  11 +-
>  drivers/gpu/drm/i915/display/intel_opregion.c |   2 +-
>  drivers/gpu/drm/i915/display/intel_overlay.c  |  14 -
>  .../gpu/drm/i915/display/intel_pch_display.h  |  16 +
>  .../gpu/drm/i915/display/intel_pch_refclk.h   |   8 +
>  drivers/gpu/drm/i915/display/intel_pipe_crc.c |   1 +
>  .../drm/i915/display/intel_plane_initial.c    |   3 +-
>  drivers/gpu/drm/i915/display/intel_psr.c      |   1 +
>  drivers/gpu/drm/i915/display/intel_sprite.c   |  21 ++
>  drivers/gpu/drm/i915/display/intel_vbt_defs.h |   2 +-
>  drivers/gpu/drm/i915/display/intel_vga.c      |   5 +
>  drivers/gpu/drm/i915/display/skl_scaler.c     |   2 +
>  .../drm/i915/display/skl_universal_plane.c    |  52 ++-
>  drivers/gpu/drm/i915/display/skl_watermark.c  |  25 +-
>  drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   4 -
>  drivers/gpu/drm/i915/gem/i915_gem_domain.c    |   7 -
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |   2 -
>  drivers/gpu/drm/i915/gem/i915_gem_object.c    |  25 --
>  drivers/gpu/drm/i915/gem/i915_gem_object.h    |  22 --
>  drivers/gpu/drm/i915/gem/i915_gem_phys.c      |   4 -
>  drivers/gpu/drm/i915/gt/intel_gt_regs.h       |   3 +-
>  drivers/gpu/drm/i915/i915_driver.c            |   1 +
>  drivers/gpu/drm/i915/i915_gem.c               |   8 -
>  drivers/gpu/drm/i915/i915_gem_gtt.c           |   1 -
>  drivers/gpu/drm/i915/i915_reg_defs.h          |   8 +
>  drivers/gpu/drm/i915/i915_vma.c               |  12 -
>  drivers/gpu/drm/radeon/radeon.h               |  55 +--
>  drivers/gpu/drm/radeon/radeon_ib.c            |  12 +-
>  drivers/gpu/drm/radeon/radeon_object.h        |  25 +-
>  drivers/gpu/drm/radeon/radeon_sa.c            | 314 ++---------------
>  drivers/gpu/drm/radeon/radeon_semaphore.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c        | 182 +++++++---
>  drivers/gpu/drm/ttm/ttm_bo.c                  |   3 +-
>  drivers/misc/mei/hdcp/Kconfig                 |   2 +-
>  drivers/misc/mei/hdcp/mei_hdcp.c              |   3 +-
>  include/drm/drm_pt_walk.h                     | 161 +++++++++
>  include/drm/drm_suballoc.h                    | 112 ++++++
>  include/drm/gpu_scheduler.h                   |  41 ++-
>  sound/hda/hdac_i915.c                         |  17 +-
>  sound/pci/hda/hda_intel.c                     |  56 +--
>  sound/soc/intel/avs/core.c                    |  13 +-
>  sound/soc/sof/intel/hda.c                     |   7 +-
>  98 files changed, 2076 insertions(+), 1325 deletions(-)
>  create mode 100644 drivers/gpu/drm/drm_pt_walk.c
>  create mode 100644 drivers/gpu/drm/drm_suballoc.c
>  create mode 100644 include/drm/drm_pt_walk.h
>  create mode 100644 include/drm/drm_suballoc.h
> 
> -- 
> 2.37.3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH 00/20] Initial Xe driver submission
  2023-02-17 20:51   ` [Intel-gfx] " Daniel Vetter
@ 2023-02-27 12:46     ` Oded Gabbay
  -1 siblings, 0 replies; 161+ messages in thread
From: Oded Gabbay @ 2023-02-27 12:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel

On Fri, Feb 17, 2023 at 10:51 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> Hi all,
>
> [I thought I've sent this out earlier this week, but alas got stuck, kinda
> bad timing now since I'm out next week but oh well]
>
> So xe is quite a substantial thing, and I think we need a clear plan for how to
> land this or it will take forever, and managers will panic. Also I'm not a big
> fan of "Dave/me reviews everything"; we de facto had that for amd's dc/dal and
> it was not fun. The idea here is how to get everything reviewed without having
> two people end up somewhat arbitrarily as deciders.
>
> I've compiled a bunch of topics on what I think the important areas are, first
> code that should be consistent across new-style render drivers that are aimed at
> vk/compute userspace as the primary feature driver:
>
> - figure out consensus solution for fw scheduler and drm/sched frontend among
>   interested driver parties (probably xe, amdgpu, nouveau, new panfrost)
>
> - for the interface itself it might be good to have the drm_gpu_scheduler as the
>   single per-hw-engine driver api object (but internally a new structure), while
>   renaming the current drm_gpu_scheduler to drm_gpu_sched_internal. That way I
>   think we can address the main critique of the current xe scheduler plan
>   - keep the drm_gpu_sched_internal : drm_sched_entity 1:1 relationship for fw
>     scheduler
>   - keep the driver api relationship of drm_gpu_scheduler : drm_sched_entity
>     1:n, the api functions simply iterate over a mutex-protected list of internal
>     schedulers. This should also help drivers with locking mistakes around
>     setup/teardown and gpu reset.
>   - drivers select with a flag or something between the current mode (where the
>     drm_gpu_sched_internal is attached to the drm_gpu_scheduler api object) or
>     the new fw scheduler mode (where drm_gpu_sched_internal is attached to the
>     drm_sched_entity)
>   - overall still no fundamental changes (like the current patches) to drm/sched
>     data structures and algorithms. But unlike the current patches we keep the
>     possibility open for eventual refactoring without having to again refactor
>     all the drivers. Even better, we can delay such refactoring until we have a
>     handful of real-world drivers test-driving this all so we know we actually do
>     the right thing. This should allow us to address all the
>     fairness/efficiency/whatever concerns that have been floating around without
>     having to fix them all up upfront, before we actually know what needs to be
>     fixed.
>
> - the generic scheduler code should also include the handling of endless
>   compute contexts, with the minimal scaffolding for preempt-ctx fences
>   (probably on the drm_sched_entity) and making sure drm/sched can cope with the
>   lack of job completion fence. This is a very minimal amount of code, but it
>   helps a lot for cross-driver review if this works the same (with the same
>   locking and all that) for everyone. Ideally this gets extracted from amdkfd,
>   but as long as it's going to be used by all drivers supporting
>   endless/compute context going forward it's good enough.
>
> - I'm assuming this also means Matt Brost will include a patch to add himself as
>   drm/sched reviewer in MAINTAINERS, or at least something like that
>
> - adopt the gem_exec/vma helpers. again we probably want consensus here among
>   the same driver projects. I don't care whether these helpers specify the ioctl
>   structs or not, but they absolutely need to enforce the overall locking scheme
>   for all major structs and lists (so vm and vma).
>
> - we also should have cross-driver consensus on async vm_bind support. I think
>   everyone added in-syncobj support, the real fun is probably more in/out
>   userspace memory fences (and personally I'm still not sure that's a good idea
>   but ... *eh*). I think cross driver consensus on how this should work (ideally
>   with helper support so people don't get it wrong in all the possible ways)
>   would be best.
>
> - this also means some userptr integration and some consensus on how userptr should
>   work for vm_bind across drivers. I don't think allowing drivers to reinvent
>   that wheel is a bright idea, there's just a bit too much to get wrong here.
>
> - for some of these the consensus might land on more/less shared code than what
>   I sketched out above, the important part really is that we have consensus on
>   these. Kinda similar to how the atomic kms infrastructure moved a _lot_ more of
>   the code back into drivers, because they really just needed the flexibility to
>   program the hw correctly. Right now we definitely don't have enough shared
>   code, for sure with i915-gem, but we also need to make sure we're not
>   overcorrecting too badly (a bit of overcorrecting generally doesn't hurt).
>
> All the above will make sure that the driver overall is aligned in concepts and
> design with the overall community direction, but I think it'd still be good if
> someone outside of the intel gpu group reviews the driver code itself. Last time
> we had a huge driver submission (amd's DC/DAL) this fell on Dave&me, but this
> time around I think we have a perfect candidate with Oded:
>
> - Oded needs/wants to spend some time on ramping up on how drm render drivers
>   work anyway, and xe is probably the best example of a driver that's both
>   supposed to be full-featured, but also doesn't contain an entire display
>   driver on the side.
>
> - Oded is in Habana, which is legally part of Intel. Bean counter budget
>   shuffling to make this happen should be possible.
>
> - Habana is still a fairly distinct entity within Intel, so that is probably the
>   best approach for some independent review, without making the xe team
>   beholden to some non-Intel people.
Hi Daniel,
Thanks for suggesting it, I'll gladly do it.
I guess I'll have more feedback on the plan itself after I start
going over the current Xe driver code.

Oded


* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2023-02-17 20:51   ` [Intel-gfx] " Daniel Vetter
@ 2023-03-01 23:00     ` Rodrigo Vivi
  -1 siblings, 0 replies; 161+ messages in thread
From: Rodrigo Vivi @ 2023-03-01 23:00 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Matthew Brost, intel-gfx, dri-devel

On Fri, Feb 17, 2023 at 09:51:37PM +0100, Daniel Vetter wrote:
> Hi all,
> 
> [I thought I've sent this out earlier this week, but alas got stuck, kinda
> bad timing now since I'm out next week but oh well]
> 
> So xe is quite a substantial thing, and I think we need a clear plan for how to
> land this or it will take forever, and managers will panic. Also I'm not a big
> fan of "Dave/me reviews everything"; we de facto had that for amd's dc/dal and
> it was not fun. The idea here is how to get everything reviewed without having
> two people end up somewhat arbitrarily as deciders.

Thank you so much for taking the time to write this down. We need to get alignment
on the critical topics to see how we can move this forward.

> 
> I've compiled a bunch of topics on what I think the important areas are, first
> code that should be consistent across new-style render drivers that are aimed at
> vk/compute userspace as the primary feature driver:
> 
> - figure out consensus solution for fw scheduler and drm/sched frontend among
>   interested driver parties (probably xe, amdgpu, nouveau, new panfrost)

Yeap. We do need to figure this out. But just to ensure that we are on the same
page here: what I had in mind was that Matt would upstream to drm-misc the 5 or 6
drm_sched-related patches that we have underneath the Xe patches, addressing the
community feedback; then we would merge Xe with the current scheduler solution
(or with modifications based on those mentioned patches); and then we would
continue to work with the other drivers to improve the drm/sched frontend while
we are already in tree. Possible? Or do you want to see fundamental changes
before we can even be considered to get in? Like the ones below?

> 
> - for the interface itself it might be good to have the drm_gpu_scheduler as the
>   single per-hw-engine driver api object (but internally a new structure), while
>   renaming the current drm_gpu_scheduler to drm_gpu_sched_internal. That way I
>   think we can address the main critique of the current xe scheduler plan
>   - keep the drm_gpu_sched_internal : drm_sched_entity 1:1 relationship for fw
>     scheduler
>   - keep the driver api relationship of drm_gpu_scheduler : drm_sched_entity
>     1:n, the api functions simply iterate over a mutex-protected list of internal
>     schedulers. This should also help drivers with locking mistakes around
>     setup/teardown and gpu reset.
>   - drivers select with a flag or something between the current mode (where the
>     drm_gpu_sched_internal is attached to the drm_gpu_scheduler api object) or
>     the new fw scheduler mode (where drm_gpu_sched_internal is attached to the
>     drm_sched_entity)
>   - overall still no fundamental changes (like the current patches) to drm/sched
>     data structures and algorithms. But unlike the current patches we keep the
>     possibility open for eventual refactoring without having to again refactor
>     all the drivers. Even better, we can delay such refactoring until we have a
>     handful of real-world drivers test-driving this all so we know we actually do
>     the right thing. This should allow us to address all the
>     fairness/efficiency/whatever concerns that have been floating around without
>     having to fix them all up upfront, before we actually know what needs to be
>     fixed.

Do you believe this has to be decided, and moved towards one of these options,
before we get merged?
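
To make that concrete, the split described in the bullets above could look
roughly like the sketch below; every name here is hypothetical shorthand for the
proposal, not existing drm/sched API:

#include <linux/list.h>
#include <linux/mutex.h>

/* today's scheduler internals, renamed per the proposal */
struct drm_gpu_sched_internal {
        struct list_head link;  /* on drm_gpu_scheduler.internal_list */
        /* run queues, pending job list, timeout work, ... */
};

/* the single per-hw-engine driver api object */
struct drm_gpu_scheduler {
        struct mutex internal_lock;
        struct list_head internal_list; /* of drm_gpu_sched_internal */
        bool fw_scheduler;              /* selects which mode is used */
};

struct drm_sched_entity {
        /* in fw-scheduler mode, attached here to keep the 1:1
         * internal-scheduler : entity relationship */
        struct drm_gpu_sched_internal *internal;
};

/* api functions iterate a mutex-protected list of internals */
static void drm_gpu_scheduler_for_each(struct drm_gpu_scheduler *sched,
                                       void (*fn)(struct drm_gpu_sched_internal *))
{
        struct drm_gpu_sched_internal *s;

        mutex_lock(&sched->internal_lock);
        list_for_each_entry(s, &sched->internal_list, link)
                fn(s);
        mutex_unlock(&sched->internal_lock);
}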

> 
> - the generic scheduler code should also include the handling of endless
>   compute contexts, with the minimal scaffolding for preempt-ctx fences
>   (probably on the drm_sched_entity) and making sure drm/sched can cope with the
>   lack of job completion fence. This is a very minimal amount of code, but it
>   helps a lot for cross-driver review if this works the same (with the same
>   locking and all that) for everyone. Ideally this gets extracted from amdkfd,
>   but as long as it's going to be used by all drivers supporting
>   endless/compute context going forward it's good enough.

On this one I'm a bit clueless, to be honest. I thought the biggest problems with
the long-running or even endless contexts were due to the hangcheck preemption or
migrations that would end in some pagefaults.
But yeap, it looks like there are open questions around getting these kinds of
workloads properly supported. With this in mind, do you see any real blocker on
Xe? Or any must-have thing?
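
For illustration only, the preempt-ctx fence scaffolding on the entity could be
as small as this sketch (hypothetical names; only the dma_fence machinery is
real API):

#include <linux/dma-fence.h>
#include <linux/spinlock.h>

/* A fence that signals once a long-running context is actually
 * preempted off the hardware, instead of a per-job completion fence. */
struct sketch_preempt_fence {
        struct dma_fence base;
        spinlock_t lock;
};

static const char *sketch_driver_name(struct dma_fence *f)
{
        return "preempt-ctx-sketch";
}

static const char *sketch_timeline_name(struct dma_fence *f)
{
        return "preempt-ctx";
}

static const struct dma_fence_ops sketch_preempt_fence_ops = {
        .get_driver_name = sketch_driver_name,
        .get_timeline_name = sketch_timeline_name,
};

/* armed whenever the long-running context becomes runnable again */
static void sketch_preempt_fence_arm(struct sketch_preempt_fence *f,
                                     u64 context, u64 seqno)
{
        spin_lock_init(&f->lock);
        dma_fence_init(&f->base, &sketch_preempt_fence_ops, &f->lock,
                       context, seqno);
}

/* called by the backend once the context is off the hardware, so
 * eviction/migration can proceed despite no job completion fence */
static void sketch_preempt_fence_complete(struct sketch_preempt_fence *f)
{
        dma_fence_signal(&f->base);
}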

> 
> - I'm assuming this also means Matt Brost will include a patch to add himself as
>   drm/sched reviewer in MAINTAINERS, or at least something like that

+1 on this idea!
This reinforces our engagement and commitment to drm_sched, imho.

> 
> - adopt the gem_exec/vma helpers. again we probably want consensus here among
>   the same driver projects. I don't care whether these helpers specify the ioctl
>   structs or not, but they absolutely need to enforce the overall locking scheme
>   for all major structs and lists (so vm and vma).

On this front I thought we would need to align on a common drm_vm_bind based on
the common parts of the xe vm_bind and the nouveau one, and also that this kind
of engagement would be easier after we are integrated and part of drm-next.
Do we need to do this earlier? Could you expand a bit on what exactly you want
to see before we can be considered for merging, and what can come after?
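
To make the "enforce the locking scheme" part concrete for myself, I imagine
the helpers looking roughly like this, with a single vm-level lock that every
vma list operation asserts (names invented, illustrative only):

#include <linux/list.h>
#include <linux/lockdep.h>
#include <linux/mutex.h>
#include <linux/types.h>

/* hypothetical common vm/vma objects the helpers would define */
struct drm_gpuvm_sketch {
	struct mutex lock;		/* protects vma_list */
	struct list_head vma_list;
};

struct drm_gpuva_sketch {
	struct list_head vm_link;	/* on drm_gpuvm_sketch.vma_list */
	u64 addr, range;
};

/* the helper enforces the scheme instead of each driver reinventing it */
static void drm_gpuvm_sketch_insert(struct drm_gpuvm_sketch *vm,
				    struct drm_gpuva_sketch *vma)
{
	lockdep_assert_held(&vm->lock);
	list_add_tail(&vma->vm_link, &vm->vma_list);
}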

> 
> - we also should have cross-driver consensus on async vm_bind support. I think
>   everyone added in-syncobj support, the real fun is probably more in/out
>   userspace memory fences (and personally I'm still not sure that's a good idea
>   but ... *eh*). I think cross driver consensus on how this should work (ideally
>   with helper support so people don't get it wrong in all the possible ways)
>   would be best.

Should the consensus API come first? Should this block the nouveau implementation
and move us all towards drm_vm_bind? Or can we sync in-tree?
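
For reference, the in/out sync shape I'd expect an async vm_bind uapi to
converge on, very roughly (a hypothetical struct for illustration, not the
actual xe or nouveau uapi):

#include <linux/types.h>

struct drm_vm_bind_sketch {
	__u64 addr;		/* GPU VA of the mapping */
	__u64 range;		/* size of the mapping */
	__u64 bo_offset;	/* offset into the backing object */
	__u32 bo_handle;	/* GEM handle of the backing object */
	__u32 num_in_syncs;	/* syncobjs to wait on before binding */
	__u64 in_syncs;		/* userspace pointer to __u32 handles */
	__u32 num_out_syncs;	/* syncobjs signalled once bound */
	__u32 pad;
	__u64 out_syncs;	/* userspace pointer to __u32 handles */
};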

> 
> - this also means some userptr integration and some consensus on how userptr should
>   work for vm_bind across drivers. I don't think allowing drivers to reinvent
>   that wheel is a bright idea, there's just a bit too much to get wrong here.

Ack, but kind of the same question: is it a blocker to align beforehand, or
easier to align in-tree?
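
The wheel here is presumably the MMU notifier plumbing; a sketch of the shared
piece as I picture it (the mmu_interval_notifier calls are real kernel API,
everything with a _sketch suffix is invented):

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct drm_userptr_sketch {
	struct mmu_interval_notifier notifier;
};

static bool drm_userptr_sketch_invalidate(struct mmu_interval_notifier *mni,
					  const struct mmu_notifier_range *range,
					  unsigned long cur_seq)
{
	/* a shared helper would unmap the GPU range and queue a rebind */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops drm_userptr_sketch_ops = {
	.invalidate = drm_userptr_sketch_invalidate,
};

static int drm_userptr_sketch_init(struct drm_userptr_sketch *up,
				   unsigned long start, unsigned long length)
{
	return mmu_interval_notifier_insert(&up->notifier, current->mm,
					    start, length,
					    &drm_userptr_sketch_ops);
}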

> 
> - for some of these the consensus might land on more/less shared code than what
>   I sketched out above, the important part really is that we have consensus on
>   these. Kinda similar to how the atomic kms infrastructure moved a _lot_ more of
>   the code back into drivers, because they really just needed the flexibility to
>   program the hw correctly. Right now we definitely don't have enough shared
>   code, for sure with i915-gem, but we also need to make sure we're not
>   overcorrecting too badly (a bit of overcorrecting generally doesn't hurt).

+1 on this. We need to work more in the drm layers, like display has done so successfully!

> 
> All the above will make sure that the driver overall is in concepts and design
> aligned with the overall community direction, but I think it'd still be good if
> someone outside of the intel gpu group reviews the driver code itself. Last time
> we had a huge driver submission (amd's DC/DAL) this fell on Dave&me, but this
> time around I think we have a perfect candidate with Oded:
> 
> - Oded needs/wants to spend some time on ramping up on how drm render drivers
>   work anyway, and xe is probably the best example of a driver that's both
>   supposed to be full-featured, but also doesn't contain an entire display
>   driver on the side.
> 
> - Oded is in Habana, which is legally part of Intel. Bean counter budget
>   shuffling to make this happen should be possible.
> 
> - Habana is still a fairly distinct entity within Intel, so that is probably the
>   best approach for some independent review, without making the xe team
>   beholden to some non-Intel people.

+1 on this entire idea here as well.

> 
> The above should yield a pretty clear road towards landing xe, without any
> big review fights with Dave/me like we had with amd's DC/DAL, which took a
> rather long time to land unfortunately :-(

As I wrote already, I really agree with you that we should work more with drm
and more with the other drivers. But for the logistics of the work, the rebase
pains, and to avoid a situation where we have a totally divergent driver, I
believe the fastest way is to solve any blockers and big issues first, then
merge, then work towards more collaboration as the next step.

Especially since with Xe we are not planning to remove the force_probe
flag for a while, which puts us in a "staging" situation.
We could even make use of CONFIG_STAGING if needed.

Thoughts?
And more than that, any already-known big blockers?

> 
> These are just my thoughts, let the bikeshed commence!

:)

> 
> Ideally we put them all into a TODO like we've done for DC/DAL, once we have
> some consensus.

I like the TODO list idea.
We also need to make more use of the RFC doc section, like
i915-vmbind did.

On the TODO part, where in the doc do you recommend adding it?

Again, thank you so much,
Rodrigo.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
  2023-03-01 23:00     ` Rodrigo Vivi
@ 2023-03-09 15:10       ` Daniel Vetter
  -1 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-03-09 15:10 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: intel-gfx, dri-devel

On Thu, 2 Mar 2023 at 00:00, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> On Fri, Feb 17, 2023 at 09:51:37PM +0100, Daniel Vetter wrote:
> > Hi all,
> >
> > [I thought I've sent this out earlier this week, but alas got stuck, kinda
> > bad timing now since I'm out next week but oh well]
> >
> > So xe is a quite substantial thing, and I think we need a clear plan how to land
> > this or it will take forever, and managers will panic. Also I'm not a big fan of
> > "Dave/me reviews everything", we de facto had that for amd's dc/dal and it was
> > not fun. The idea here is how to get everything reviewed without having two
> > people end up somewhat arbitrary as deciders.
>
> Thank you so much for taking time to write it down. We need to get alignment
> on the critical topics to see how we can move this forward.

Sorry for the delay on my side; last week was carnival and this week a big
team meeting.

> > I've compiled a bunch of topics on what I think the important areas are, first
> > code that should be consistent about new-style render drivers that are aimed for
> > vk/compute userspace as the primary feature driver:
> >
> > - figure out consensus solution for fw scheduler and drm/sched frontend among
> >   interested driver parties (probably xe, amdgpu, nouveau, new panfrost)
>
> Yeap. We do need to figure this out. But just to ensure that we are in the same
> page here. What I had in mind was that Matt would upstream the 5 or 6 drm_sched
> related patches that we have underneath Xe patches on drm-misc with addressing
> the community feedback, then we would merge Xe with the current schedule solution
> (or modifications based on the modifications of these mentioned patches) and
> then we would continue to work with the other drivers to improve the drm sched
> frontend while we are already in tree. Possible? or do you want to see
> fundamental changes before we can even be considered for merging? Like the ones below?

The trouble with that is that you'll then have a lot more driver
changes and big renames in drivers after they have landed. That might
be too painful, which is why I suggested the minimal driver-api
wrapping below to decouple things. My worry is that if you don't do
that, the driver merging will be bogged down in endless discussions
about what exactly the refactoring should look like (the discussions
here and elsewhere kinda gave a preview), and it'll actually make
driver merging slower. Hence my suggestion to just decouple things
enough that people can agree to merge now and refactor later as the
reasonable thing.

Now if there were consensus that what's here is already perfect and
nothing more is needed for fw scheduling, then I think just going
ahead with that would be perfectly fine, but I'm kinda not seeing
that.

Given that there is so much discussion here I don't want to step in
with an arbitrary maintainer verdict; that's largely the approach we
took with amd's dal, and I don't think it was really the best way
forward.

> > - for the interface itself it might be good to have the drm_gpu_scheduler as the
> >   single per-hw-engine driver api object (but internally a new structure), while
> >   renaming the current drm_gpu_scheduler to drm_gpu_sched_internal. That way I
> >   think we can address the main critique of the current xe scheduler plan
> >   - keep the drm_gpu_sched_internal : drm_sched_entity 1:1 relationship for fw
> >     scheduler
> >   - keep the driver api relationship of drm_gpu_scheduler : drm_sched_entity
> >     1:n, the api functions simply iterate over a mutex-protected list of internal
> >     schedulers. This should also help drivers with locking mistakes around
> >     setup/teardown and gpu reset.
> >   - drivers select with a flag or something between the current mode (where the
> >     drm_gpu_sched_internal is attached to the drm_gpu_scheduler api object) or
> >     the new fw scheduler mode (where drm_gpu_sched_internal is attached to the
> >     drm_sched_entity)
> >   - overall still no fundamental changes (like the current patches) to drm/sched
> >     data structures and algorithms. But unlike the current patches we keep the
> >     possibility open for eventual refactoring without having to again refactor
> >     all the drivers. Even better, we can delay such refactoring until we have a
> >     handful of real-world drivers test-driving this all so we know we actually do
> >     the right thing. This should allow us to address all the
> >     fairness/efficiency/whatever concerns that have been floating around without
> >     having to fix them all up upfront, before we actually know what needs to be
> >     fixed.
>
> Do you believe this has to be decided, and moved towards one of these, before we
> get merged?

I think we need either clear consensus among stakeholders that
refactoring afterwards is the right thing, or something like the above
to decouple fw scheduling drivers from drm/sched. My gut feeling is
that the 2nd option is a much faster and less risky path to xe, but if
you want to run the big dri-devel arguing championship, you can do
that too :-)

> > - the generic scheduler code should also include the handling of endless
> >   compute contexts, with the minimal scaffolding for preempt-ctx fences
> >   (probably on the drm_sched_entity) and making sure drm/sched can cope with the
> >   lack of job completion fence. This is very minimal amounts of code, but it
> >   helps a lot for cross-driver review if this works the same (with the same
> >   locking and all that) for everyone. Ideally this gets extracted from amdkfd,
> >   but as long as it's going to be used by all drivers supporting
> >   endless/compute context going forward it's good enough.
>
> On this one I'm a bit clueless, to be honest. I thought the biggest problems with
> long-running or even endless contexts were due to hangcheck preemption or
> migrations that would end in pagefaults.
> But yeah, it looks like there are open questions around getting these kinds of
> workloads properly supported. With this in mind, do you see any real blocker on
> Xe, or any must-have thing?

Yeah, hangcheck is the architectural issue; this here is just my
suggestion for how to solve it technically, in code, in a consistent
way across drivers. From a subsystem maintainer pov the important
stuff is really that drivers share key concepts as much as possible,
and I think this is one such key concept. E.g. for amd's dal we didn't
ask them to rewrite the entire thing to our taste, but to properly
integrate key drm display concepts and structs directly into their
driver, so that when you want to look for all crtc-related code across
all drivers you just have to chase drm_crtc, and not figure out what a
crtc is in each driver (e.g. in i915 display speak crtc = pipe).

Same here: just minimal data structures and scaffolding for
long-running contexts and preempt-ctx fences would be really good, I
think.

> > - I'm assuming this also means Matt Brost will include a patch to add himself as
> >   drm/sched reviewer in MAINTAINERS, or at least something like that
>
> +1 on this idea!
> This reinforces our engagement with and commitment to drm_sched, imho.
>
> >
> > - adopt the gem_exec/vma helpers. again we probably want consensus here among
> >   the same driver projects. I don't care whether these helpers specify the ioctl
> >   structs or not, but they absolutely need to enforce the overall locking scheme
> >   for all major structs and lists (so vm and vma).
>
> On this front I thought we would need to align on a common drm_vm_bind based on
> the common parts of the xe vm_bind and the nouveau one, and also that this kind
> of engagement would be easier after we are integrated and part of drm-next.
> Do we need to do this earlier? Could you expand a bit on what exactly you want
> to see before we can be considered for merging, and what can come after?

Again, I don't really want to hand down a maintainer verdict here and
just dictate; I think it's much better if these discussions happen
directly among the involved people. I do generally think that the
refactoring should happen upfront for xe, simply due to past track
record. Which, yes, sucks a bit and is a special requirement, but I
think a somewhat stricter barrier that is really clear is much better
for everyone than some very, very handwavy "enough to make Dave&me
happy" super vague thing that would guarantee heated arguments like
we've had plenty of with amd's dal.

> > - we also should have cross-driver consensus on async vm_bind support. I think
> >   everyone added in-syncobj support, the real fun is probably more in/out
> >   userspace memory fences (and personally I'm still not sure that's a good idea
> >   but ... *eh*). I think cross driver consensus on how this should work (ideally
> >   with helper support so people don't get it wrong in all the possible ways)
> >   would be best.
>
> Should the consensus API come first? Should this block the nouveau implementation
> and move us all towards drm_vm_bind? Or can we sync in-tree?

Same as above.

Since async isn't a requirement (it's optional for both vk and I
guess also for compute, since current compute still works on i915-gem,
which doesn't have this?), the feature itself shouldn't block merging
the xe driver. So I think it's not too horrendous to make consensus
here a pre-merge blocker. Of course if the vulkan folks disagree then
maybe do some other merge order (and record all that with appropriate
amounts of acks).

> > - this also means some userptr integration and some consensus on how userptr should
> >   work for vm_bind across drivers. I don't think allowing drivers to reinvent
> >   that wheel is a bright idea, there's just a bit too much to get wrong here.
>
> Ack, but kind of the same question: is it a blocker to align beforehand, or
> easier to align in-tree?

Still same as above: I think it's best to make them all clear
pre-merge goals, so that we avoid endless "is this now good enough"
discussions, with all the frustration and arbitrary delays those would
bring. And yes, again, I realize this sucks a bit and is a bit
special. Kinda the same idea as what we've tried doing with the
refactoring/feature-landing documents in Documentation/gpu/rfc.rst.

Actually maybe the entire xe merge plan should become
Documentation/gpu/rfc/xe.rst?

> > - for some of these the consensus might land on more/less shared code than what
> >   I sketched out above, the important part really is that we have consensus on
> >   these. Kinda similar to how the atomic kms infrastructure moved a _lot_ more of
> >   the code back into drivers, because they really just needed the flexibility to
> >   program the hw correctly. Right now we definitely don't have enough shared
> >   code, for sure with i915-gem, but we also need to make sure we're not
> >   overcorrecting too badly (a bit of overcorrecting generally doesn't hurt).
>
> +1 on this. We need to work more in the drm layers, like display has done so successfully!
>
> >
> > All the above will make sure that the driver overall is in concepts and design
> > aligned with the overall community direction, but I think it'd still be good if
> > someone outside of the intel gpu group reviews the driver code itself. Last time
> > we had a huge driver submission (amd's DC/DAL) this fell on Dave&me, but this
> > time around I think we have a perfect candidate with Oded:
> >
> > - Oded needs/wants to spend some time on ramping up on how drm render drivers
> >   work anyway, and xe is probably the best example of a driver that's both
> >   supposed to be full-featured, but also doesn't contain an entire display
> >   driver on the side.
> >
> > - Oded is in Habana, which is legally part of Intel. Bean counter budget
> >   shuffling to make this happen should be possible.
> >
> > - Habana is still a fairly distinct entity within Intel, so that is probably the
> >   best approach for some independent review, without making the xe team
> >   beholden to some non-Intel people.
>
> +1 on this entire idea here as well.
>
> >
> > The above should yield a pretty clear road towards landing xe, without any
> > big review fights with Dave/me like we had with amd's DC/DAL, which took a
> > rather long time to land unfortunately :-(
>
> As I wrote already, I really agree with you that we should work more with drm
> and more with the other drivers. But for the logistics of the work, the rebase
> pains, and to avoid a situation where we have a totally divergent driver, I
> believe the fastest way is to solve any blockers and big issues first, then
> merge, then work towards more collaboration as the next step.
>
> Especially since with Xe we are not planning to remove the force_probe
> flag for a while, which puts us in a "staging" situation.
> We could even make use of CONFIG_STAGING if needed.

In general, I agree with this.

But also, I've acked a bunch of these plans for intel-gem (like the
i915 guc scheduler), and then we had to backtrack on those because
everyone realized that "refactor in-tree" was actually impossible.

It's definitely a bit of a "burned too many times on this" reaction,
but I'd like to make sure we don't end up in that situation again with
the next big pile of intel-gem code.

> Thoughts?
> And more than that, any already-known big blockers?
>
> >
> > These are just my thoughts, let the bikeshed commence!
>
> :)
>
> >
> > Ideally we put them all into a TODO like we've done for DC/DAL, once we have
> > some consensus.
>
> I like the TODO list idea.
> We also need to make more use of the RFC doc section, like
> i915-vmbind did.
>
> On the TODO part, where in the doc do you recommend adding it?

See above, I think we have the right place with the rfc section already.

Cheers, Daniel

>
> Again, thank you so much,
> Rodrigo.
>
> >
> > Cheers, Daniel
> >
> > On Thu, Dec 22, 2022 at 02:21:07PM -0800, Matthew Brost wrote:
> > > Hello,
> > >
> > > This is a submission for Xe, a new driver for Intel GPUs that supports both
> > > integrated and discrete platforms starting with Tiger Lake (first platform with
> > > Intel Xe Architecture). The intention of this new driver is to have a fresh base
> > > to work from that is unencumbered by older platforms, whilst also taking the
> > > opportunity to rearchitect our driver to increase sharing across the drm
> > > subsystem, both leveraging and allowing us to contribute more towards other
> > > shared components like TTM and drm/scheduler. The memory model is based on VM
> > > bind which is similar to the i915 implementation. Likewise the execbuf
> > > implementation for Xe is very similar to execbuf3 in the i915 [1].
> > >
> > > The code is at a stage where it is already functional and has experimental
> > > support for multiple platforms starting from Tiger Lake, with initial support
> > > implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as well
> > > as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> > > implementation will be released publicly early next year. We also have a suite
> > > of IGTs for XE that will appear on the IGT list shortly.
> > >
> > > It has been built with the assumption of supporting multiple architectures from
> > > the get-go, right now with tests running both on X86 and ARM hosts. And we
> > > intend to continue working on it and improving on it as part of the kernel
> > > community upstream.
> > >
> > > The new Xe driver leverages a lot from i915 and work on i915 continues as we
> > > ready Xe for production throughout 2023.
> > >
> > > As for display, the intent is to share the display code with the i915 driver so
> > > that there is maximum reuse there. Currently this is being done by compiling the
> > > display code twice, but alternatives to that are under consideration and we want
> > > to have more discussion on what the best final solution will look like over the
> > > next few months. Right now, work is ongoing in refactoring the display codebase
> > > to remove as much as possible any unnecessary dependencies on i915 specific data
> > > structures there..
> > >
> > > We currently have 2 submission backends, execlists and GuC. The execlist is
> > > meant mostly for testing and is not fully functional while GuC backend is fully
> > > functional. As with the i915 and GuC submission, in Xe the GuC firmware is
> > > required and should be placed in /lib/firmware/xe.
> > >
> > > The GuC firmware can be found in the below location:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
> > >
> > > The easiest way to setup firmware is:
> > > cp -r /lib/firmware/i915 /lib/firmware/xe
> > >
> > > The code has been organized such that we have all patches that touch areas
> > > outside of drm/xe first for review, and then the actual new driver in a separate
> > > commit. The code which is outside of drm/xe is included in this RFC while
> > > drm/xe is not due to the size of the commit. The drm/xe is code is available in
> > > a public repo listed below.
> > >
> > > Xe driver commit:
> > > https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
> > >
> > > [rest of quoted cover letter, diffstat and signatures snipped]



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Intel-gfx] [RFC PATCH 00/20] Initial Xe driver submission
@ 2023-03-09 15:10       ` Daniel Vetter
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel Vetter @ 2023-03-09 15:10 UTC (permalink / raw)
  To: Rodrigo Vivi; +Cc: Matthew Brost, intel-gfx, dri-devel

On Thu, 2 Mar 2023 at 00:00, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote:
> On Fri, Feb 17, 2023 at 09:51:37PM +0100, Daniel Vetter wrote:
> > Hi all,
> >
> > [I thought I'd sent this out earlier this week, but alas it got stuck; kinda
> > bad timing now since I'm out next week, but oh well]
> >
> > So xe is a quite substantial thing, and I think we need a clear plan for how to
> > land this or it will take forever, and managers will panic. Also I'm not a big fan
> > of "Dave/me reviews everything"; we de facto had that for amd's dc/dal and it was
> > not fun. The idea here is how to get everything reviewed without having two
> > people end up somewhat arbitrarily as deciders.
>
> Thank you so much for taking the time to write this down. We need to get
> alignment on the critical topics to see how we can move this forward.

Sorry for the delay on my side; last week was carnival and this week a
big team meeting.

> > I've compiled a bunch of topics on what I think the important areas are. First,
> > code that should be consistent across new-style render drivers that are aimed at
> > vk/compute userspace as the primary target:
> >
> > - figure out consensus solution for fw scheduler and drm/sched frontend among
> >   interested driver parties (probably xe, amdgpu, nouveau, new panfrost)
>
> Yeap. We do need to figure this out. But just to ensure that we are on the same
> page here: what I had in mind was that Matt would upstream the 5 or 6 drm_sched
> related patches that we have underneath the Xe patches on drm-misc, addressing
> the community feedback, then we would merge Xe with the current scheduler
> solution (possibly modified based on changes to those mentioned patches) and
> then we would continue to work with the other drivers to improve the drm sched
> frontend while we are already in tree. Possible? Or do you want to see
> fundamental changes before we can even consider getting in? Like the ones below?

The trouble with that is that then you'll have a lot more driver
changes and big renames in drivers after they landed. Which might be
too painful, and why I suggested the below minimal-most driver-api
wrapping to decouple that. My worry is that if you don't do that, then
the driver merging will be bogged down in endless discussions about
what the refactoring should look like exactly (the discussions here
and elsewhere kinda gave a preview), and it'll make driver merging
actually slower. Hence my suggestion to just decouple things enough so
that people agree to merge now and refactor later as the reasonable
thing.

Now if there were consensus that this here is already perfect and
nothing more is needed for fw scheduling, then I think just going ahead
with that would be perfectly fine, but I'm kinda not seeing that.

Given that there is so much discussion here I don't want to step in
with an arbitrary maintainer verdict; that's largely the approach
we took with amd's dal and I don't think it was the best way
forward really.

> > - for the interface itself it might be good to have the drm_gpu_scheduler as the
> >   single per-hw-engine driver api object (but internally a new structure), while
> >   renaming the current drm_gpu_scheduler to drm_gpu_sched_internal. That way I
> >   think we can address the main critique of the current xe scheduler plan
> >   - keep the drm_gpu_sched_internal : drm_sched_entity 1:1 relationship for fw
> >     scheduler
> >   - keep the driver api relationship of drm_gpu_scheduler : drm_sched_entity
> >     1:n, the api functions simply iterate over a mutex-protected list of internal
> >     schedulers. this should also help drivers with locking mistakes around
> >     setup/teardown and gpu reset.
> >   - drivers select with a flag or something between the current mode (where the
> >     drm_gpu_sched_internal is attached to the drm_gpu_scheduler api object) or
> >     the new fw scheduler mode (where drm_gpu_sched_internal is attached to the
> >     drm_sched_entity)
> >   - overall still no fundamental changes (like the current patches) to drm/sched
> >     data structures and algorithms. But unlike the current patches we keep the
> >     possibility open for eventual refactoring without having to again refactor
> >     all the drivers. Even better, we can delay such refactoring until we have a
> >     handful of real-world drivers test-driving this all so we know we actually do
> >     the right thing. This should allow us to address all the
> >     fairness/efficiency/whatever concerns that have been floating around without
> >     having to fix them all up upfront, before we actually know what needs to be
> >     fixed.
>
> do you believe this has to be decided and moved towards one of these before we
> get merged?

Either clear consensus by stakeholders that refactoring
afterwards is the right thing, or something like the above to decouple
fw scheduling drivers from drm/sched, is I think needed. And my gut
feeling is that the 2nd option is a much faster and less risky path to
xe, but if you want to do the big dri-devel arguing championship, you
can do that too :-)
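
To make that a bit more concrete, here's the kind of thin wrapper I
have in mind. Just a sketch: the drm_gpu_scheduler /
drm_gpu_sched_internal split is from my bullet points above, everything
else is made up and open for bikeshedding:

#include <linux/list.h>
#include <linux/mutex.h>

/* driver-facing api object, one per hw engine, just a thin wrapper */
struct drm_gpu_scheduler {
	struct mutex lock;
	/*
	 * Internal schedulers: a single entry in the current mode, one
	 * per drm_sched_entity in fw scheduler mode (selected by the
	 * flag below at init time).
	 */
	struct list_head internal_scheds;
	bool fw_scheduler;
};

/* the current drm_gpu_scheduler guts would move here, unchanged */
struct drm_gpu_sched_internal {
	struct list_head link;	/* protected by drm_gpu_scheduler.lock */
	/* ... all the existing fields ... */
};

The api entry points then simply iterate over internal_scheds under the
lock, which is also what should help with the setup/teardown vs gpu
reset locking mistakes.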

> > - the generic scheduler code should also include the handling of endless
> >   compute contexts, with the minimal scaffolding for preempt-ctx fences
> >   (probably on the drm_sched_entity) and making sure drm/sched can cope with the
> >   lack of job completion fence. This is very minimal amounts of code, but it
> >   helps a lot for cross-driver review if this works the same (with the same
> >   locking and all that) for everyone. Ideally this gets extracted from amdkfd,
> >   but as long as it's going to be used by all drivers supporting
> >   endless/compute context going forward it's good enough.
>
> On this one I'm a bit clueless to be honest. I thought the biggest problem with
> long-running or even endless contexts was the hangcheck preemption or
> migrations that would end in pagefaults.
> But yeap, it looks like there are open issues to get these kinds of workloads
> properly supported. But with this in mind, do you see any real blocker on Xe? Or
> any must-have thing?

Yeah, hangcheck is the architectural issue; this here is just my
suggestion for how to solve it technically in code in a consistent way
across drivers. From a subsystem maintainer pov the important stuff is
really that drivers share key concepts as much as possible, and I
think this is one such key concept. E.g. for amd's dal we didn't ask
them to rewrite the entire thing to our taste, but to properly
integrate key drm display concepts and structs directly into their
driver so that when you want to look for all crtc related code across
all drivers you just have to chase drm_crtc, and not figure out what a
crtc is in each driver (e.g. in i915 display speak crtc = pipe).

Same here, just minimal data structures and scaffolding for
long-running contexts and preempt-ctx fences would be really good I
think.
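
In code, I'm thinking of not much more than something like this on the
entity. Again purely a sketch, the field names here are invented:

#include <linux/dma-fence.h>

struct drm_sched_entity {
	/* ... all the existing fields stay as-is ... */

	/*
	 * Long-running contexts have no per-job completion fence;
	 * instead the entity carries a preempt-ctx fence that signals
	 * once the context has actually been preempted off the hw, so
	 * that eviction and reset paths have something to wait on.
	 */
	struct dma_fence *preempt_fence;
	bool long_running;
};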

> > - I'm assuming this also means Matt Brost will include a patch to add himself as
> >   drm/sched reviewer in MAINTAINERS, or at least something like that
>
> +1 on this idea!
> This reinforces our engagement with and commitment to drm_sched imho.
>
> >
> > - adopt the gem_exec/vma helpers. again we probably want consensus here among
> >   the same driver projects. I don't care whether these helpers specify the ioctl
> >   structs or not, but they absolutely need to enforce the overall locking scheme
> >   for all major structs and lists (so vm and vma).
>
> On this front I thought we would need to align on a common drm_vm_bind based on
> the common parts of the xe vm_bind and the nouveau one. And also some engagement
> that I thought would be easier after we are integrated and part of drm-next.
> Do we need to do this earlier? Could you please expand a bit on what
> exactly you want to see before we can be considered for merging, or after?

Again, I don't really want to do the maintainer verdict here and just
dictate, I think much better if these discussions are done directly
among the involved people. I do generally think that the refactoring
should happen upfront for xe, simply due to past track record. Which
yes sucks a bit and is a special requirement, but I think a somewhat
stricter but really clear barrier is much better for everyone
than some very, very handwavy "enough to make Dave&me happy" super
vague thing that would guarantee heated arguments like we've had plenty
of with amd's dal.
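
To be a bit less handwavy on the vm/vma side specifically, by "enforce
the locking scheme" I mean helpers roughly along these lines. Sketch
only, every name here is made up for illustration:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/types.h>

#include <drm/drm_gem.h>

struct drm_gem_vm {
	/* protects the vma list and all vma state in this vm */
	struct mutex lock;
	struct list_head vmas;
};

struct drm_gem_vma {
	struct drm_gem_vm *vm;
	struct list_head vm_link;	/* protected by vm->lock */
	u64 start, range;		/* va range covered by this vma */
	struct drm_gem_object *obj;	/* backing bo, NULL for sparse */
};

I.e. one clearly documented lock per vm that every driver takes in the
same places, instead of each driver inventing its own scheme.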

> > - we also should have cross-driver consensus on async vm_bind support. I think
> >   everyone added in-syncobj support, the real fun is probably more in/out
> >   userspace memory fences (and personally I'm still not sure that's a good idea
> >   but ... *eh*). I think cross driver consensus on how this should work (ideally
> >   with helper support so people don't get it wrong in all the possible ways)
> >   would be best.
>
> Should the consensus API come first? Should this block the nouveau implementation
> and move us all towards drm_vm_bind? Or can we sync in-tree?

Same as above.

Since async isn't a requirement (it's optional for both vk and I guess
also for compute, since current compute still works on i915-gem, which
doesn't have this?), sorting this out shouldn't hold up merging the xe
driver much. So I think it's not too horrendous to make this a
pre-merge blocker. Of course if the vulkan folks disagree then maybe do
some other merge order (and record all that with appropriate amounts of
acks).
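
And just so we're all talking about the same thing wrt in/out syncobjs,
the rough shape I mean is below. Purely illustrative and explicitly not
a uapi proposal, all names made up:

#include <linux/types.h>

struct drm_illustrative_vm_bind {
	__u64 addr;		/* start of the va range */
	__u64 range;		/* size of the va range */
	__u32 bo_handle;	/* gem bo to map, 0 to unmap */
	__u32 num_in_syncobjs;
	__u64 in_syncobjs;	/* ptr to __u32 syncobj handles to wait on */
	__u32 num_out_syncobjs;
	__u64 out_syncobjs;	/* ptr to __u32 syncobj handles to signal */
};

The bind would only execute once all in_syncobjs have signalled, and
all out_syncobjs would signal once the (async) bind has completed.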

> > - this also means some userptr integration and some consensus how userptr should
> >   work for vm_bind across drivers. I don't think allowing drivers to reinvent
> >   that wheel is a bright idea, there's just a bit too much to get wrong here.
>
> Ack. But kind of the same question: is it a blocker to align before, or easier
> to align in tree?

Still same as above, I think it's best to make them all clear
pre-merge goals so that we avoid endless "is this now good enough"
discussions with all the frustration and arbitrary delays this would
bring. And yes, again, I realize this sucks a bit and is a bit special.
Kinda the same idea as we've tried doing with the
refactoring/feature-landing documents in Documentation/gpu/rfc.rst.

Actually, maybe the entire xe merge plan should become
Documentation/gpu/rfc/xe.rst?

> > - for some of these the consensus might land on more/less shared code than what
> >   I sketched out above, the important part really is that we have consensus on
> >   these. Kinda similar to how the atomic kms infrastructure moved a _lot_ more of
> >   the code back into drivers, because they really just needed the flexibility to
> >   program the hw correctly. Right now we definitely don't have enough shared
> >   code, for sure with i915-gem, but we also need to make sure we're not
> >   overcorrecting too badly (a bit of overcorrecting generally doesn't hurt).
>
> +1 on this. We need to work more in the drm layers like display has done successfully!
>
> >
> > All the above will make sure that the driver overall is in concepts and design
> > aligned with the overall community direction, but I think it'd still be good if
> > someone outside of the intel gpu group reviews the driver code itself. Last time
> > we had a huge driver submission (amd's DC/DAL) this fell on Dave&me, but this
> > time around I think we have a perfect candidate with Oded:
> >
> > - Oded needs/wants to spend some time on ramping up on how drm render drivers
> >   work anyway, and xe is probably the best example of a driver that's both
> >   supposed to be full-featured, but also doesn't contain an entire display
> >   driver on the side.
> >
> > - Oded is in Habana, which is legally part of Intel. Bean counter budget
> >   shuffling to make this happen should be possible.
> >
> > - Habana is still a fairly distinct entity within Intel, so that is probably the
> >   best approach for some independent review, without making the xe team
> >   beholden to some non-Intel people.
>
> +1 on this entire idea here as well.
>
> >
> > The above should yield some pretty clear road towards landing xe, without any
> > big review fights with Dave/me like we had with amd's DC/DAL, which took a
> > rather long time to land unfortunately :-(
>
> As I wrote already, I really agree with you that we should work more with
> drm and more with the other drivers. But for the logistics of the work and
> the rebase pains, and to avoid a situation where we have a totally divergent
> driver, I believe the fastest way is to solve any blockers and big issues
> first, then merge, then work towards more collaboration as the next step.
>
> Especially since with Xe we are not planning to remove the force_probe
> flag for a while, which puts us in a "staging" situation.
> We could even make use of CONFIG_STAGING if needed.

In general, I agree with this.

But also, I've acked a bunch of these plans for intel-gem (like the
i915 guc scheduler), and then we had to backtrack those because
everyone realized that "refactor in-tree" was actually impossible.

It's definitely a bit of a "burned too many times on this" reaction, but
I'd like to make sure we don't end up in that situation again with the
next big pile of intel-gem code.

> Thoughts?
> And more than that, any already known big blockers?
>
> >
> > These are just my thoughts, let the bikeshed commence!
>
> :)
>
> >
> > Ideally we put them all into a TODO like we've done for DC/DAL, once we have
> > some consensus.
>
> I like the TODO list idea.
> We also need to make more use of the RFC doc section, like
> i915-vmbind did.
>
> On the TODO part, where do you recommend adding it in the docs?

See above, I think we have the right place with the rfc section already.

Cheers, Daniel

>
> Again, thank you so much,
> Rodrigo.
>
> >
> > Cheers, Daniel
> >
> > On Thu, Dec 22, 2022 at 02:21:07PM -0800, Matthew Brost wrote:
> > > [quoted cover letter, diffstat and signature snipped]



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
