From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Subject: [PATCH v4 0/7] Default request/fence expiry + watchdog
Date: Wed, 24 Mar 2021 12:13:28 +0000
Message-ID: <20210324121335.2307063-1-tvrtko.ursulin@linux.intel.com>

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

"Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
post of a somewhat controversial feature, now upgraded to patch status.

I quote "watchdog" because in the classical sense a watchdog would allow
userspace to ping it and so remain alive.

I quote "restoring hangcheck" because this series, contrary to the old
hangcheck, is not looking at whether the workload is making any progress from
the kernel side either. (Although, disclaimer, my memory may be leaky - Daniel
suspects old hangcheck had some stricter, more indiscriminate, angles to it.
But apart from it being prone to both false negatives and false positives I
can't remember that myself.)

Short version - the ask is to fail any user submission after a set time period.
In this RFC that time is twelve seconds.

Time counts from the moment the user submission is "runnable" (implicit and
explicit dependencies have been cleared) and keeps counting regardless of the
GPU contention caused by other users of the system.

So the semantics are really a bit weak, but again, I understand this is really
really wanted by the DRM core even if I am not convinced it is a good idea.

There are some dangers with doing this - text borrowed from a patch in the
series:

This can have the effect that workloads which used to work fine will suddenly
start failing. Even workloads comprised of short batches but in long dependency
chains can be terminated.
And because of the lack of agreement on the usefulness and safety of fence
error propagation, this partial execution can be invisible to userspace even if
it is "listening" to the returned fence status.

Another interaction is with hangcheck, where care needs to be taken that the
timeout is not set lower than, or close to, three times the heartbeat interval.
Otherwise a hang in any application can cause complete termination of all
submissions from unrelated clients. Any users modifying the per engine
heartbeat intervals therefore need to be aware of this potential denial of
service to avoid inadvertently enabling it.

Given all this I am personally not convinced the scheme is a good idea.
Intuitively it feels object importers would be better positioned to enforce the
time they are willing to wait for something to complete.

v2:
 * Dropped context param.
 * Improved commit messages and Kconfig text.

v3:
 * Log timeouts.
 * Bump timeout to 20s to see if it helps Tigerlake.
 * Fix sentinel assert.

v4:
 * A round of review feedback applied.

Chris Wilson (1):
  drm/i915: Individual request cancellation

Tvrtko Ursulin (6):
  drm/i915: Extract active lookup engine to a helper
  drm/i915: Restrict sentinel requests further
  drm/i915: Handle async cancellation in sentinel assert
  drm/i915: Request watchdog infrastructure
  drm/i915: Fail too long user submissions by default
  drm/i915: Allow configuring default request expiry via modparam

 drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
 drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
 .../drm/i915/gt/intel_execlists_submission.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
 drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
 drivers/gpu/drm/i915/i915_params.c            |   5 +
 drivers/gpu/drm/i915/i915_params.h            |   1 +
 drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
 drivers/gpu/drm/i915/i915_request.h           |  16 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
 17 files changed, 479 insertions(+), 45 deletions(-)

-- 
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
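
For illustration only, the two timing rules described in the cover letter can
be sketched in plain userspace C. This is not code from the series and not the
i915 API: the function names, the millisecond units and the margin_ms parameter
(one possible reading of "close to") are all assumptions made for the example.
The first helper models expiry counted from the moment a request becomes
runnable, independent of GPU contention; the second checks that the expiry is
comfortably above three times the heartbeat interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of i915: a request is considered expired
 * once expiry_ms have elapsed since it became runnable. Time spent waiting
 * behind other clients is deliberately not subtracted, which mirrors the
 * "keeps counting regardless of contention" semantics in the cover letter.
 */
static inline bool request_expired(uint64_t runnable_ms, uint64_t now_ms,
                                   uint64_t expiry_ms)
{
	/* Assumes a monotonic clock, so "now" is never before "runnable". */
	assert(now_ms >= runnable_ms);

	return now_ms - runnable_ms >= expiry_ms;
}

/*
 * Hypothetical sanity check: the expiry must be strictly greater than
 * three heartbeat intervals plus a caller-chosen safety margin.
 * margin_ms is an assumption; the cover letter only says "close to".
 */
static inline bool expiry_is_safe(uint64_t expiry_ms, uint64_t heartbeat_ms,
                                  uint64_t margin_ms)
{
	uint64_t floor_ms = 3 * heartbeat_ms + margin_ms;

	return expiry_ms > floor_ms;
}
```

With the 20s expiry from v3 and an illustrative 2500ms heartbeat interval,
expiry_is_safe(20000, 2500, 1000) holds, whereas an expiry of 8000ms with the
same heartbeat and margin would be flagged as too close.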