All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Subject: [PATCH 20/27] drm/i915/gt: Expose heartbeat interval via sysfs
Date: Tue, 12 Nov 2019 09:28:47 +0000	[thread overview]
Message-ID: <20191112092854.869-20-chris@chris-wilson.co.uk> (raw)
In-Reply-To: <20191112092854.869-1-chris@chris-wilson.co.uk>

We monitor the health of the system via periodic heartbeat pulses. The
pulses also provide the opportunity to perform garbage collection.
However, we interpret an incomplete pulse (a missed heartbeat) as an
indication that the system is no longer responsive, i.e. hung, and
perform an engine or full GPU reset. Given that the preemption
granularity can be very coarse on a system, we let the sysadmin override
our legacy timeouts which were "optimised" for desktop applications.

The heartbeat interval can be adjusted per-engine using,

	/sys/class/drm/card?/engine/*/heartbeat_interval_ms

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/Kconfig.profile         |  3 ++
 drivers/gpu/drm/i915/gt/intel_engine_sysfs.c | 47 ++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 233bcdb6a6ca..d25f5c1316cc 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -20,6 +20,9 @@ config DRM_I915_HEARTBEAT_INTERVAL
 	  check the health of the GPU and undertake regular house-keeping of
 	  internal driver state.
 
+	  This is adjustable via
+	  /sys/class/drm/card?/engine/*/heartbeat_interval_ms
+
 	  May be 0 to disable heartbeats and therefore disable automatic GPU
 	  hang detection.
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c b/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c
index d299c66cf7ec..33b4c00b93f2 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c
@@ -9,6 +9,7 @@
 
 #include "i915_drv.h"
 #include "intel_engine.h"
+#include "intel_engine_heartbeat.h"
 #include "intel_engine_sysfs.h"
 
 struct kobj_engine {
@@ -315,6 +316,49 @@ preempt_timeout_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute preempt_timeout_attr =
 __ATTR(preempt_timeout_ms, 0644, preempt_timeout_show, preempt_timeout_store);
 
+static ssize_t
+heartbeat_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct intel_engine_cs *engine = kobj_to_engine(kobj);
+	unsigned long long delay;
+	int err;
+
+	/*
+	 * We monitor the health of the system via periodic heartbeat pulses.
+	 * The pulses also provide the opportunity to perform garbage
+	 * collection.  However, we interpret an incomplete pulse (a missed
+	 * heartbeat) as an indication that the system is no longer responsive,
+	 * i.e. hung, and perform an engine or full GPU reset. Given that the
+	 * preemption granularity can be very coarse on a system, the optimal
+	 * value for any workload is unknowable!
+	 */
+
+	err = kstrtoull(buf, 0, &delay);
+	if (err)
+		return err;
+
+	if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
+		return -EINVAL;
+
+	err = intel_engine_set_heartbeat(engine, delay);
+	if (err)
+		return err;
+
+	return count;
+}
+
+static ssize_t
+heartbeat_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	struct intel_engine_cs *engine = kobj_to_engine(kobj);
+
+	return sprintf(buf, "%lu\n", engine->props.heartbeat_interval_ms);
+}
+
+static struct kobj_attribute heartbeat_interval_attr =
+__ATTR(heartbeat_interval_ms, 0644, heartbeat_show, heartbeat_store);
+
 static void kobj_engine_release(struct kobject *kobj)
 {
 	kfree(kobj);
@@ -357,6 +401,9 @@ void intel_engines_add_sysfs(struct drm_i915_private *i915)
 		&all_caps_attr.attr,
 		&max_spin_attr.attr,
 		&stop_timeout_attr.attr,
+#if CONFIG_DRM_I915_HEARTBEAT_INTERVAL
+		&heartbeat_interval_attr.attr,
+#endif
 		NULL
 	};
 
-- 
2.24.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

WARNING: multiple messages have this Message-ID (diff)
From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Subject: [Intel-gfx] [PATCH 20/27] drm/i915/gt: Expose heartbeat interval via sysfs
Date: Tue, 12 Nov 2019 09:28:47 +0000	[thread overview]
Message-ID: <20191112092854.869-20-chris@chris-wilson.co.uk> (raw)
Message-ID: <20191112092847.JkHtJ9xp_ViqZtnXGDZwPDCBN-lKHLRdhp-neMHYS-Q@z> (raw)
In-Reply-To: <20191112092854.869-1-chris@chris-wilson.co.uk>

We monitor the health of the system via periodic heartbeat pulses. The
pulses also provide the opportunity to perform garbage collection.
However, we interpret an incomplete pulse (a missed heartbeat) as an
indication that the system is no longer responsive, i.e. hung, and
perform an engine or full GPU reset. Given that the preemption
granularity can be very coarse on a system, we let the sysadmin override
our legacy timeouts which were "optimised" for desktop applications.

The heartbeat interval can be adjusted per-engine using,

	/sys/class/drm/card?/engine/*/heartbeat_interval_ms

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/Kconfig.profile         |  3 ++
 drivers/gpu/drm/i915/gt/intel_engine_sysfs.c | 47 ++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 233bcdb6a6ca..d25f5c1316cc 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -20,6 +20,9 @@ config DRM_I915_HEARTBEAT_INTERVAL
 	  check the health of the GPU and undertake regular house-keeping of
 	  internal driver state.
 
+	  This is adjustable via
+	  /sys/class/drm/card?/engine/*/heartbeat_interval_ms
+
 	  May be 0 to disable heartbeats and therefore disable automatic GPU
 	  hang detection.
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c b/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c
index d299c66cf7ec..33b4c00b93f2 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_sysfs.c
@@ -9,6 +9,7 @@
 
 #include "i915_drv.h"
 #include "intel_engine.h"
+#include "intel_engine_heartbeat.h"
 #include "intel_engine_sysfs.h"
 
 struct kobj_engine {
@@ -315,6 +316,49 @@ preempt_timeout_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute preempt_timeout_attr =
 __ATTR(preempt_timeout_ms, 0644, preempt_timeout_show, preempt_timeout_store);
 
+static ssize_t
+heartbeat_store(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct intel_engine_cs *engine = kobj_to_engine(kobj);
+	unsigned long long delay;
+	int err;
+
+	/*
+	 * We monitor the health of the system via periodic heartbeat pulses.
+	 * The pulses also provide the opportunity to perform garbage
+	 * collection.  However, we interpret an incomplete pulse (a missed
+	 * heartbeat) as an indication that the system is no longer responsive,
+	 * i.e. hung, and perform an engine or full GPU reset. Given that the
+	 * preemption granularity can be very coarse on a system, the optimal
+	 * value for any workload is unknowable!
+	 */
+
+	err = kstrtoull(buf, 0, &delay);
+	if (err)
+		return err;
+
+	if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
+		return -EINVAL;
+
+	err = intel_engine_set_heartbeat(engine, delay);
+	if (err)
+		return err;
+
+	return count;
+}
+
+static ssize_t
+heartbeat_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	struct intel_engine_cs *engine = kobj_to_engine(kobj);
+
+	return sprintf(buf, "%lu\n", engine->props.heartbeat_interval_ms);
+}
+
+static struct kobj_attribute heartbeat_interval_attr =
+__ATTR(heartbeat_interval_ms, 0644, heartbeat_show, heartbeat_store);
+
 static void kobj_engine_release(struct kobject *kobj)
 {
 	kfree(kobj);
@@ -357,6 +401,9 @@ void intel_engines_add_sysfs(struct drm_i915_private *i915)
 		&all_caps_attr.attr,
 		&max_spin_attr.attr,
 		&stop_timeout_attr.attr,
+#if CONFIG_DRM_I915_HEARTBEAT_INTERVAL
+		&heartbeat_interval_attr.attr,
+#endif
 		NULL
 	};
 
-- 
2.24.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2019-11-12  9:29 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-12  9:28 [PATCH 01/27] drm/i915: Flush context free work on cleanup Chris Wilson
2019-11-12  9:28 ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 02/27] drm/i915/gt: Try an extra flush on the Haswell blitter Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 03/27] drm/i915/gem: Silence sparse for RCU protection inside the constructor Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 04/27] drm/i915/selftests: Mock the engine sorting for easy validation Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 05/27] Revert "drm/i915: use a separate context for gpu relocs" Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 06/27] drm/i915: Use a ctor for TYPESAFE_BY_RCU i915_request Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 07/27] drm/i915: Drop GEM context as a direct link from i915_request Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 08/27] drm/i915: Push the use-semaphore marker onto the intel_context Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 09/27] drm/i915: Remove i915->kernel_context Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 10/27] drm/i915: Move i915_gem_init_contexts() earlier Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 11/27] drm/i915/uc: Use an internal buffer for firmware images Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 12/27] drm/i915/gt: Pull GT initialisation under intel_gt_init() Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 13/27] drm/i915/gt: Merge engine init/setup loops Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 14/27] drm/i915/gt: Expose engine properties via sysfs Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 15/27] drm/i915/gt: Expose engine->mmio_base " Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-21 13:18   ` Lionel Landwerlin
2019-11-21 13:18     ` [Intel-gfx] " Lionel Landwerlin
2019-11-21 13:23     ` Chris Wilson
2019-11-21 13:23       ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 16/27] drm/i915/gt: Expose timeslice duration to sysfs Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 17/27] drm/i915/gt: Expose busywait " Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 18/27] drm/i915/gt: Expose reset stop timeout via sysfs Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 19/27] drm/i915/gt: Expose preempt reset " Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` Chris Wilson [this message]
2019-11-12  9:28   ` [Intel-gfx] [PATCH 20/27] drm/i915/gt: Expose heartbeat interval " Chris Wilson
2019-11-12  9:28 ` [PATCH 21/27] drm/i915: Flush idle barriers when waiting Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 22/27] drm/i915: Allow userspace to specify ringsize on construction Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 23/27] drm/i915/gem: Honour O_NONBLOCK before throttling execbuf submissions Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 24/27] drm/i915/gt: Set unused mocs entry to follow PTE on tgl as on all others Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12 14:13   ` Mika Kuoppala
2019-11-12 14:13     ` [Intel-gfx] " Mika Kuoppala
2019-11-12 14:39     ` Chris Wilson
2019-11-12 14:39       ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 25/27] drm/i915/gt: Tidy up debug-warns for the mocs control table Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 26/27] drm/i915/gt: Refactor mocs loops into single control macro Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12 17:02   ` Mika Kuoppala
2019-11-12 17:02     ` [Intel-gfx] " Mika Kuoppala
2019-11-12 18:12     ` Chris Wilson
2019-11-12 18:12       ` [Intel-gfx] " Chris Wilson
2019-11-12  9:28 ` [PATCH 27/27] drm/i915/selftests: Add coverage of mocs registers Chris Wilson
2019-11-12  9:28   ` [Intel-gfx] " Chris Wilson
2019-11-12 10:12 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/27] drm/i915: Flush context free work on cleanup Patchwork
2019-11-12 10:12   ` [Intel-gfx] " Patchwork
2019-11-12 10:23 ` ✗ Fi.CI.SPARSE: " Patchwork
2019-11-12 10:23   ` [Intel-gfx] " Patchwork
2019-11-12 10:33 ` ✗ Fi.CI.BAT: failure " Patchwork
2019-11-12 10:33   ` [Intel-gfx] " Patchwork
2019-11-12 14:23 ` [PATCH 01/27] " Mika Kuoppala
2019-11-12 14:23   ` [Intel-gfx] " Mika Kuoppala
2019-11-12 14:39   ` Chris Wilson
2019-11-12 14:39     ` [Intel-gfx] " Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191112092854.869-20-chris@chris-wilson.co.uk \
    --to=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.