From mboxrd@z Thu Jan 1 00:00:00 1970
From: Janusz Krzysztofik
To: igt-dev@lists.freedesktop.org
Cc: Michał Winiarski, intel-gfx@lists.freedesktop.org
Date: Thu, 20 Aug 2020 16:52:13 +0200
Message-Id: <20200820145215.13238-18-janusz.krzysztofik@linux.intel.com>
In-Reply-To: <20200820145215.13238-1-janusz.krzysztofik@linux.intel.com>
References: <20200820145215.13238-1-janusz.krzysztofik@linux.intel.com>
X-Mailer: git-send-email 2.21.1
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
List-Id: Intel graphics driver community testing & development
Subject: [Intel-gfx] [PATCH i-g-t v3 17/19] tests/core_hotunplug: More thorough i915 healthcheck and recovery

The test currently assumes the i915 driver is able to identify potential
hardware or driver issues while rebinding to a device and indicate them
by marking the GPU wedged.  Should that assumption prove wrong, the
health check phase of the test would happily succeed while potentially
leaving the device in an unusable state.  That would not only give us
false positive test results but could also affect applications run
subsequently.  We should therefore examine the health of the exercised
device more thoroughly and try harder to recover it from any detected
stalls.

We could use the gem_test_engine() library function, which submits a NOP
batch to each physical engine and asserts its successful execution.
Unfortunately, on failure this function jumps out of the IGT test
section it is called from, while we would like to continue with recovery
steps, preferably without adding another level of test section nesting.
Moreover, the function opens the device again and doesn't close the
extra file descriptor before the jump, while we need to be able to close
the exercised device completely before running certain subtest
operations.  Therefore, reimplement the function locally with those
issues fixed and use it as an i915 healthcheck.  Also call it on test
startup so that operations performed by the test are never blamed for
driver or hardware issues that already exist and can be detected on test
start.

Should the i915 GPU be found unresponsive by the health check called
from a recovery section, try harder to recover it to a usable state with
a global GPU reset.

For still more effective detection of GPU hangs, use the hang detector
provided by the IGT library.  However, replace the library signal
handler with our own implementation that doesn't jump out of the current
IGT test section on GPU hang, so we are still able to perform the reset
and retry.
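
To illustrate the last point: a handler that only records the signal
leaves it to the caller to decide what to do once the monitored work
returns.  The following is a minimal standalone sketch, not the code from
this patch.  It needs no GPU, simulates the hang notification with
raise(SIGIO), and all names in it (hang_seen, record_hang, work_hung and
the two work stubs) are hypothetical.  Unlike the patch, it keeps the
flag in a volatile sig_atomic_t, the type ISO C allows a handler to
assign to.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flag; written only from the signal handler below. */
static volatile sig_atomic_t hang_seen;

/* Record the event and return; no longjmp, no exit. */
static void record_hang(int sig)
{
	(void)sig;
	hang_seen = 1;
}

/* Stand-ins for monitored work: one completes quietly, one "hangs". */
static void healthy_work(void)
{
}

static void hanging_work(void)
{
	raise(SIGIO);	/* simulates the hang detector notifying us */
}

/*
 * Run work() with record_hang() installed for SIGIO, restore the
 * previous disposition, and report whether the signal was seen.
 */
static bool work_hung(void (*work)(void))
{
	void (*prev)(int);

	hang_seen = 0;
	prev = signal(SIGIO, record_hang);
	if (prev == SIG_ERR)
		return true;	/* cannot monitor, assume the worst */

	work();

	signal(SIGIO, prev);
	return hang_seen != 0;
}
```

work_hung(hanging_work) reports true and work_hung(healthy_work) reports
false; in both cases control returns to the caller, which is what leaves
room for a reset and a retry.
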

Signed-off-by: Janusz Krzysztofik
---
 tests/core_hotunplug.c | 88 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 80 insertions(+), 8 deletions(-)

diff --git a/tests/core_hotunplug.c b/tests/core_hotunplug.c
index 24beed81a..277679ea1 100644
--- a/tests/core_hotunplug.c
+++ b/tests/core_hotunplug.c
@@ -23,8 +23,10 @@
 
 #include
 #include
+#include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -195,7 +197,71 @@ static void cleanup(struct hotunplug *priv)
 	priv->fd.sysfs_dev = close_sysfs(priv->fd.sysfs_dev);
 }
 
-static void healthcheck(struct hotunplug *priv)
+static bool local_i915_is_wedged(int i915)
+{
+	int err = 0;
+
+	if (ioctl(i915, DRM_IOCTL_I915_GEM_THROTTLE))
+		err = -errno;
+	return err == -EIO;
+}
+
+static bool hang_detected;
+
+static void local_sig_abort(int sig)
+{
+	errno = 0; /* inside a signal, last errno reporting is confusing */
+	hang_detected = true;
+}
+
+static int local_i915_healthcheck(int i915)
+{
+	const uint32_t bbe = MI_BATCH_BUFFER_END;
+	struct drm_i915_gem_exec_object2 obj = { };
+	struct drm_i915_gem_execbuffer2 execbuf = {
+		.buffers_ptr = to_user_pointer(&obj),
+		.buffer_count = 1,
+	};
+	const struct intel_execution_engine2 *engine;
+
+	igt_debug("running i915 GPU healthcheck\n");
+
+	if (local_i915_is_wedged(i915))
+		return -EIO;
+
+	obj.handle = gem_create(i915, 4096);
+	gem_write(i915, obj.handle, 0, &bbe, sizeof(bbe));
+
+	hang_detected = false;
+	igt_fork_hang_detector(i915);
+	signal(SIGIO, local_sig_abort);
+
+	__for_each_physical_engine(i915, engine) {
+		execbuf.flags = engine->flags;
+		gem_execbuf(i915, &execbuf);
+	}
+
+	gem_sync(i915, obj.handle);
+	gem_close(i915, obj.handle);
+
+	igt_stop_hang_detector();
+	if (hang_detected)
+		return -EIO;
+
+	if (local_i915_is_wedged(i915))
+		return -EIO;
+
+	return 0;
+}
+
+static int local_i915_recover(int i915)
+{
+	igt_debug("forcing i915 GPU reset\n");
+	igt_force_gpu_reset(i915);
+	return local_i915_healthcheck(i915);
+}
+
+static void healthcheck(struct hotunplug *priv, bool recover)
 {
 	/* preserve error code potentially stored before in priv->fd.drm */
 	bool closed = priv->fd.drm == -1;
@@ -210,9 +276,14 @@ static void healthcheck(struct hotunplug *priv)
 		priv->fd.drm = fd_drm;
 
 	if (is_i915_device(fd_drm)) {
-		priv->failure = "GEM failure";
-		igt_require_gem(fd_drm);
-		priv->failure = NULL;
+		/* don't report library failed asserts as healthcheck failure */
+		priv->failure = "Unrecoverable test failure";
+		if (local_i915_healthcheck(fd_drm) &&
+		    (!recover || local_i915_recover(fd_drm)))
+			priv->failure = "Healthcheck failure!";
+		else
+			priv->failure = NULL;
+
 	} else {
 		/* no device specific healthcheck, rely on reopen result */
 		priv->failure = NULL;
@@ -237,7 +308,7 @@ static void recover(struct hotunplug *priv)
 		driver_bind(priv, 60);
 
 	if (priv->failure)
-		healthcheck(priv);
+		healthcheck(priv, true);
 }
 
 static void post_healthcheck(struct hotunplug *priv)
@@ -271,7 +342,7 @@ static void unbind_rebind(struct hotunplug *priv)
 
 	driver_bind(priv, 0);
 
-	healthcheck(priv);
+	healthcheck(priv, false);
 }
 
 static void unplug_rescan(struct hotunplug *priv)
@@ -280,7 +351,7 @@ static void unplug_rescan(struct hotunplug *priv)
 
 	bus_rescan(priv, 0);
 
-	healthcheck(priv);
+	healthcheck(priv, false);
 }
 
 static void hotunbind_lateclose(struct hotunplug *priv)
@@ -326,7 +397,8 @@ igt_main
 
 		if (is_i915_device(fd_drm)) {
 			gem_quiescent_gpu(fd_drm);
-			igt_require_gem(fd_drm);
+			igt_skip_on_f(local_i915_healthcheck(fd_drm),
+				      "i915 device not healthy on test start\n");
 		}
 
 		/* Make sure subtests always reopen the same device */
-- 
2.21.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
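
The condition added to healthcheck() reports a failure only if the first
check fails and either recovery is not allowed or the recovery attempt
fails as well.  Below is a small self-contained model of that
short-circuit logic.  It is only a sketch: struct fake_gpu, which becomes
healthy after a given number of resets, and the fake_*() and
check_device() helpers are hypothetical and do not exist in IGT.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model: healthy once enough resets were done. */
struct fake_gpu {
	int resets_needed;
	int resets_done;
};

/* Returns 0 when healthy, -EIO otherwise (same convention as the patch). */
static int fake_healthcheck(const struct fake_gpu *gpu)
{
	return gpu->resets_done >= gpu->resets_needed ? 0 : -EIO;
}

/* One forced reset followed by a re-check. */
static int fake_recover(struct fake_gpu *gpu)
{
	gpu->resets_done++;
	return fake_healthcheck(gpu);
}

/*
 * Mirrors the shape of the condition in healthcheck(): the recovery step
 * runs only when the first check failed and recovery is allowed.
 * Returns a failure message, or NULL when the device ends up healthy.
 */
static const char *check_device(struct fake_gpu *gpu, bool recover)
{
	if (fake_healthcheck(gpu) && (!recover || fake_recover(gpu)))
		return "Healthcheck failure!";

	return NULL;
}
```

In this model a device that needs one reset fails with recover == false
and passes with recover == true, while a healthy device is never reset
at all.
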