[igt-dev] [PATCH v8 0/1 i-g-t] tests: Add a new test for driver/device hot reload

All of lore.kernel.org
 help / color / mirror / Atom feed

* [igt-dev] [PATCH v8 0/1 i-g-t] tests: Add a new test for driver/device hot reload
@ 2019-04-30 11:29 Janusz Krzysztofik
  2019-04-30 11:29 ` [igt-dev] [PATCH v8 1/1 " Janusz Krzysztofik
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Janusz Krzysztofik @ 2019-04-30 11:29 UTC (permalink / raw)
  To: Arkadiusz Hiler, Petri Latvala; +Cc: janusz.krzysztofik, igt-dev, Daniel Vetter

The test should help resolving driver bugs which exhibit themselves
when a device is unplugged / driver unbind from a device while the
device is busy (different from simple module unload which requires 
device references being put first).

Janusz Krzysztofik (1):
  tests: Add a new test for driver/device hot reload

 tests/Makefile.sources  |   1 +
 tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
 tests/meson.build       |   1 +
 3 files changed, 410 insertions(+)
 create mode 100644 tests/core_hot_reload.c

Changelog:
v7->v8:
- move workload functions back from fixture to subtests,
- register different actions and different workloads in respective
  tables and iterate over those tables while enumerating subtests,
- introduce new subtest flavors by simply omiting module unload step,
- instead of simply requesting bus rescan or not, introduce action
  specific device recovery helpers, required specifically with those
  new subtests not touching the module,
- split workload functions in two parts, one spawning the workload,
  the other waiting for its completion,
- for the new subtests not requiring module unload, run workload
  functions directly from the test process and use new workload
  completion wait functions in place of subprocess completion wait,
- take more control over logging, longjumps and exit codes in
  workload subprocesses,
- add some debug messages for easy progress watching,
- move function API descriptions on top of respective typedefs,
- drop patch 2/2 with external workload command again, still nobody
  likes it.

v6->v7:
- add missing igt_exit() needed with the second patch.

v5->v6 (third public submission, incorrectly marked as v5, sorry):
- run workload inside an igt helper subprocess so resources consumed
  by the workload are cleaned up automatically on workload subprocess
  crash, without affecting test results,
- move the igt helper with workload back from subtests to initial
  fixture so workload crash also does not affect test results,
- re-add the second patch which extends the test with an option for
  using an external command as a workload,
- other cleanups suggested by Kasia and Chris.

v4->v5 (second public submission, marked as v2):
- try to restore the device to a working state after each subtest
  (Petri, Daniel).

v3->v4 (first public submission, not marked with any version number):
- run dummy_load from inside subtests (Antonio).

v2->v3 (internal submission):
- run dummy_load from the test process directly (Antonio),
- drop the patch for running external workload (Antonio).

v1->v2 (internal submission):
- run a subprocess with dummy_load instead of external command
  (Antonio),
- keep use of external workload command as an option, move that to a
  separate patch.
-- 
2.20.1

_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-04-30 11:29 [igt-dev] [PATCH v8 0/1 i-g-t] tests: Add a new test for driver/device hot reload Janusz Krzysztofik
@ 2019-04-30 11:29 ` Janusz Krzysztofik
  2019-04-30 15:05   ` Daniel Vetter
  2019-04-30 12:19 ` [igt-dev] ✓ Fi.CI.BAT: success for tests: Add a new test for driver/device hot reload (rev2) Patchwork
  2019-05-01  0:26 ` [igt-dev] ✗ Fi.CI.IGT: failure " Patchwork
  2 siblings, 1 reply; 13+ messages in thread
From: Janusz Krzysztofik @ 2019-04-30 11:29 UTC (permalink / raw)
  To: Arkadiusz Hiler, Petri Latvala; +Cc: janusz.krzysztofik, igt-dev, Daniel Vetter

From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>

Put some workload on a device, then try to either remove (unplug) the
device from its bus, or unbind the device's driver from it, possibly
followed by module unload, depending on which specific subtest has been
selected.  If succeeded, rescan the device's bus if needed and perform
health checks on the device with the driver possibly loaded back.

If module unload is requested, the workload is run in a sub-process,
not directly from the test, as it is expected to crash while still
keeping the device open for as long as its process has not exited.

The driver hot unbind / device hot unplug operation is expected to
succeed and the background workload sub-process to crash in a
reasonable time, however long timeouts are used to let kernel level
timeouts pop up first if hit by a bug.

The driver is ready for extending it with an arbitrary workload
functions as needed.  For now, a workload based on igt_dummyload is
implemented, hence subtests work only on i915 driver and are skipped on
other hardware, unless they provide their implementation of
igt_spin_new() and friends, or other workloads are implemented.

Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
---
 tests/Makefile.sources  |   1 +
 tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
 tests/meson.build       |   1 +
 3 files changed, 410 insertions(+)
 create mode 100644 tests/core_hot_reload.c

diff --git a/tests/Makefile.sources b/tests/Makefile.sources
index 7f921f6c..452d8ed7 100644
--- a/tests/Makefile.sources
+++ b/tests/Makefile.sources
@@ -16,6 +16,7 @@ TESTS_progs = \
 	core_getclient \
 	core_getstats \
 	core_getversion \
+	core_hot_reload \
 	core_setmaster_vs_auth \
 	debugfs_test \
 	drm_import_export \
diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
new file mode 100644
index 00000000..6673f55c
--- /dev/null
+++ b/tests/core_hot_reload.c
@@ -0,0 +1,408 @@
+/*
+ * Copyright © 2019 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#include "igt.h"
+#include "igt_device.h"
+#include "igt_dummyload.h"
+#include "igt_kmod.h"
+#include "igt_sysfs.h"
+
+#include <getopt.h>
+#include <limits.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+
+/**
+ * A post-action device recovery function:
+ * @priv: a pointer to private data required for device recovery
+ *
+ * Make the device re-appear
+ */
+typedef void (*recover_t)(const void *priv);
+
+/**
+ * A test action function:
+ * @dir: file descriptor of an open device sysfs directory
+ * @module: module name, non-NULL indicates post-action module unload requested
+ * @recover: for returning a pointer to a post-action device recovery function
+ * @priv: for returning a pointer to data to be passed to @recover
+ *
+ * Make the device disappear
+ */
+typedef void (*action_t)(int device, const char *module,
+			 recover_t *recover, const void **priv);
+
+/**
+ * A workload completion wait function:
+ * @device: open device file descriptor
+ * @priv: a pointer to private data required by the wait function
+ *
+ * Wait for completion of background workload
+ */
+typedef void (*workload_wait_t)(int device, void *priv);
+
+/**
+ * A workload function:
+ * @device: open device file descriptor
+ * @arg: a optional string argument passed to the workload function
+ * @workload_wait: for returning a pointer to workload completion wait function
+ * @priv: for returning a pointer to data to be passed to @workload_wait
+ *
+ * Put some long lasting load on the device
+ */
+typedef void (*workload_t)(int device, const char *arg,
+			   workload_wait_t *workload_wait, void **priv);
+
+/**
+ * Pairs of test action / device recovery functions
+ */
+
+/* Unbind / re-bind */
+
+struct rebind_data {
+	int driver;	/* open file descriptor of driver sysfs directory */
+	char *device;	/* bus specific device address as string */
+};
+
+/* Re-bind the driver to the device */
+static void driver_bind(const void *priv)
+{
+	const struct rebind_data *data = priv;
+
+	igt_set_timeout(60, "Driver re-bind timeout!");
+	igt_sysfs_set(data->driver, "bind", data->device);
+
+	close(data->driver);
+}
+
+/* Unbind the driver from the device */
+static void driver_unbind(int device, const char *module,
+			  recover_t *recover, const void **priv)
+{
+	static char path[PATH_MAX];
+	static struct rebind_data data;
+	int len;
+
+	/* collect information required for driver bind/unbind */
+	data.driver = openat(device, "device/driver", O_DIRECTORY);
+	igt_assert(data.driver >= 0);
+
+	len = readlinkat(device, "device", path, sizeof(path) - 1);
+	path[len] = '\0';
+	data.device = strrchr(path, '/') + 1;
+
+	/* unbind the driver */
+	igt_set_timeout(60, "Driver unbind timeout!");
+	igt_sysfs_set(data.driver, "unbind", data.device);
+
+	/* pass back info on how to recover the device */
+	if (module) {
+		/* don't try to rebind if module will be unloaded */
+		*recover = NULL;
+	} else {
+		*recover = driver_bind;
+		*priv = &data;
+	}
+}
+
+/* Unplug / re-plug */
+
+/* Re-discover the device by rescanning its bus */
+static void bus_rescan(const void *priv)
+{
+	const int *bus = priv;
+
+	igt_set_timeout(60, "Bus rescan timeout!");
+	igt_sysfs_set(*bus, "rescan", "1");
+
+	close(*bus);
+}
+
+/* Remove (virtually unplug) the device from its bus */
+static void device_unplug(int device, const char *module,
+			  recover_t *recover, const void **priv)
+{
+	static int bus;
+
+	/* collect information required for bus rescan */
+	bus = openat(device, "device/subsystem", O_DIRECTORY);
+	igt_assert(bus >= 0);
+
+	/* remove the device */
+	igt_set_timeout(60, "Device unplug timeout!");
+	igt_sysfs_set(device, "device/remove", "1");
+
+	/* pass back info on how to recover the device */
+	*recover = bus_rescan;
+	*priv = &bus;
+}
+
+/* Each test action function must be registered in the following table */
+static const struct {
+	const char *name;	/* unique test action name used in test names */
+	action_t function;	/* test action function pointer */
+} actions[] = {
+	{ "unbind", driver_unbind, },
+	{ "unplug", device_unplug, },
+};
+
+/**
+ * Pairs of workload / wait completion functions
+ */
+
+/* A workload using igt_spin_run() */
+
+/* Wait for completaion of dummy load */
+static void dummy_wait(int device, void *priv)
+{
+	igt_spin_t *spin = priv;
+
+	/* wait until the spin no longer runs, don't fail on error */
+	if (gem_wait(device, spin->handle, NULL))
+		__gem_set_domain(device, spin->handle,
+				 I915_GEM_DOMAIN_GTT, I915_GEM_DOMAIN_GTT);
+}
+
+/* Run dummy load */
+static void dummy_load(int device, const char *arg,
+		       workload_wait_t *workload_wait, void **priv)
+{
+	igt_spin_t *spin;
+
+	/* submit a job */
+	spin = igt_spin_new(device);
+
+	*workload_wait = dummy_wait;
+	*priv = spin;
+}
+
+/**
+ * Each workload function must be registered in the following table.
+ * A function may be registered more than once under different workload names,
+ * that makes sense as long as a different argument is specified for each name.
+ */
+static const struct {
+	const char *name;	/* unique workload name used in test names */
+	workload_t function;	/* workload function pointer */
+	const char *arg;	/* optional constant string argument */
+} workloads[] = {
+	{ "spin", dummy_load, NULL, },
+};
+
+/**
+ * Framework
+ */
+
+static void healthcheck(int chipset)
+{
+	int device;
+
+	device = __drm_open_driver(chipset);
+	igt_assert(device >= 0);
+
+	if (chipset == DRIVER_INTEL)
+		gem_test_engine(device, ALL_ENGINES);
+
+	close(device);
+}
+
+static void module_unload(int chipset, const char *module)
+{
+	if (chipset == DRIVER_INTEL)
+		igt_assert(igt_i915_driver_unload() == IGT_EXIT_SUCCESS);
+	else
+		igt_assert(igt_kmod_unload(module, 0) == 0);
+}
+
+static void run_action(int device, action_t action, const char *module,
+		      recover_t *recover, const void **priv)
+{
+	int dir;
+
+	dir = igt_sysfs_open(device);
+	igt_assert(dir >= 0);
+
+	action(dir, module, recover, priv);
+
+	close(dir);
+}
+
+static void wait_helper(int device, void *priv)
+{
+	struct igt_helper_process *proc = priv;
+
+	/* wait until the workload subprocess has completed */
+	igt_ignore_warn(igt_wait_helper(proc));
+}
+
+static void run_workload(int device, workload_t workload, const char *arg,
+			 const char *module, workload_wait_t *workload_wait,
+			 void **priv)
+{
+	if (module) {
+		/* run workload in a subprocess so the module is put on crash */
+		static struct igt_helper_process proc;
+		int wstatus, ret;
+
+		bzero(&proc, sizeof(proc));
+
+		igt_fork_helper(&proc) {
+			/* suppress igt_log messages */
+			igt_log_level = IGT_LOG_NONE;
+
+			/* intercept igt_fail/skip() long jumps */
+			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) {
+				workload(device, arg, workload_wait, priv);
+
+				(*workload_wait)(device, *priv);
+
+				/* success if not diverted by igt_fail/skip() */
+				igt_success();
+			}
+
+			/* pass exit code back to the caller */
+			igt_exit();
+		}
+		/* let the background process start doing its job or fail */
+		sleep(2);
+		/* fail or skip on workload premature completion */
+		ret = waitpid(proc.pid, &wstatus, WNOHANG);
+		if (ret < 0)
+			igt_fail(IGT_EXIT_INVALID);
+		if (ret) {
+			if (!WIFEXITED(wstatus))
+				igt_fail(IGT_EXIT_INVALID);
+			if (WEXITSTATUS(wstatus) == IGT_EXIT_SUCCESS)
+				igt_fail(IGT_EXIT_INVALID);
+			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
+				igt_skip(NULL);
+			igt_fail(WEXITSTATUS(wstatus));
+		}
+
+		/* pass back info on how to wait for helper completion */
+		*workload_wait = wait_helper;
+		*priv = &proc;
+	} else {
+		/* run the requested workload directly */
+		workload(device, arg, workload_wait, priv);
+	}
+}
+
+static void run_subtest(int chipset, int workload, int action,
+			const char *module)
+{
+	workload_wait_t workload_wait;
+	void *workload_priv;
+	recover_t recover;
+	const void *recover_priv;
+	int device;
+
+	igt_subtest_f("%s-%s%s", workloads[workload].name, actions[action].name,
+		      module ? "-unload" : "") {
+		device = __drm_open_driver(chipset);
+		igt_assert(device >= 0);
+
+		/* spawn the requested workload */
+		igt_debug("spawning background workload\n");
+		run_workload(device, workloads[workload].function,
+			     workloads[workload].arg, module,
+			     &workload_wait, &workload_priv);
+
+		/* run the requested test action */
+		igt_debug("running test action\n");
+		run_action(device, actions[action].function, module,
+			   &recover, &recover_priv);
+
+		if (workload_wait) {
+			igt_debug("waiting for workload completion\n");
+			workload_wait(device, workload_priv);
+		}
+
+		close(device);
+
+		if (module) {
+			igt_debug("unloading %s\n", module);
+			module_unload(chipset, module);
+		}
+
+		if (recover) {
+			igt_debug("recovering device\n");
+			recover(recover_priv);
+			igt_reset_timeout();
+		}
+
+		igt_debug("running healthcheck\n");
+		healthcheck(chipset);
+	}
+}
+
+igt_main {
+	int device, chipset;
+	char *module;
+	int i, j;
+
+	igt_fixture {
+		char path[PATH_MAX];
+		int dir, len;
+
+		/**
+		 * Since some subtests depend on successful unload of a driver
+		 * module, don't use drm_open_driver() as it keeps a device file
+		 * descriptor open for exit handler use and that effectively
+		 * prevents the module from being unloaded.
+		 */
+		device = __drm_open_driver(DRIVER_ANY);
+		igt_assert(device >= 0);
+
+		if (is_i915_device(device)) {
+			chipset = DRIVER_INTEL;
+			module = strdup("i915");
+		} else {
+			chipset = DRIVER_ANY;
+
+			/* Capture module name to be unloaded */
+			dir = igt_sysfs_open(device);
+			len = readlinkat(dir, "device/driver/module", path,
+					 sizeof(path) - 1);
+			close(dir);
+			path[len] = '\0';
+			module = strdup(strrchr(path, '/') + 1);
+		}
+		close(device);
+
+		igt_info("Running the test on driver \"%s\", chipset mask %#0x\n",
+			 module, chipset);
+	}
+
+	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
+		for (j = 0; j < sizeof(actions) / sizeof(*actions); j++) {
+			/* with module unload */
+			run_subtest(chipset, i, j, module);
+			/* without module unload */
+			run_subtest(chipset, i, j, NULL);
+		}
+	}
+}
diff --git a/tests/meson.build b/tests/meson.build
index 711979b4..0d418035 100644
--- a/tests/meson.build
+++ b/tests/meson.build
@@ -3,6 +3,7 @@ test_progs = [
 	'core_getclient',
 	'core_getstats',
 	'core_getversion',
+	'core_hot_reload',
 	'core_setmaster_vs_auth',
 	'debugfs_test',
 	'drm_import_export',
-- 
2.20.1

_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [igt-dev] ✓ Fi.CI.BAT: success for tests: Add a new test for driver/device hot reload (rev2)
  2019-04-30 11:29 [igt-dev] [PATCH v8 0/1 i-g-t] tests: Add a new test for driver/device hot reload Janusz Krzysztofik
  2019-04-30 11:29 ` [igt-dev] [PATCH v8 1/1 " Janusz Krzysztofik
@ 2019-04-30 12:19 ` Patchwork
  2019-05-01  0:26 ` [igt-dev] ✗ Fi.CI.IGT: failure " Patchwork
  2 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2019-04-30 12:19 UTC (permalink / raw)
  To: Janusz Krzysztofik; +Cc: igt-dev

== Series Details ==

Series: tests: Add a new test for driver/device hot reload (rev2)
URL   : https://patchwork.freedesktop.org/series/58991/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_6015 -> IGTPW_2937
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/58991/revisions/2/mbox/

Known issues
------------

  Here are the changes found in IGTPW_2937 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@i915_selftest@live_contexts:
    - fi-bdw-gvtdvm:      [PASS][1] -> [DMESG-FAIL][2] ([fdo#110235])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/fi-bdw-gvtdvm/igt@i915_selftest@live_contexts.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/fi-bdw-gvtdvm/igt@i915_selftest@live_contexts.html

  * igt@i915_selftest@live_hangcheck:
    - fi-icl-y:           [PASS][3] -> [INCOMPLETE][4] ([fdo#107713] / [fdo#108569])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/fi-icl-y/igt@i915_selftest@live_hangcheck.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/fi-icl-y/igt@i915_selftest@live_hangcheck.html

  
#### Warnings ####

  * igt@i915_pm_rpm@basic-pci-d3-state:
    - fi-kbl-guc:         [SKIP][5] ([fdo#109271]) -> [INCOMPLETE][6] ([fdo#107807])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/fi-kbl-guc/igt@i915_pm_rpm@basic-pci-d3-state.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/fi-kbl-guc/igt@i915_pm_rpm@basic-pci-d3-state.html

  
  [fdo#107713]: https://bugs.freedesktop.org/show_bug.cgi?id=107713
  [fdo#107807]: https://bugs.freedesktop.org/show_bug.cgi?id=107807
  [fdo#108569]: https://bugs.freedesktop.org/show_bug.cgi?id=108569
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#110235]: https://bugs.freedesktop.org/show_bug.cgi?id=110235


Participating hosts (54 -> 46)
------------------------------

  Missing    (8): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 fi-byt-clapper fi-bdw-samus 


Build changes
-------------

  * IGT: IGT_4971 -> IGTPW_2937

  CI_DRM_6015: fef230161046ccb2ee8e5b6bf56ac85733f49af1 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGTPW_2937: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/
  IGT_4971: fc5e0467eb6913d21ad932aa8a31c77fdb5a9c77 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools



== Testlist changes ==

+igt@core_hot_reload@spin-unbind
+igt@core_hot_reload@spin-unbind-unload
+igt@core_hot_reload@spin-unplug
+igt@core_hot_reload@spin-unplug-unload

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-04-30 11:29 ` [igt-dev] [PATCH v8 1/1 " Janusz Krzysztofik
@ 2019-04-30 15:05   ` Daniel Vetter
  2019-05-06  8:44     ` Janusz Krzysztofik
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2019-04-30 15:05 UTC (permalink / raw)
  To: Janusz Krzysztofik
  Cc: janusz.krzysztofik, Petri Latvala, igt-dev, Daniel Vetter

On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> 
> Put some workload on a device, then try to either remove (unplug) the
> device from its bus, or unbind the device's driver from it, possibly
> followed by module unload, depending on which specific subtest has been
> selected.  If succeeded, rescan the device's bus if needed and perform
> health checks on the device with the driver possibly loaded back.
> 
> If module unload is requested, the workload is run in a sub-process,
> not directly from the test, as it is expected to crash while still
> keeping the device open for as long as its process has not exited.
> 
> The driver hot unbind / device hot unplug operation is expected to
> succeed and the background workload sub-process to crash in a
> reasonable time, however long timeouts are used to let kernel level
> timeouts pop up first if hit by a bug.
> 
> The driver is ready for extending it with an arbitrary workload
> functions as needed.  For now, a workload based on igt_dummyload is
> implemented, hence subtests work only on i915 driver and are skipped on
> other hardware, unless they provide their implementation of
> igt_spin_new() and friends, or other workloads are implemented.
> 
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>

High level comments and apologies that I didn't look at v2-v7 in between.

This all seems extremely complex for a simple batch spinner subtest ... do
we really need all that complexity with 2nd process and watchers and a
bunch of callbacks and everything, just do to a hotremove testcase?

Very first patch looked much more reasonable, aside from that it broke CI
since it didn't rebind the driver. We can always add complexity later on
once we have dma-buf/dma-fence/kms/whatever else substests here.

Also, I think we should have at least one hotremove-only-nothing-special
subtest here, i.e. without even the busy batch.

I'm also not sure why we also put module unload tests in there. Compared
to hotunplug of a discrete gfx card (external one over usb or thunderbolt
or whatever), which is something users can do, module unload is explicitly
a developer only feature. We do not expect module unload to work under all
possible conditions (it doesn't). I'd drop that part and focus completely
on the hotremove/unbind testcase here.
-Daniel

> ---
>  tests/Makefile.sources  |   1 +
>  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
>  tests/meson.build       |   1 +
>  3 files changed, 410 insertions(+)
>  create mode 100644 tests/core_hot_reload.c
> 
> diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> index 7f921f6c..452d8ed7 100644
> --- a/tests/Makefile.sources
> +++ b/tests/Makefile.sources
> @@ -16,6 +16,7 @@ TESTS_progs = \
>  	core_getclient \
>  	core_getstats \
>  	core_getversion \
> +	core_hot_reload \
>  	core_setmaster_vs_auth \
>  	debugfs_test \
>  	drm_import_export \
> diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> new file mode 100644
> index 00000000..6673f55c
> --- /dev/null
> +++ b/tests/core_hot_reload.c
> @@ -0,0 +1,408 @@
> +/*
> + * Copyright © 2019 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + */
> +
> +#include "igt.h"
> +#include "igt_device.h"
> +#include "igt_dummyload.h"
> +#include "igt_kmod.h"
> +#include "igt_sysfs.h"
> +
> +#include <getopt.h>
> +#include <limits.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +
> +/**
> + * A post-action device recovery function:
> + * @priv: a pointer to private data required for device recovery
> + *
> + * Make the device re-appear
> + */
> +typedef void (*recover_t)(const void *priv);
> +
> +/**
> + * A test action function:
> + * @dir: file descriptor of an open device sysfs directory
> + * @module: module name, non-NULL indicates post-action module unload requested
> + * @recover: for returning a pointer to a post-action device recovery function
> + * @priv: for returning a pointer to data to be passed to @recover
> + *
> + * Make the device disappear
> + */
> +typedef void (*action_t)(int device, const char *module,
> +			 recover_t *recover, const void **priv);
> +
> +/**
> + * A workload completion wait function:
> + * @device: open device file descriptor
> + * @priv: a pointer to private data required by the wait function
> + *
> + * Wait for completion of background workload
> + */
> +typedef void (*workload_wait_t)(int device, void *priv);
> +
> +/**
> + * A workload function:
> + * @device: open device file descriptor
> + * @arg: a optional string argument passed to the workload function
> + * @workload_wait: for returning a pointer to workload completion wait function
> + * @priv: for returning a pointer to data to be passed to @workload_wait
> + *
> + * Put some long lasting load on the device
> + */
> +typedef void (*workload_t)(int device, const char *arg,
> +			   workload_wait_t *workload_wait, void **priv);
> +
> +/**
> + * Pairs of test action / device recovery functions
> + */
> +
> +/* Unbind / re-bind */
> +
> +struct rebind_data {
> +	int driver;	/* open file descriptor of driver sysfs directory */
> +	char *device;	/* bus specific device address as string */
> +};
> +
> +/* Re-bind the driver to the device */
> +static void driver_bind(const void *priv)
> +{
> +	const struct rebind_data *data = priv;
> +
> +	igt_set_timeout(60, "Driver re-bind timeout!");
> +	igt_sysfs_set(data->driver, "bind", data->device);
> +
> +	close(data->driver);
> +}
> +
> +/* Unbind the driver from the device */
> +static void driver_unbind(int device, const char *module,
> +			  recover_t *recover, const void **priv)
> +{
> +	static char path[PATH_MAX];
> +	static struct rebind_data data;
> +	int len;
> +
> +	/* collect information required for driver bind/unbind */
> +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> +	igt_assert(data.driver >= 0);
> +
> +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> +	path[len] = '\0';
> +	data.device = strrchr(path, '/') + 1;
> +
> +	/* unbind the driver */
> +	igt_set_timeout(60, "Driver unbind timeout!");
> +	igt_sysfs_set(data.driver, "unbind", data.device);
> +
> +	/* pass back info on how to recover the device */
> +	if (module) {
> +		/* don't try to rebind if module will be unloaded */
> +		*recover = NULL;
> +	} else {
> +		*recover = driver_bind;
> +		*priv = &data;
> +	}
> +}
> +
> +/* Unplug / re-plug */
> +
> +/* Re-discover the device by rescanning its bus */
> +static void bus_rescan(const void *priv)
> +{
> +	const int *bus = priv;
> +
> +	igt_set_timeout(60, "Bus rescan timeout!");
> +	igt_sysfs_set(*bus, "rescan", "1");
> +
> +	close(*bus);
> +}
> +
> +/* Remove (virtually unplug) the device from its bus */
> +static void device_unplug(int device, const char *module,
> +			  recover_t *recover, const void **priv)
> +{
> +	static int bus;
> +
> +	/* collect information required for bus rescan */
> +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> +	igt_assert(bus >= 0);
> +
> +	/* remove the device */
> +	igt_set_timeout(60, "Device unplug timeout!");
> +	igt_sysfs_set(device, "device/remove", "1");
> +
> +	/* pass back info on how to recover the device */
> +	*recover = bus_rescan;
> +	*priv = &bus;
> +}
> +
> +/* Each test action function must be registered in the following table */
> +static const struct {
> +	const char *name;	/* unique test action name used in test names */
> +	action_t function;	/* test action function pointer */
> +} actions[] = {
> +	{ "unbind", driver_unbind, },
> +	{ "unplug", device_unplug, },
> +};
> +
> +/**
> + * Pairs of workload / wait completion functions
> + */
> +
> +/* A workload using igt_spin_run() */
> +
> +/* Wait for completaion of dummy load */
> +static void dummy_wait(int device, void *priv)
> +{
> +	igt_spin_t *spin = priv;
> +
> +	/* wait until the spin no longer runs, don't fail on error */
> +	if (gem_wait(device, spin->handle, NULL))
> +		__gem_set_domain(device, spin->handle,
> +				 I915_GEM_DOMAIN_GTT, I915_GEM_DOMAIN_GTT);
> +}
> +
> +/* Run dummy load */
> +static void dummy_load(int device, const char *arg,
> +		       workload_wait_t *workload_wait, void **priv)
> +{
> +	igt_spin_t *spin;
> +
> +	/* submit a job */
> +	spin = igt_spin_new(device);
> +
> +	*workload_wait = dummy_wait;
> +	*priv = spin;
> +}
> +
> +/**
> + * Each workload function must be registered in the following table.
> + * A function may be registered more than once under different workload names,
> + * that makes sense as long as a different argument is specified for each name.
> + */
> +static const struct {
> +	const char *name;	/* unique workload name used in test names */
> +	workload_t function;	/* workload function pointer */
> +	const char *arg;	/* optional constant string argument */
> +} workloads[] = {
> +	{ "spin", dummy_load, NULL, },
> +};
> +
> +/**
> + * Framework
> + */
> +
> +static void healthcheck(int chipset)
> +{
> +	int device;
> +
> +	device = __drm_open_driver(chipset);
> +	igt_assert(device >= 0);
> +
> +	if (chipset == DRIVER_INTEL)
> +		gem_test_engine(device, ALL_ENGINES);
> +
> +	close(device);
> +}
> +
> +static void module_unload(int chipset, const char *module)
> +{
> +	if (chipset == DRIVER_INTEL)
> +		igt_assert(igt_i915_driver_unload() == IGT_EXIT_SUCCESS);
> +	else
> +		igt_assert(igt_kmod_unload(module, 0) == 0);
> +}
> +
> +static void run_action(int device, action_t action, const char *module,
> +		      recover_t *recover, const void **priv)
> +{
> +	int dir;
> +
> +	dir = igt_sysfs_open(device);
> +	igt_assert(dir >= 0);
> +
> +	action(dir, module, recover, priv);
> +
> +	close(dir);
> +}
> +
> +static void wait_helper(int device, void *priv)
> +{
> +	struct igt_helper_process *proc = priv;
> +
> +	/* wait until the workload subprocess has completed */
> +	igt_ignore_warn(igt_wait_helper(proc));
> +}
> +
> +static void run_workload(int device, workload_t workload, const char *arg,
> +			 const char *module, workload_wait_t *workload_wait,
> +			 void **priv)
> +{
> +	if (module) {
> +		/* run workload in a subprocess so the module is put on crash */
> +		static struct igt_helper_process proc;
> +		int wstatus, ret;
> +
> +		bzero(&proc, sizeof(proc));
> +
> +		igt_fork_helper(&proc) {
> +			/* suppress igt_log messages */
> +			igt_log_level = IGT_LOG_NONE;
> +
> +			/* intercept igt_fail/skip() long jumps */
> +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) {
> +				workload(device, arg, workload_wait, priv);
> +
> +				(*workload_wait)(device, *priv);
> +
> +				/* success if not diverted by igt_fail/skip() */
> +				igt_success();
> +			}
> +
> +			/* pass exit code back to the caller */
> +			igt_exit();
> +		}
> +		/* let the background process start doing its job or fail */
> +		sleep(2);
> +		/* fail or skip on workload premature completion */
> +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> +		if (ret < 0)
> +			igt_fail(IGT_EXIT_INVALID);
> +		if (ret) {
> +			if (!WIFEXITED(wstatus))
> +				igt_fail(IGT_EXIT_INVALID);
> +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SUCCESS)
> +				igt_fail(IGT_EXIT_INVALID);
> +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> +				igt_skip(NULL);
> +			igt_fail(WEXITSTATUS(wstatus));
> +		}
> +
> +		/* pass back info on how to wait for helper completion */
> +		*workload_wait = wait_helper;
> +		*priv = &proc;
> +	} else {
> +		/* run the requested workload directly */
> +		workload(device, arg, workload_wait, priv);
> +	}
> +}
> +
> +static void run_subtest(int chipset, int workload, int action,
> +			const char *module)
> +{
> +	workload_wait_t workload_wait;
> +	void *workload_priv;
> +	recover_t recover;
> +	const void *recover_priv;
> +	int device;
> +
> +	igt_subtest_f("%s-%s%s", workloads[workload].name, actions[action].name,
> +		      module ? "-unload" : "") {
> +		device = __drm_open_driver(chipset);
> +		igt_assert(device >= 0);
> +
> +		/* spawn the requested workload */
> +		igt_debug("spawning background workload\n");
> +		run_workload(device, workloads[workload].function,
> +			     workloads[workload].arg, module,
> +			     &workload_wait, &workload_priv);
> +
> +		/* run the requested test action */
> +		igt_debug("running test action\n");
> +		run_action(device, actions[action].function, module,
> +			   &recover, &recover_priv);
> +
> +		if (workload_wait) {
> +			igt_debug("waiting for workload completion\n");
> +			workload_wait(device, workload_priv);
> +		}
> +
> +		close(device);
> +
> +		if (module) {
> +			igt_debug("unloading %s\n", module);
> +			module_unload(chipset, module);
> +		}
> +
> +		if (recover) {
> +			igt_debug("recovering device\n");
> +			recover(recover_priv);
> +			igt_reset_timeout();
> +		}
> +
> +		igt_debug("running healthcheck\n");
> +		healthcheck(chipset);
> +	}
> +}
> +
> +igt_main {
> +	int device, chipset;
> +	char *module;
> +	int i, j;
> +
> +	igt_fixture {
> +		char path[PATH_MAX];
> +		int dir, len;
> +
> +		/**
> +		 * Since some subtests depend on successful unload of a driver
> +		 * module, don't use drm_open_driver() as it keeps a device file
> +		 * descriptor open for exit handler use and that effectively
> +		 * prevents the module from being unloaded.
> +		 */
> +		device = __drm_open_driver(DRIVER_ANY);
> +		igt_assert(device >= 0);
> +
> +		if (is_i915_device(device)) {
> +			chipset = DRIVER_INTEL;
> +			module = strdup("i915");
> +		} else {
> +			chipset = DRIVER_ANY;
> +
> +			/* Capture module name to be unloaded */
> +			dir = igt_sysfs_open(device);
> +			len = readlinkat(dir, "device/driver/module", path,
> +					 sizeof(path) - 1);
> +			close(dir);
> +			path[len] = '\0';
> +			module = strdup(strrchr(path, '/') + 1);
> +		}
> +		close(device);
> +
> +		igt_info("Running the test on driver \"%s\", chipset mask %#0x\n",
> +			 module, chipset);
> +	}
> +
> +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j++) {
> +			/* with module unload */
> +			run_subtest(chipset, i, j, module);
> +			/* without module unload */
> +			run_subtest(chipset, i, j, NULL);
> +		}
> +	}
> +}
> diff --git a/tests/meson.build b/tests/meson.build
> index 711979b4..0d418035 100644
> --- a/tests/meson.build
> +++ b/tests/meson.build
> @@ -3,6 +3,7 @@ test_progs = [
>  	'core_getclient',
>  	'core_getstats',
>  	'core_getversion',
> +	'core_hot_reload',
>  	'core_setmaster_vs_auth',
>  	'debugfs_test',
>  	'drm_import_export',
> -- 
> 2.20.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [igt-dev] ✗ Fi.CI.IGT: failure for tests: Add a new test for driver/device hot reload (rev2)
  2019-04-30 11:29 [igt-dev] [PATCH v8 0/1 i-g-t] tests: Add a new test for driver/device hot reload Janusz Krzysztofik
  2019-04-30 11:29 ` [igt-dev] [PATCH v8 1/1 " Janusz Krzysztofik
  2019-04-30 12:19 ` [igt-dev] ✓ Fi.CI.BAT: success for tests: Add a new test for driver/device hot reload (rev2) Patchwork
@ 2019-05-01  0:26 ` Patchwork
  2 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2019-05-01  0:26 UTC (permalink / raw)
  To: Janusz Krzysztofik; +Cc: igt-dev

== Series Details ==

Series: tests: Add a new test for driver/device hot reload (rev2)
URL   : https://patchwork.freedesktop.org/series/58991/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_6015_full -> IGTPW_2937_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with IGTPW_2937_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in IGTPW_2937_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/58991/revisions/2/mbox/

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in IGTPW_2937_full:

### IGT changes ###

#### Possible regressions ####

  * {igt@core_hot_reload@spin-unbind} (NEW):
    - shard-snb:          NOTRUN -> [DMESG-WARN][1] +3 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb2/igt@core_hot_reload@spin-unbind.html
    - shard-hsw:          NOTRUN -> [DMESG-WARN][2] +1 similar issue
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw5/igt@core_hot_reload@spin-unbind.html

  * {igt@core_hot_reload@spin-unplug} (NEW):
    - shard-hsw:          NOTRUN -> [DMESG-FAIL][3] +1 similar issue
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw1/igt@core_hot_reload@spin-unplug.html

  * igt@kms_color@pipe-b-ctm-max:
    - shard-apl:          [PASS][4] -> [FAIL][5]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-apl1/igt@kms_color@pipe-b-ctm-max.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-apl4/igt@kms_color@pipe-b-ctm-max.html

  * igt@runner@aborted:
    - shard-hsw:          NOTRUN -> ([FAIL][6], [FAIL][7], [FAIL][8], [FAIL][9])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw5/igt@runner@aborted.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw5/igt@runner@aborted.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw1/igt@runner@aborted.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw1/igt@runner@aborted.html
    - shard-snb:          NOTRUN -> ([FAIL][10], [FAIL][11], [FAIL][12], [FAIL][13])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb4/igt@runner@aborted.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb7/igt@runner@aborted.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb5/igt@runner@aborted.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb2/igt@runner@aborted.html

  
New tests
---------

  New tests have been introduced between CI_DRM_6015_full and IGTPW_2937_full:

### New IGT tests (4) ###

  * igt@core_hot_reload@spin-unbind:
    - Statuses : 2 dmesg-warn(s) 4 incomplete(s)
    - Exec time: [0.0, 0.49] s

  * igt@core_hot_reload@spin-unbind-unload:
    - Statuses : 2 dmesg-warn(s) 4 incomplete(s)
    - Exec time: [0.0, 3.32] s

  * igt@core_hot_reload@spin-unplug:
    - Statuses : 1 dmesg-fail(s) 1 dmesg-warn(s) 4 incomplete(s)
    - Exec time: [0.0, 1.23] s

  * igt@core_hot_reload@spin-unplug-unload:
    - Statuses : 1 dmesg-fail(s) 1 dmesg-warn(s) 4 incomplete(s)
    - Exec time: [0.0, 5.06] s

  

Known issues
------------

  Here are the changes found in IGTPW_2937_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_eio@in-flight-suspend:
    - shard-kbl:          [PASS][14] -> [INCOMPLETE][15] ([fdo#103665])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-kbl3/igt@gem_eio@in-flight-suspend.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-kbl4/igt@gem_eio@in-flight-suspend.html

  * igt@gem_eio@unwedge-stress:
    - shard-glk:          [PASS][16] -> [INCOMPLETE][17] ([fdo#103359] / [k.org#198133]) +2 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-glk9/igt@gem_eio@unwedge-stress.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-glk1/igt@gem_eio@unwedge-stress.html
    - shard-iclb:         [PASS][18] -> [INCOMPLETE][19] ([fdo#107713]) +1 similar issue
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb4/igt@gem_eio@unwedge-stress.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb1/igt@gem_eio@unwedge-stress.html

  * igt@gem_softpin@softpin:
    - shard-hsw:          [PASS][20] -> [INCOMPLETE][21] ([fdo#103540]) +1 similar issue
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-hsw4/igt@gem_softpin@softpin.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw4/igt@gem_softpin@softpin.html

  * igt@kms_busy@extended-modeset-hang-oldfb-with-reset-render-a:
    - shard-snb:          [PASS][22] -> [SKIP][23] ([fdo#109271] / [fdo#109278])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-snb1/igt@kms_busy@extended-modeset-hang-oldfb-with-reset-render-a.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb1/igt@kms_busy@extended-modeset-hang-oldfb-with-reset-render-a.html

  * igt@kms_flip@modeset-vs-vblank-race-interruptible:
    - shard-hsw:          [PASS][24] -> [FAIL][25] ([fdo#103060])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-hsw6/igt@kms_flip@modeset-vs-vblank-race-interruptible.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw1/igt@kms_flip@modeset-vs-vblank-race-interruptible.html

  * igt@kms_flip@plain-flip-ts-check-interruptible:
    - shard-glk:          [PASS][26] -> [FAIL][27] ([fdo#100368])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-glk3/igt@kms_flip@plain-flip-ts-check-interruptible.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-glk8/igt@kms_flip@plain-flip-ts-check-interruptible.html

  * igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-draw-render:
    - shard-iclb:         [PASS][28] -> [FAIL][29] ([fdo#103167]) +8 similar issues
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb2/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-draw-render.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb8/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-draw-render.html

  * igt@kms_plane@plane-panning-bottom-right-suspend-pipe-c-planes:
    - shard-apl:          [PASS][30] -> [DMESG-WARN][31] ([fdo#108566]) +3 similar issues
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-apl2/igt@kms_plane@plane-panning-bottom-right-suspend-pipe-c-planes.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-apl7/igt@kms_plane@plane-panning-bottom-right-suspend-pipe-c-planes.html

  * igt@kms_psr@no_drrs:
    - shard-iclb:         [PASS][32] -> [FAIL][33] ([fdo#108341])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb3/igt@kms_psr@no_drrs.html
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb1/igt@kms_psr@no_drrs.html

  * igt@kms_psr@psr2_cursor_mmap_gtt:
    - shard-iclb:         [PASS][34] -> [SKIP][35] ([fdo#109441])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb2/igt@kms_psr@psr2_cursor_mmap_gtt.html
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb3/igt@kms_psr@psr2_cursor_mmap_gtt.html

  * igt@kms_rotation_crc@primary-rotation-90:
    - shard-iclb:         [PASS][36] -> [INCOMPLETE][37] ([fdo#107713] / [fdo#110026] / [fdo#110040 ])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb5/igt@kms_rotation_crc@primary-rotation-90.html
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb1/igt@kms_rotation_crc@primary-rotation-90.html

  * igt@kms_vblank@pipe-b-ts-continuation-modeset-hang:
    - shard-snb:          [PASS][38] -> [SKIP][39] ([fdo#109271]) +4 similar issues
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-snb7/igt@kms_vblank@pipe-b-ts-continuation-modeset-hang.html
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-snb1/igt@kms_vblank@pipe-b-ts-continuation-modeset-hang.html

  
#### Possible fixes ####

  * igt@gem_eio@in-flight-suspend:
    - shard-apl:          [DMESG-WARN][40] ([fdo#108566]) -> [PASS][41] +1 similar issue
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-apl6/igt@gem_eio@in-flight-suspend.html
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-apl6/igt@gem_eio@in-flight-suspend.html

  * igt@gem_fence_thrash@bo-write-verify-threaded-y:
    - shard-iclb:         [INCOMPLETE][42] ([fdo#107713] / [fdo#109100]) -> [PASS][43]
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb8/igt@gem_fence_thrash@bo-write-verify-threaded-y.html
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb5/igt@gem_fence_thrash@bo-write-verify-threaded-y.html

  * igt@kms_flip@2x-flip-vs-suspend:
    - shard-glk:          [INCOMPLETE][44] ([fdo#103359] / [k.org#198133]) -> [PASS][45] +1 similar issue
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-glk9/igt@kms_flip@2x-flip-vs-suspend.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-glk7/igt@kms_flip@2x-flip-vs-suspend.html

  * igt@kms_flip@flip-vs-suspend-interruptible:
    - shard-hsw:          [INCOMPLETE][46] ([fdo#103540]) -> [PASS][47]
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-hsw1/igt@kms_flip@flip-vs-suspend-interruptible.html
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-hsw4/igt@kms_flip@flip-vs-suspend-interruptible.html

  * igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-onoff:
    - shard-iclb:         [FAIL][48] ([fdo#103167]) -> [PASS][49]
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb7/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-onoff.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb3/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-cur-indfb-onoff.html

  * igt@kms_plane@pixel-format-pipe-c-planes-source-clamping:
    - shard-iclb:         [INCOMPLETE][50] ([fdo#107713] / [fdo#110036 ]) -> [PASS][51]
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb3/igt@kms_plane@pixel-format-pipe-c-planes-source-clamping.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb2/igt@kms_plane@pixel-format-pipe-c-planes-source-clamping.html

  * igt@kms_plane_lowres@pipe-a-tiling-x:
    - shard-iclb:         [FAIL][52] ([fdo#103166]) -> [PASS][53]
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb7/igt@kms_plane_lowres@pipe-a-tiling-x.html
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb3/igt@kms_plane_lowres@pipe-a-tiling-x.html

  * igt@kms_psr@psr2_primary_mmap_gtt:
    - shard-iclb:         [SKIP][54] ([fdo#109441]) -> [PASS][55] +1 similar issue
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb8/igt@kms_psr@psr2_primary_mmap_gtt.html
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb2/igt@kms_psr@psr2_primary_mmap_gtt.html

  * igt@kms_rotation_crc@bad-tiling:
    - shard-iclb:         [INCOMPLETE][56] ([fdo#107713] / [fdo#110026] / [fdo#110040 ]) -> [PASS][57]
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb6/igt@kms_rotation_crc@bad-tiling.html
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb1/igt@kms_rotation_crc@bad-tiling.html

  * igt@kms_setmode@basic:
    - shard-apl:          [FAIL][58] ([fdo#99912]) -> [PASS][59]
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-apl8/igt@kms_setmode@basic.html
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-apl6/igt@kms_setmode@basic.html

  * igt@kms_sysfs_edid_timing:
    - shard-iclb:         [FAIL][60] ([fdo#100047]) -> [PASS][61]
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-iclb2/igt@kms_sysfs_edid_timing.html
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-iclb4/igt@kms_sysfs_edid_timing.html

  
#### Warnings ####

  * igt@kms_cursor_crc@cursor-512x512-onscreen:
    - shard-apl:          [INCOMPLETE][62] ([fdo#103927]) -> [SKIP][63] ([fdo#109271])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6015/shard-apl4/igt@kms_cursor_crc@cursor-512x512-onscreen.html
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/shard-apl3/igt@kms_cursor_crc@cursor-512x512-onscreen.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#100047]: https://bugs.freedesktop.org/show_bug.cgi?id=100047
  [fdo#100368]: https://bugs.freedesktop.org/show_bug.cgi?id=100368
  [fdo#103060]: https://bugs.freedesktop.org/show_bug.cgi?id=103060
  [fdo#103166]: https://bugs.freedesktop.org/show_bug.cgi?id=103166
  [fdo#103167]: https://bugs.freedesktop.org/show_bug.cgi?id=103167
  [fdo#103359]: https://bugs.freedesktop.org/show_bug.cgi?id=103359
  [fdo#103540]: https://bugs.freedesktop.org/show_bug.cgi?id=103540
  [fdo#103665]: https://bugs.freedesktop.org/show_bug.cgi?id=103665
  [fdo#103927]: https://bugs.freedesktop.org/show_bug.cgi?id=103927
  [fdo#107713]: https://bugs.freedesktop.org/show_bug.cgi?id=107713
  [fdo#108341]: https://bugs.freedesktop.org/show_bug.cgi?id=108341
  [fdo#108566]: https://bugs.freedesktop.org/show_bug.cgi?id=108566
  [fdo#109100]: https://bugs.freedesktop.org/show_bug.cgi?id=109100
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
  [fdo#110026]: https://bugs.freedesktop.org/show_bug.cgi?id=110026
  [fdo#110036 ]: https://bugs.freedesktop.org/show_bug.cgi?id=110036 
  [fdo#110040 ]: https://bugs.freedesktop.org/show_bug.cgi?id=110040 
  [fdo#99912]: https://bugs.freedesktop.org/show_bug.cgi?id=99912
  [k.org#198133]: https://bugzilla.kernel.org/show_bug.cgi?id=198133


Participating hosts (10 -> 6)
------------------------------

  Missing    (4): pig-skl-6260u shard-skl pig-hsw-4770r pig-glk-j5005 


Build changes
-------------

  * IGT: IGT_4971 -> IGTPW_2937
  * Piglit: piglit_4509 -> None

  CI_DRM_6015: fef230161046ccb2ee8e5b6bf56ac85733f49af1 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGTPW_2937: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/
  IGT_4971: fc5e0467eb6913d21ad932aa8a31c77fdb5a9c77 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2937/
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-04-30 15:05   ` Daniel Vetter
@ 2019-05-06  8:44     ` Janusz Krzysztofik
  2019-05-06  9:21       ` Daniel Vetter
  0 siblings, 1 reply; 13+ messages in thread
From: Janusz Krzysztofik @ 2019-05-06  8:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Janusz Krzysztofik, Petri Latvala, igt-dev

Hi Daniel,

On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > 
> > Put some workload on a device, then try to either remove (unplug) the
> > device from its bus, or unbind the device's driver from it, possibly
> > followed by module unload, depending on which specific subtest has been
> > selected.  If succeeded, rescan the device's bus if needed and perform
> > health checks on the device with the driver possibly loaded back.
> > 
> > If module unload is requested, the workload is run in a sub-process,
> > not directly from the test, as it is expected to crash while still
> > keeping the device open for as long as its process has not exited.
> > 
> > The driver hot unbind / device hot unplug operation is expected to
> > succeed and the background workload sub-process to crash in a
> > reasonable time, however long timeouts are used to let kernel level
> > timeouts pop up first if hit by a bug.
> > 
> > The driver is ready for extending it with an arbitrary workload
> > functions as needed.  For now, a workload based on igt_dummyload is
> > implemented, hence subtests work only on i915 driver and are skipped on
> > other hardware, unless they provide their implementation of
> > igt_spin_new() and friends, or other workloads are implemented.
> > 
> > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> 
> High level comments and apologies that I didn't look at v2-v7 in between.
> 
> This all seems extremely complex for a simple batch spinner subtest ...

My initial intention was to build a simple hot unplug/unbind only test. I 
proposed to use an arbitrary external command as a workload.  Then, on 
Antonio's advice, I switched to the spinner based internal workload and I 
agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
the scope of the test with device recovery and health checking.  Also, a few 
people, including you, Daniel, requested availability of more workload type 
options.  As a result, I've decided to build a *framework* for testing driver 
unbind + rebind / device unplug + bus rescan behavior under different workload 
types, easily extendable with more workload options as needed, with one 
example workload type - dummy load or spin batch - initially implemented.  
That was at least my intention for v6-8.  I wouldn't call it a simple batch 
spinner subtest any longer.

> do we really need all that complexity with 2nd process 

If we drop module unload option then no, we don't need 2nd process.  

> and watchers 

That was primarily needed for successful module unload.  If we drop that 
option and you think driver rebind / bus rescan operations can be performed 
blindly, without checking for completion of background workload, then I can 
drop the watchers.

> and a bunch of callbacks and everything, just do to a hotremove testcase?

I can still drop the framework and switch back to the initial simple structure 
with one or two fixed subtests if you don't like my structural approach.

> Very first patch looked much more reasonable, aside from that it broke CI
> since it didn't rebind the driver. 

Sorry, my understanding of your and Petri's comments was a bit different, I 
thought that by more than best effort you meant doing everything possible to 
restore the device to be ready for next test without reboot, and module unload 
and reload seemed the most reliable option to me.  Now I can see that there 
were probably two different requirements.  You were considering the test 
incomplete because it was performing only the unbind/unplug part and not 
rebind/rescan, while Petri was probably interested mostly in the device being 
ready for next tests without reboot, no matter which way.

> We can always add complexity later on
> once we have dma-buf/dma-fence/kms/whatever else substests here.

OK, as you wish.

> Also, I think we should have at least one hotremove-only-nothing-special
> subtest here, i.e. without even the busy batch.

That seems trivial to adjust the framework so it accepted NULL workload, if 
the framework survived. Anyway, I'll do that.  Should I put it in a separate 
NULL workload subtest function to be called from igt_main?  Or add it to the 
spin workload subtest function specifically as an option?

> I'm also not sure why we also put module unload tests in there. 

As I tried to explain above, I introduced module unload in order to satisfy 
the CI requirement on the device being ready for next test without reboot as 
much as possible.

> Compared
> to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> or whatever), which is something users can do, module unload is explicitly a 
> developer only feature.

My approach was to be able to test driver behavior under any hot unload 
operation available to a user, no matter if developer oriented or not, so we 
can make the driver resistant to users performing potentially dangerous hot 
unbind/unplug operations available to them, intentionally or not.

> We do not expect module unload to work under all
> possible conditions (it doesn't). 

Do you think that driver rebind operation has more chances to succeed, 
especially on a device on which a bus unplug operation was not actually 
performed but only simulated via sysfs, on a device which then has been left 
in an unpredictable state and hasn't undergone a hardware power-on reset on 
physical bus re-plug?

> I'd drop that part and focus completely
> on the hotremove/unbind testcase here.

Driver unbind / device unplug via sysfs can also be considered developer only 
features. Do you think we should drop driver unbind option, leaving only 
device unplug via sysfs for which we may have no good non-developer 
alternative?

Thanks,
Janusz


> -Daniel
> 
> > ---
> >  tests/Makefile.sources  |   1 +
> >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> >  tests/meson.build       |   1 +
> >  3 files changed, 410 insertions(+)
> >  create mode 100644 tests/core_hot_reload.c
> > 
> > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > index 7f921f6c..452d8ed7 100644
> > --- a/tests/Makefile.sources
> > +++ b/tests/Makefile.sources
> > @@ -16,6 +16,7 @@ TESTS_progs = \
> >  	core_getclient \
> >  	core_getstats \
> >  	core_getversion \
> > +	core_hot_reload \
> >  	core_setmaster_vs_auth \
> >  	debugfs_test \
> >  	drm_import_export \
> > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > new file mode 100644
> > index 00000000..6673f55c
> > --- /dev/null
> > +++ b/tests/core_hot_reload.c
> > @@ -0,0 +1,408 @@
> > +/*
> > + * Copyright © 2019 Intel Corporation
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining 
a
> > + * copy of this software and associated documentation files (the 
"Software"),
> > + * to deal in the Software without restriction, including without 
limitation
> > + * the rights to use, copy, modify, merge, publish, distribute, 
sublicense,
> > + * and/or sell copies of the Software, and to permit persons to whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice (including the 
next
> > + * paragraph) shall be included in all copies or substantial portions of 
the
> > + * Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT 
SHALL
> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR 
OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
ARISING
> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
DEALINGS
> > + * IN THE SOFTWARE.
> > + */
> > +
> > +#include "igt.h"
> > +#include "igt_device.h"
> > +#include "igt_dummyload.h"
> > +#include "igt_kmod.h"
> > +#include "igt_sysfs.h"
> > +
> > +#include <getopt.h>
> > +#include <limits.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +
> > +#include <sys/types.h>
> > +#include <sys/wait.h>
> > +
> > +/**
> > + * A post-action device recovery function:
> > + * @priv: a pointer to private data required for device recovery
> > + *
> > + * Make the device re-appear
> > + */
> > +typedef void (*recover_t)(const void *priv);
> > +
> > +/**
> > + * A test action function:
> > + * @dir: file descriptor of an open device sysfs directory
> > + * @module: module name, non-NULL indicates post-action module unload 
requested
> > + * @recover: for returning a pointer to a post-action device recovery 
function
> > + * @priv: for returning a pointer to data to be passed to @recover
> > + *
> > + * Make the device disappear
> > + */
> > +typedef void (*action_t)(int device, const char *module,
> > +			 recover_t *recover, const void **priv);
> > +
> > +/**
> > + * A workload completion wait function:
> > + * @device: open device file descriptor
> > + * @priv: a pointer to private data required by the wait function
> > + *
> > + * Wait for completion of background workload
> > + */
> > +typedef void (*workload_wait_t)(int device, void *priv);
> > +
> > +/**
> > + * A workload function:
> > + * @device: open device file descriptor
> > + * @arg: a optional string argument passed to the workload function
> > + * @workload_wait: for returning a pointer to workload completion wait 
function
> > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > + *
> > + * Put some long lasting load on the device
> > + */
> > +typedef void (*workload_t)(int device, const char *arg,
> > +			   workload_wait_t *workload_wait, void 
**priv);
> > +
> > +/**
> > + * Pairs of test action / device recovery functions
> > + */
> > +
> > +/* Unbind / re-bind */
> > +
> > +struct rebind_data {
> > +	int driver;	/* open file descriptor of driver sysfs directory */
> > +	char *device;	/* bus specific device address as string */
> > +};
> > +
> > +/* Re-bind the driver to the device */
> > +static void driver_bind(const void *priv)
> > +{
> > +	const struct rebind_data *data = priv;
> > +
> > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > +	igt_sysfs_set(data->driver, "bind", data->device);
> > +
> > +	close(data->driver);
> > +}
> > +
> > +/* Unbind the driver from the device */
> > +static void driver_unbind(int device, const char *module,
> > +			  recover_t *recover, const void **priv)
> > +{
> > +	static char path[PATH_MAX];
> > +	static struct rebind_data data;
> > +	int len;
> > +
> > +	/* collect information required for driver bind/unbind */
> > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > +	igt_assert(data.driver >= 0);
> > +
> > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > +	path[len] = '\0';
> > +	data.device = strrchr(path, '/') + 1;
> > +
> > +	/* unbind the driver */
> > +	igt_set_timeout(60, "Driver unbind timeout!");
> > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > +
> > +	/* pass back info on how to recover the device */
> > +	if (module) {
> > +		/* don't try to rebind if module will be unloaded */
> > +		*recover = NULL;
> > +	} else {
> > +		*recover = driver_bind;
> > +		*priv = &data;
> > +	}
> > +}
> > +
> > +/* Unplug / re-plug */
> > +
> > +/* Re-discover the device by rescanning its bus */
> > +static void bus_rescan(const void *priv)
> > +{
> > +	const int *bus = priv;
> > +
> > +	igt_set_timeout(60, "Bus rescan timeout!");
> > +	igt_sysfs_set(*bus, "rescan", "1");
> > +
> > +	close(*bus);
> > +}
> > +
> > +/* Remove (virtually unplug) the device from its bus */
> > +static void device_unplug(int device, const char *module,
> > +			  recover_t *recover, const void **priv)
> > +{
> > +	static int bus;
> > +
> > +	/* collect information required for bus rescan */
> > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > +	igt_assert(bus >= 0);
> > +
> > +	/* remove the device */
> > +	igt_set_timeout(60, "Device unplug timeout!");
> > +	igt_sysfs_set(device, "device/remove", "1");
> > +
> > +	/* pass back info on how to recover the device */
> > +	*recover = bus_rescan;
> > +	*priv = &bus;
> > +}
> > +
> > +/* Each test action function must be registered in the following table */
> > +static const struct {
> > +	const char *name;	/* unique test action name used in test 
names */
> > +	action_t function;	/* test action function pointer */
> > +} actions[] = {
> > +	{ "unbind", driver_unbind, },
> > +	{ "unplug", device_unplug, },
> > +};
> > +
> > +/**
> > + * Pairs of workload / wait completion functions
> > + */
> > +
> > +/* A workload using igt_spin_run() */
> > +
> > +/* Wait for completaion of dummy load */
> > +static void dummy_wait(int device, void *priv)
> > +{
> > +	igt_spin_t *spin = priv;
> > +
> > +	/* wait until the spin no longer runs, don't fail on error */
> > +	if (gem_wait(device, spin->handle, NULL))
> > +		__gem_set_domain(device, spin->handle,
> > +				 I915_GEM_DOMAIN_GTT, 
I915_GEM_DOMAIN_GTT);
> > +}
> > +
> > +/* Run dummy load */
> > +static void dummy_load(int device, const char *arg,
> > +		       workload_wait_t *workload_wait, void **priv)
> > +{
> > +	igt_spin_t *spin;
> > +
> > +	/* submit a job */
> > +	spin = igt_spin_new(device);
> > +
> > +	*workload_wait = dummy_wait;
> > +	*priv = spin;
> > +}
> > +
> > +/**
> > + * Each workload function must be registered in the following table.
> > + * A function may be registered more than once under different workload 
names,
> > + * that makes sense as long as a different argument is specified for each 
name.
> > + */
> > +static const struct {
> > +	const char *name;	/* unique workload name used in test names 
*/
> > +	workload_t function;	/* workload function pointer */
> > +	const char *arg;	/* optional constant string argument */
> > +} workloads[] = {
> > +	{ "spin", dummy_load, NULL, },
> > +};
> > +
> > +/**
> > + * Framework
> > + */
> > +
> > +static void healthcheck(int chipset)
> > +{
> > +	int device;
> > +
> > +	device = __drm_open_driver(chipset);
> > +	igt_assert(device >= 0);
> > +
> > +	if (chipset == DRIVER_INTEL)
> > +		gem_test_engine(device, ALL_ENGINES);
> > +
> > +	close(device);
> > +}
> > +
> > +static void module_unload(int chipset, const char *module)
> > +{
> > +	if (chipset == DRIVER_INTEL)
> > +		igt_assert(igt_i915_driver_unload() == 
IGT_EXIT_SUCCESS);
> > +	else
> > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > +}
> > +
> > +static void run_action(int device, action_t action, const char *module,
> > +		      recover_t *recover, const void **priv)
> > +{
> > +	int dir;
> > +
> > +	dir = igt_sysfs_open(device);
> > +	igt_assert(dir >= 0);
> > +
> > +	action(dir, module, recover, priv);
> > +
> > +	close(dir);
> > +}
> > +
> > +static void wait_helper(int device, void *priv)
> > +{
> > +	struct igt_helper_process *proc = priv;
> > +
> > +	/* wait until the workload subprocess has completed */
> > +	igt_ignore_warn(igt_wait_helper(proc));
> > +}
> > +
> > +static void run_workload(int device, workload_t workload, const char 
*arg,
> > +			 const char *module, workload_wait_t 
*workload_wait,
> > +			 void **priv)
> > +{
> > +	if (module) {
> > +		/* run workload in a subprocess so the module is put on 
crash */
> > +		static struct igt_helper_process proc;
> > +		int wstatus, ret;
> > +
> > +		bzero(&proc, sizeof(proc));
> > +
> > +		igt_fork_helper(&proc) {
> > +			/* suppress igt_log messages */
> > +			igt_log_level = IGT_LOG_NONE;
> > +
> > +			/* intercept igt_fail/skip() long jumps */
> > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) 
{
> > +				workload(device, arg, 
workload_wait, priv);
> > +
> > +				(*workload_wait)(device, 
*priv);
> > +
> > +				/* success if not diverted by 
igt_fail/skip() */
> > +				igt_success();
> > +			}
> > +
> > +			/* pass exit code back to the caller */
> > +			igt_exit();
> > +		}
> > +		/* let the background process start doing its job or 
fail */
> > +		sleep(2);
> > +		/* fail or skip on workload premature completion */
> > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > +		if (ret < 0)
> > +			igt_fail(IGT_EXIT_INVALID);
> > +		if (ret) {
> > +			if (!WIFEXITED(wstatus))
> > +				igt_fail(IGT_EXIT_INVALID);
> > +			if (WEXITSTATUS(wstatus) == 
IGT_EXIT_SUCCESS)
> > +				igt_fail(IGT_EXIT_INVALID);
> > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > +				igt_skip(NULL);
> > +			igt_fail(WEXITSTATUS(wstatus));
> > +		}
> > +
> > +		/* pass back info on how to wait for helper completion 
*/
> > +		*workload_wait = wait_helper;
> > +		*priv = &proc;
> > +	} else {
> > +		/* run the requested workload directly */
> > +		workload(device, arg, workload_wait, priv);
> > +	}
> > +}
> > +
> > +static void run_subtest(int chipset, int workload, int action,
> > +			const char *module)
> > +{
> > +	workload_wait_t workload_wait;
> > +	void *workload_priv;
> > +	recover_t recover;
> > +	const void *recover_priv;
> > +	int device;
> > +
> > +	igt_subtest_f("%s-%s%s", workloads[workload].name, 
actions[action].name,
> > +		      module ? "-unload" : "") {
> > +		device = __drm_open_driver(chipset);
> > +		igt_assert(device >= 0);
> > +
> > +		/* spawn the requested workload */
> > +		igt_debug("spawning background workload\n");
> > +		run_workload(device, workloads[workload].function,
> > +			     workloads[workload].arg, module,
> > +			     &workload_wait, &workload_priv);
> > +
> > +		/* run the requested test action */
> > +		igt_debug("running test action\n");
> > +		run_action(device, actions[action].function, module,
> > +			   &recover, &recover_priv);
> > +
> > +		if (workload_wait) {
> > +			igt_debug("waiting for workload 
completion\n");
> > +			workload_wait(device, workload_priv);
> > +		}
> > +
> > +		close(device);
> > +
> > +		if (module) {
> > +			igt_debug("unloading %s\n", module);
> > +			module_unload(chipset, module);
> > +		}
> > +
> > +		if (recover) {
> > +			igt_debug("recovering device\n");
> > +			recover(recover_priv);
> > +			igt_reset_timeout();
> > +		}
> > +
> > +		igt_debug("running healthcheck\n");
> > +		healthcheck(chipset);
> > +	}
> > +}
> > +
> > +igt_main {
> > +	int device, chipset;
> > +	char *module;
> > +	int i, j;
> > +
> > +	igt_fixture {
> > +		char path[PATH_MAX];
> > +		int dir, len;
> > +
> > +		/**
> > +		 * Since some subtests depend on successful unload of a 
driver
> > +		 * module, don't use drm_open_driver() as it keeps a 
device file
> > +		 * descriptor open for exit handler use and that 
effectively
> > +		 * prevents the module from being unloaded.
> > +		 */
> > +		device = __drm_open_driver(DRIVER_ANY);
> > +		igt_assert(device >= 0);
> > +
> > +		if (is_i915_device(device)) {
> > +			chipset = DRIVER_INTEL;
> > +			module = strdup("i915");
> > +		} else {
> > +			chipset = DRIVER_ANY;
> > +
> > +			/* Capture module name to be unloaded */
> > +			dir = igt_sysfs_open(device);
> > +			len = readlinkat(dir, "device/driver/
module", path,
> > +					 sizeof(path) - 1);
> > +			close(dir);
> > +			path[len] = '\0';
> > +			module = strdup(strrchr(path, '/') + 1);
> > +		}
> > +		close(device);
> > +
> > +		igt_info("Running the test on driver \"%s\", chipset 
mask %#0x\n",
> > +			 module, chipset);
> > +	}
> > +
> > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
+) {
> > +			/* with module unload */
> > +			run_subtest(chipset, i, j, module);
> > +			/* without module unload */
> > +			run_subtest(chipset, i, j, NULL);
> > +		}
> > +	}
> > +}
> > diff --git a/tests/meson.build b/tests/meson.build
> > index 711979b4..0d418035 100644
> > --- a/tests/meson.build
> > +++ b/tests/meson.build
> > @@ -3,6 +3,7 @@ test_progs = [
> >  	'core_getclient',
> >  	'core_getstats',
> >  	'core_getversion',
> > +	'core_hot_reload',
> >  	'core_setmaster_vs_auth',
> >  	'debugfs_test',
> >  	'drm_import_export',
> 
> 




_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-06  8:44     ` Janusz Krzysztofik
@ 2019-05-06  9:21       ` Daniel Vetter
  2019-05-07  6:24         ` Janusz Krzysztofik
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2019-05-06  9:21 UTC (permalink / raw)
  To: Janusz Krzysztofik
  Cc: Janusz Krzysztofik, Petri Latvala, igt-dev, Daniel Vetter

On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> Hi Daniel,
> 
> On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > 
> > > Put some workload on a device, then try to either remove (unplug) the
> > > device from its bus, or unbind the device's driver from it, possibly
> > > followed by module unload, depending on which specific subtest has been
> > > selected.  If succeeded, rescan the device's bus if needed and perform
> > > health checks on the device with the driver possibly loaded back.
> > > 
> > > If module unload is requested, the workload is run in a sub-process,
> > > not directly from the test, as it is expected to crash while still
> > > keeping the device open for as long as its process has not exited.
> > > 
> > > The driver hot unbind / device hot unplug operation is expected to
> > > succeed and the background workload sub-process to crash in a
> > > reasonable time, however long timeouts are used to let kernel level
> > > timeouts pop up first if hit by a bug.
> > > 
> > > The driver is ready for extending it with an arbitrary workload
> > > functions as needed.  For now, a workload based on igt_dummyload is
> > > implemented, hence subtests work only on i915 driver and are skipped on
> > > other hardware, unless they provide their implementation of
> > > igt_spin_new() and friends, or other workloads are implemented.
> > > 
> > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > 
> > High level comments and apologies that I didn't look at v2-v7 in between.
> > 
> > This all seems extremely complex for a simple batch spinner subtest ...
> 
> My initial intention was to build a simple hot unplug/unbind only test. I 
> proposed to use an arbitrary external command as a workload.  Then, on 
> Antonio's advice, I switched to the spinner based internal workload and I 
> agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
> the scope of the test with device recovery and health checking.  Also, a few 
> people, including you, Daniel, requested availability of more workload type 
> options.  As a result, I've decided to build a *framework* for testing driver 
> unbind + rebind / device unplug + bus rescan behavior under different workload 
> types, easily extendable with more workload options as needed, with one 
> example workload type - dummy load or spin batch - initially implemented.  
> That was at least my intention for v6-8.  I wouldn't call it a simple batch 
> spinner subtest any longer.

Maybe my review got wrong, but I just meant that there's more tests to
write here. Generally I think having the framework/generic solution before
you have all the applications is the wrong way to build something. Usually
it results in something which is generic in all the wrong ways, but not in
the ones you will actually need. So complexity with no gain. Better to
- add a few tests first with copypasting/minimal changes
- refactor helpers once you see the real patterns
- no framework, that's the midlayer mistake, see
  https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
  all the articles linked from there.

> > do we really need all that complexity with 2nd process 
> 
> If we drop module unload option then no, we don't need 2nd process.  

Why does module unload require a 2nd process? We don't need a 2nd process
in our other module unload tests either.

> > and watchers 
> 
> That was primarily needed for successful module unload.  If we drop that 
> option and you think driver rebind / bus rescan operations can be performed 
> blindly, without checking for completion of background workload, then I can 
> drop the watchers.

Well we _have_ to do unbinds without checking the background workload has
completed. That's the entire point of testing hotunplug. It's also why
there's lots of work to do here, because the kernel is totally not ready
for this.

First stopping everything and then unloading isn't an interesting test,
that's more or less exactly what our various module unload tests are
doing already.

> > and a bunch of callbacks and everything, just do to a hotremove testcase?
> 
> I can still drop the framework and switch back to the initial simple structure 
> with one or two fixed subtests if you don't like my structural approach.

See above for why, I think that will result in better code in the end.

> > Very first patch looked much more reasonable, aside from that it broke CI
> > since it didn't rebind the driver. 
> 
> Sorry, my understanding of your and Petri's comments was a bit different, I 
> thought that by more than best effort you meant doing everything possible to 
> restore the device to be ready for next test without reboot, and module unload 
> and reload seemed the most reliable option to me.  Now I can see that there 
> were probably two different requirements.  You were considering the test 
> incomplete because it was performing only the unbind/unplug part and not 
> rebind/rescan, while Petri was probably interested mostly in the device being 
> ready for next tests without reboot, no matter which way.

Well it's the same request, and rebind/rescan /should/ result in a working
device again. If not, then I guess we also have a bug on our hotreplug
code. Which again is worth testing for.

> > We can always add complexity later on
> > once we have dma-buf/dma-fence/kms/whatever else substests here.
> 
> OK, as you wish.
> 
> > Also, I think we should have at least one hotremove-only-nothing-special
> > subtest here, i.e. without even the busy batch.
> 
> That seems trivial to adjust the framework so it accepted NULL workload, if 
> the framework survived. Anyway, I'll do that.  Should I put it in a separate 
> NULL workload subtest function to be called from igt_main?  Or add it to the 
> spin workload subtest function specifically as an option?

Separate test as the first subtest. Maybe even include the "shut
everything off first" logic from module unload, to have the most baseline
test possible.

> > I'm also not sure why we also put module unload tests in there. 
> 
> As I tried to explain above, I introduced module unload in order to satisfy 
> the CI requirement on the device being ready for next test without reboot as 
> much as possible.

Hm, but why? What does module reload help in this regard that a rebind
can't do? Aside from testing module reload, which is a developer feature
and already tested elsewhere.

I'm also not seeing much interactions between hotunplug and module unload.

The one interesting testcase I see is trying to unload the module after we
hotunplugged, while the driver is still in use somewhere (open drm fd,
open dma-buf fd, open dma-fence fd). That should result in a failure, and
it's useful to validate that the kernel is handling the module refcounting
correctly in all these cases. But that's a specific negative testcase (and
actually being able to unload would be a failure and likely result in a
kernel oops), I'm not seeing the benefit of reloading the module.

> > Compared
> > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > or whatever), which is something users can do, module unload is explicitly a 
> > developer only feature.
> 
> My approach was to be able to test driver behavior under any hot unload 
> operation available to a user, no matter if developer oriented or not, so we 
> can make the driver resistant to users performing potentially dangerous hot 
> unbind/unplug operations available to them, intentionally or not.

Yes I agree with that, we need to test hotunplug.

btw the real fun isn't the unbind in sysfs, but physically unplugging a
pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
and the best way to test that hotunplug is through the sysfs unbind
support (it's not exactly the same since this way we'll never see failing
pci transactions, which are an entirely different kind of fun).

> > We do not expect module unload to work under all
> > possible conditions (it doesn't). 
> 
> Do you think that driver rebind operation has more chances to succeed, 
> especially on a device on which a bus unplug operation was not actually 
> performed but only simulated via sysfs, on a device which then has been left 
> in an unpredictable state and hasn't undergone a hardware power-on reset on 
> physical bus re-plug?

There's definitely potential for bugs, but I don't see how module reload
helps. Module reload is essentially:

- unbind devices
- unload module
- reload module
- rebind all devices

The only additional magic that module unload can paper over is that it's
disallowed while anyone is still using any devices (assuming the module
refcount code is correct). That's not the case for unbind/hotunplug. But
that's it, there's no additional magic code being run when you unload the
module. Hence why I don't understand why you want to do that.

> > I'd drop that part and focus completely
> > on the hotremove/unbind testcase here.
> 
> Driver unbind / device unplug via sysfs can also be considered developer only 
> features. Do you think we should drop driver unbind option, leaving only 
> device unplug via sysfs for which we may have no good non-developer 
> alternative?

Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
user action. That's why we care.

We didn't care for unbind (I wontfix closed all the bugs myself) while
intel only created built-in gpus because it's indeed fairly pointless to
unbind these.

Other bit I don't quite get: What's the difference between unbind and
unplug?
-Daniel

> 
> Thanks,
> Janusz
> 
> 
> > -Daniel
> > 
> > > ---
> > >  tests/Makefile.sources  |   1 +
> > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > >  tests/meson.build       |   1 +
> > >  3 files changed, 410 insertions(+)
> > >  create mode 100644 tests/core_hot_reload.c
> > > 
> > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > index 7f921f6c..452d8ed7 100644
> > > --- a/tests/Makefile.sources
> > > +++ b/tests/Makefile.sources
> > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > >  	core_getclient \
> > >  	core_getstats \
> > >  	core_getversion \
> > > +	core_hot_reload \
> > >  	core_setmaster_vs_auth \
> > >  	debugfs_test \
> > >  	drm_import_export \
> > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > new file mode 100644
> > > index 00000000..6673f55c
> > > --- /dev/null
> > > +++ b/tests/core_hot_reload.c
> > > @@ -0,0 +1,408 @@
> > > +/*
> > > + * Copyright © 2019 Intel Corporation
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining 
> a
> > > + * copy of this software and associated documentation files (the 
> "Software"),
> > > + * to deal in the Software without restriction, including without 
> limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, 
> sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice (including the 
> next
> > > + * paragraph) shall be included in all copies or substantial portions of 
> the
> > > + * Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
> EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
> MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT 
> SHALL
> > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR 
> OTHER
> > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
> ARISING
> > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
> DEALINGS
> > > + * IN THE SOFTWARE.
> > > + */
> > > +
> > > +#include "igt.h"
> > > +#include "igt_device.h"
> > > +#include "igt_dummyload.h"
> > > +#include "igt_kmod.h"
> > > +#include "igt_sysfs.h"
> > > +
> > > +#include <getopt.h>
> > > +#include <limits.h>
> > > +#include <string.h>
> > > +#include <unistd.h>
> > > +
> > > +#include <sys/types.h>
> > > +#include <sys/wait.h>
> > > +
> > > +/**
> > > + * A post-action device recovery function:
> > > + * @priv: a pointer to private data required for device recovery
> > > + *
> > > + * Make the device re-appear
> > > + */
> > > +typedef void (*recover_t)(const void *priv);
> > > +
> > > +/**
> > > + * A test action function:
> > > + * @dir: file descriptor of an open device sysfs directory
> > > + * @module: module name, non-NULL indicates post-action module unload 
> requested
> > > + * @recover: for returning a pointer to a post-action device recovery 
> function
> > > + * @priv: for returning a pointer to data to be passed to @recover
> > > + *
> > > + * Make the device disappear
> > > + */
> > > +typedef void (*action_t)(int device, const char *module,
> > > +			 recover_t *recover, const void **priv);
> > > +
> > > +/**
> > > + * A workload completion wait function:
> > > + * @device: open device file descriptor
> > > + * @priv: a pointer to private data required by the wait function
> > > + *
> > > + * Wait for completion of background workload
> > > + */
> > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > +
> > > +/**
> > > + * A workload function:
> > > + * @device: open device file descriptor
> > > + * @arg: a optional string argument passed to the workload function
> > > + * @workload_wait: for returning a pointer to workload completion wait 
> function
> > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > + *
> > > + * Put some long lasting load on the device
> > > + */
> > > +typedef void (*workload_t)(int device, const char *arg,
> > > +			   workload_wait_t *workload_wait, void 
> **priv);
> > > +
> > > +/**
> > > + * Pairs of test action / device recovery functions
> > > + */
> > > +
> > > +/* Unbind / re-bind */
> > > +
> > > +struct rebind_data {
> > > +	int driver;	/* open file descriptor of driver sysfs directory */
> > > +	char *device;	/* bus specific device address as string */
> > > +};
> > > +
> > > +/* Re-bind the driver to the device */
> > > +static void driver_bind(const void *priv)
> > > +{
> > > +	const struct rebind_data *data = priv;
> > > +
> > > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > > +	igt_sysfs_set(data->driver, "bind", data->device);
> > > +
> > > +	close(data->driver);
> > > +}
> > > +
> > > +/* Unbind the driver from the device */
> > > +static void driver_unbind(int device, const char *module,
> > > +			  recover_t *recover, const void **priv)
> > > +{
> > > +	static char path[PATH_MAX];
> > > +	static struct rebind_data data;
> > > +	int len;
> > > +
> > > +	/* collect information required for driver bind/unbind */
> > > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > +	igt_assert(data.driver >= 0);
> > > +
> > > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > +	path[len] = '\0';
> > > +	data.device = strrchr(path, '/') + 1;
> > > +
> > > +	/* unbind the driver */
> > > +	igt_set_timeout(60, "Driver unbind timeout!");
> > > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > > +
> > > +	/* pass back info on how to recover the device */
> > > +	if (module) {
> > > +		/* don't try to rebind if module will be unloaded */
> > > +		*recover = NULL;
> > > +	} else {
> > > +		*recover = driver_bind;
> > > +		*priv = &data;
> > > +	}
> > > +}
> > > +
> > > +/* Unplug / re-plug */
> > > +
> > > +/* Re-discover the device by rescanning its bus */
> > > +static void bus_rescan(const void *priv)
> > > +{
> > > +	const int *bus = priv;
> > > +
> > > +	igt_set_timeout(60, "Bus rescan timeout!");
> > > +	igt_sysfs_set(*bus, "rescan", "1");
> > > +
> > > +	close(*bus);
> > > +}
> > > +
> > > +/* Remove (virtually unplug) the device from its bus */
> > > +static void device_unplug(int device, const char *module,
> > > +			  recover_t *recover, const void **priv)
> > > +{
> > > +	static int bus;
> > > +
> > > +	/* collect information required for bus rescan */
> > > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > +	igt_assert(bus >= 0);
> > > +
> > > +	/* remove the device */
> > > +	igt_set_timeout(60, "Device unplug timeout!");
> > > +	igt_sysfs_set(device, "device/remove", "1");
> > > +
> > > +	/* pass back info on how to recover the device */
> > > +	*recover = bus_rescan;
> > > +	*priv = &bus;
> > > +}
> > > +
> > > +/* Each test action function must be registered in the following table */
> > > +static const struct {
> > > +	const char *name;	/* unique test action name used in test 
> names */
> > > +	action_t function;	/* test action function pointer */
> > > +} actions[] = {
> > > +	{ "unbind", driver_unbind, },
> > > +	{ "unplug", device_unplug, },
> > > +};
> > > +
> > > +/**
> > > + * Pairs of workload / wait completion functions
> > > + */
> > > +
> > > +/* A workload using igt_spin_run() */
> > > +
> > > +/* Wait for completaion of dummy load */
> > > +static void dummy_wait(int device, void *priv)
> > > +{
> > > +	igt_spin_t *spin = priv;
> > > +
> > > +	/* wait until the spin no longer runs, don't fail on error */
> > > +	if (gem_wait(device, spin->handle, NULL))
> > > +		__gem_set_domain(device, spin->handle,
> > > +				 I915_GEM_DOMAIN_GTT, 
> I915_GEM_DOMAIN_GTT);
> > > +}
> > > +
> > > +/* Run dummy load */
> > > +static void dummy_load(int device, const char *arg,
> > > +		       workload_wait_t *workload_wait, void **priv)
> > > +{
> > > +	igt_spin_t *spin;
> > > +
> > > +	/* submit a job */
> > > +	spin = igt_spin_new(device);
> > > +
> > > +	*workload_wait = dummy_wait;
> > > +	*priv = spin;
> > > +}
> > > +
> > > +/**
> > > + * Each workload function must be registered in the following table.
> > > + * A function may be registered more than once under different workload 
> names,
> > > + * that makes sense as long as a different argument is specified for each 
> name.
> > > + */
> > > +static const struct {
> > > +	const char *name;	/* unique workload name used in test names 
> */
> > > +	workload_t function;	/* workload function pointer */
> > > +	const char *arg;	/* optional constant string argument */
> > > +} workloads[] = {
> > > +	{ "spin", dummy_load, NULL, },
> > > +};
> > > +
> > > +/**
> > > + * Framework
> > > + */
> > > +
> > > +static void healthcheck(int chipset)
> > > +{
> > > +	int device;
> > > +
> > > +	device = __drm_open_driver(chipset);
> > > +	igt_assert(device >= 0);
> > > +
> > > +	if (chipset == DRIVER_INTEL)
> > > +		gem_test_engine(device, ALL_ENGINES);
> > > +
> > > +	close(device);
> > > +}
> > > +
> > > +static void module_unload(int chipset, const char *module)
> > > +{
> > > +	if (chipset == DRIVER_INTEL)
> > > +		igt_assert(igt_i915_driver_unload() == 
> IGT_EXIT_SUCCESS);
> > > +	else
> > > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > > +}
> > > +
> > > +static void run_action(int device, action_t action, const char *module,
> > > +		      recover_t *recover, const void **priv)
> > > +{
> > > +	int dir;
> > > +
> > > +	dir = igt_sysfs_open(device);
> > > +	igt_assert(dir >= 0);
> > > +
> > > +	action(dir, module, recover, priv);
> > > +
> > > +	close(dir);
> > > +}
> > > +
> > > +static void wait_helper(int device, void *priv)
> > > +{
> > > +	struct igt_helper_process *proc = priv;
> > > +
> > > +	/* wait until the workload subprocess has completed */
> > > +	igt_ignore_warn(igt_wait_helper(proc));
> > > +}
> > > +
> > > +static void run_workload(int device, workload_t workload, const char 
> *arg,
> > > +			 const char *module, workload_wait_t 
> *workload_wait,
> > > +			 void **priv)
> > > +{
> > > +	if (module) {
> > > +		/* run workload in a subprocess so the module is put on 
> crash */
> > > +		static struct igt_helper_process proc;
> > > +		int wstatus, ret;
> > > +
> > > +		bzero(&proc, sizeof(proc));
> > > +
> > > +		igt_fork_helper(&proc) {
> > > +			/* suppress igt_log messages */
> > > +			igt_log_level = IGT_LOG_NONE;
> > > +
> > > +			/* intercept igt_fail/skip() long jumps */
> > > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) 
> {
> > > +				workload(device, arg, 
> workload_wait, priv);
> > > +
> > > +				(*workload_wait)(device, 
> *priv);
> > > +
> > > +				/* success if not diverted by 
> igt_fail/skip() */
> > > +				igt_success();
> > > +			}
> > > +
> > > +			/* pass exit code back to the caller */
> > > +			igt_exit();
> > > +		}
> > > +		/* let the background process start doing its job or 
> fail */
> > > +		sleep(2);
> > > +		/* fail or skip on workload premature completion */
> > > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > +		if (ret < 0)
> > > +			igt_fail(IGT_EXIT_INVALID);
> > > +		if (ret) {
> > > +			if (!WIFEXITED(wstatus))
> > > +				igt_fail(IGT_EXIT_INVALID);
> > > +			if (WEXITSTATUS(wstatus) == 
> IGT_EXIT_SUCCESS)
> > > +				igt_fail(IGT_EXIT_INVALID);
> > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > +				igt_skip(NULL);
> > > +			igt_fail(WEXITSTATUS(wstatus));
> > > +		}
> > > +
> > > +		/* pass back info on how to wait for helper completion 
> */
> > > +		*workload_wait = wait_helper;
> > > +		*priv = &proc;
> > > +	} else {
> > > +		/* run the requested workload directly */
> > > +		workload(device, arg, workload_wait, priv);
> > > +	}
> > > +}
> > > +
> > > +static void run_subtest(int chipset, int workload, int action,
> > > +			const char *module)
> > > +{
> > > +	workload_wait_t workload_wait;
> > > +	void *workload_priv;
> > > +	recover_t recover;
> > > +	const void *recover_priv;
> > > +	int device;
> > > +
> > > +	igt_subtest_f("%s-%s%s", workloads[workload].name, 
> actions[action].name,
> > > +		      module ? "-unload" : "") {
> > > +		device = __drm_open_driver(chipset);
> > > +		igt_assert(device >= 0);
> > > +
> > > +		/* spawn the requested workload */
> > > +		igt_debug("spawning background workload\n");
> > > +		run_workload(device, workloads[workload].function,
> > > +			     workloads[workload].arg, module,
> > > +			     &workload_wait, &workload_priv);
> > > +
> > > +		/* run the requested test action */
> > > +		igt_debug("running test action\n");
> > > +		run_action(device, actions[action].function, module,
> > > +			   &recover, &recover_priv);
> > > +
> > > +		if (workload_wait) {
> > > +			igt_debug("waiting for workload 
> completion\n");
> > > +			workload_wait(device, workload_priv);
> > > +		}
> > > +
> > > +		close(device);
> > > +
> > > +		if (module) {
> > > +			igt_debug("unloading %s\n", module);
> > > +			module_unload(chipset, module);
> > > +		}
> > > +
> > > +		if (recover) {
> > > +			igt_debug("recovering device\n");
> > > +			recover(recover_priv);
> > > +			igt_reset_timeout();
> > > +		}
> > > +
> > > +		igt_debug("running healthcheck\n");
> > > +		healthcheck(chipset);
> > > +	}
> > > +}
> > > +
> > > +igt_main {
> > > +	int device, chipset;
> > > +	char *module;
> > > +	int i, j;
> > > +
> > > +	igt_fixture {
> > > +		char path[PATH_MAX];
> > > +		int dir, len;
> > > +
> > > +		/**
> > > +		 * Since some subtests depend on successful unload of a 
> driver
> > > +		 * module, don't use drm_open_driver() as it keeps a 
> device file
> > > +		 * descriptor open for exit handler use and that 
> effectively
> > > +		 * prevents the module from being unloaded.
> > > +		 */
> > > +		device = __drm_open_driver(DRIVER_ANY);
> > > +		igt_assert(device >= 0);
> > > +
> > > +		if (is_i915_device(device)) {
> > > +			chipset = DRIVER_INTEL;
> > > +			module = strdup("i915");
> > > +		} else {
> > > +			chipset = DRIVER_ANY;
> > > +
> > > +			/* Capture module name to be unloaded */
> > > +			dir = igt_sysfs_open(device);
> > > +			len = readlinkat(dir, "device/driver/
> module", path,
> > > +					 sizeof(path) - 1);
> > > +			close(dir);
> > > +			path[len] = '\0';
> > > +			module = strdup(strrchr(path, '/') + 1);
> > > +		}
> > > +		close(device);
> > > +
> > > +		igt_info("Running the test on driver \"%s\", chipset 
> mask %#0x\n",
> > > +			 module, chipset);
> > > +	}
> > > +
> > > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
> +) {
> > > +			/* with module unload */
> > > +			run_subtest(chipset, i, j, module);
> > > +			/* without module unload */
> > > +			run_subtest(chipset, i, j, NULL);
> > > +		}
> > > +	}
> > > +}
> > > diff --git a/tests/meson.build b/tests/meson.build
> > > index 711979b4..0d418035 100644
> > > --- a/tests/meson.build
> > > +++ b/tests/meson.build
> > > @@ -3,6 +3,7 @@ test_progs = [
> > >  	'core_getclient',
> > >  	'core_getstats',
> > >  	'core_getversion',
> > > +	'core_hot_reload',
> > >  	'core_setmaster_vs_auth',
> > >  	'debugfs_test',
> > >  	'drm_import_export',
> > 
> > 
> 
> 
> 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-06  9:21       ` Daniel Vetter
@ 2019-05-07  6:24         ` Janusz Krzysztofik
  2019-05-07  9:14           ` Daniel Vetter
  0 siblings, 1 reply; 13+ messages in thread
From: Janusz Krzysztofik @ 2019-05-07  6:24 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Petri Latvala, igt-dev

On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > Hi Daniel,
> > 
> > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > 
> > > > Put some workload on a device, then try to either remove (unplug) the
> > > > device from its bus, or unbind the device's driver from it, possibly
> > > > followed by module unload, depending on which specific subtest has been
> > > > selected.  If succeeded, rescan the device's bus if needed and perform
> > > > health checks on the device with the driver possibly loaded back.
> > > > 
> > > > If module unload is requested, the workload is run in a sub-process,
> > > > not directly from the test, as it is expected to crash while still
> > > > keeping the device open for as long as its process has not exited.
> > > > 
> > > > The driver hot unbind / device hot unplug operation is expected to
> > > > succeed and the background workload sub-process to crash in a
> > > > reasonable time, however long timeouts are used to let kernel level
> > > > timeouts pop up first if hit by a bug.
> > > > 
> > > > The driver is ready for extending it with an arbitrary workload
> > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > other hardware, unless they provide their implementation of
> > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > 
> > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > 
> > > High level comments and apologies that I didn't look at v2-v7 in between.

v1-v3 were submitted internally, so you actually joined and commented first on 
first public submission which I should have had marked as v4 (I hadn't, 
sorry).

> > > 
> > > This all seems extremely complex for a simple batch spinner subtest ...
> > 
> > My initial intention was to build a simple hot unplug/unbind only test. I 
> > proposed to use an arbitrary external command as a workload.  Then, on 
> > Antonio's advice, I switched to the spinner based internal workload and I 
> > agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
> > the scope of the test with device recovery and health checking.  Also, a few 
> > people, including you, Daniel, requested availability of more workload type 
> > options.  As a result, I've decided to build a *framework* for testing driver 
> > unbind + rebind / device unplug + bus rescan behavior under different workload 
> > types, easily extendable with more workload options as needed, with one 
> > example workload type - dummy load or spin batch - initially implemented.  
> > That was at least my intention for v6-8.  I wouldn't call it a simple batch 
> > spinner subtest any longer.
> 
> Maybe my review got wrong, but I just meant that there's more tests to
> write here.

That was clear for me, however I probably misunderstood your intentions in 
regard to device/driver recovery after successful unplug/unbind.

> Generally I think having the framework/generic solution before
> you have all the applications is the wrong way to build something. Usually
> it results in something which is generic in all the wrong ways, but not in
> the ones you will actually need. So complexity with no gain. Better to
> - add a few tests first with copypasting/minimal changes
> - refactor helpers once you see the real patterns
> - no framework, that's the midlayer mistake, see
>   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
>   all the articles linked from there.

OK, thanks for your recommendations and the references.

> > > do we really need all that complexity with 2nd process 
> > 
> > If we drop module unload option then no, we don't need 2nd process.  
> 
> Why does module unload require a 2nd process? We don't need a 2nd process
> in our other module unload tests either.

That's not longer the problem as we're going to drop the module unload step, 
but just to provide you with an explanation of my approach:
In case of the spin workload, references are held after the workload crashes 
and it's not possible to unload the module unless we put them.   Since those 
references are internal to IGT libraries and not exposed to a user, putting 
them is only possible with functions provided by IGT.  Those functions are 
full of checks affecting subtest results and using them to clean up resources 
related to a no longer existing device would result in a subtest failure or 
skip at least.  The most simple way to get rid of those issues is to enclose 
those references in a subprocess and wait for their automatic release on its 
completion.

> > > and watchers 
> > 
> > That was primarily needed for successful module unload.  If we drop that 
> > option and you think driver rebind / bus rescan operations can be performed 
> > blindly, without checking for completion of background workload, then I can 
> > drop the watchers.
> 
> Well we _have_ to do unbinds without checking the background workload has
> completed. That's the entire point of testing hotunplug. 

I agree, and the test performs all unbinds that way, i.e., without checking 
the background workload has completed.  Waiting for background workload 
completion applies only to what I'm considering a device recovery phase, and 
not to the "main" unbind/unplug test phase in any way.

> It's also why
> there's lots of work to do here, because the kernel is totally not ready
> for this.
> 
> First stopping everything and then unloading isn't an interesting test,

Since its introduction, the module unload step was intended as a part of a 
post-subtest device recovery phase, not the subtest merit.  I added that step 
because I thought that would be the most reliable way of satisfying the CI 
requirement on restoring the device to the state ready for next tests without 
reboot or real device power-on reset on real hardware bus replug.

> that's more or less exactly what our various module unload tests are
> doing already.

Yes, and in v5-v7 I was even using the existing i915_module_load test as an 
external helper command performing device recovery and healthcheck phases in 
order to avoid reimplementing its code here.

> > > and a bunch of callbacks and everything, just do to a hotremove testcase?
> > 
> > I can still drop the framework and switch back to the initial simple structure 
> > with one or two fixed subtests if you don't like my structural approach.
> 
> See above for why, I think that will result in better code in the end.

OK.

> > > Very first patch looked much more reasonable, aside from that it broke CI
> > > since it didn't rebind the driver. 
> > 
> > Sorry, my understanding of your and Petri's comments was a bit different, I 
> > thought that by more than best effort you meant doing everything possible to 
> > restore the device to be ready for next test without reboot, and module unload 
> > and reload seemed the most reliable option to me.  Now I can see that there 
> > were probably two different requirements.  You were considering the test 
> > incomplete because it was performing only the unbind/unplug part and not 
> > rebind/rescan, while Petri was probably interested mostly in the device being 
> > ready for next tests without reboot, no matter which way.
> 
> Well it's the same request, and rebind/rescan /should/ result in a working
> device again. If not, then I guess we also have a bug on our hotreplug
> code. Which again is worth testing for.
> 
> > > We can always add complexity later on
> > > once we have dma-buf/dma-fence/kms/whatever else substests here.
> > 
> > OK, as you wish.
> > 
> > > Also, I think we should have at least one hotremove-only-nothing-special
> > > subtest here, i.e. without even the busy batch.
> > 
> > That seems trivial to adjust the framework so it accepted NULL workload, if 
> > the framework survived. Anyway, I'll do that.  Should I put it in a separate 
> > NULL workload subtest function to be called from igt_main?  Or add it to the 
> > spin workload subtest function specifically as an option?
> 
> Separate test as the first subtest.

OK.

> Maybe even include the "shut
> everything off first" logic from module unload,

Do you mean i915_module_load.c?

> to have the most baseline test possible.
> 
> > > I'm also not sure why we also put module unload tests in there. 
> > 
> > As I tried to explain above, I introduced module unload in order to satisfy 
> > the CI requirement on the device being ready for next test without reboot as 
> > much as possible.
> 
> Hm, but why? What does module reload help in this regard that a rebind
> can't do? Aside from testing module reload, which is a developer feature
> and already tested elsewhere.

As I said, I decided to use module unload as I thought it would be the most 
reliable way of device recovery from simulated unplug in case no real power-on 
reset is performed.  I didn't insist on keeping it there at all, I only tried 
to explain why I did that.  As that can't help in any way to recover the 
device so it is ready for next tests as CI requested then I'll be happy to 
remove it and stick to pure driver rebind / bus rescan operations.

> I'm also not seeing much interactions between hotunplug and module unload.
> 
> The one interesting testcase I see is trying to unload the module after we
> hotunplugged, while the driver is still in use somewhere (open drm fd,
> open dma-buf fd, open dma-fence fd). That should result in a failure, and
> it's useful to validate that the kernel is handling the module refcounting
> correctly in all these cases. But that's a specific negative testcase (and
> actually being able to unload would be a failure and likely result in a
> kernel oops), I'm not seeing the benefit of reloading the module.

That's perfectly clear for me, the optional module unload step will not be 
there on next iteration.

> > > Compared
> > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > or whatever), which is something users can do, module unload is explicitly a 
> > > developer only feature.
> > 
> > My approach was to be able to test driver behavior under any hot unload 
> > operation available to a user, no matter if developer oriented or not, so we 
> > can make the driver resistant to users performing potentially dangerous hot 
> > unbind/unplug operations available to them, intentionally or not.
> 
> Yes I agree with that, we need to test hotunplug.
> 
> btw the real fun isn't the unbind in sysfs, but physically unplugging a
> pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> and the best way to test that hotunplug is through the sysfs unbind
> support (it's not exactly the same since this way we'll never see failing
> pci transactions, which are an entirely different kind of fun).

I fully agree, however please note that what I'm calling device hot unplug is 
probably still an interesting sysfs option aside driver hot unbind.

> > > We do not expect module unload to work under all
> > > possible conditions (it doesn't). 
> > 
> > Do you think that driver rebind operation has more chances to succeed, 
> > especially on a device on which a bus unplug operation was not actually 
> > performed but only simulated via sysfs, on a device which then has been left 
> > in an unpredictable state and hasn't undergone a hardware power-on reset on 
> > physical bus re-plug?
> 
> There's definitely potential for bugs, but I don't see how module reload
> helps. Module reload is essentially:
> 
> - unbind devices
> - unload module
> - reload module
> - rebind all devices
> 
> The only additional magic that module unload can paper over is that it's
> disallowed while anyone is still using any devices (assuming the module
> refcount code is correct). That's not the case for unbind/hotunplug. But
> that's it, there's no additional magic code being run when you unload the
> module. Hence why I don't understand why you want to do that.

Not any longer :-)

> > > I'd drop that part and focus completely
> > > on the hotremove/unbind testcase here.
> > 
> > Driver unbind / device unplug via sysfs can also be considered developer only 
> > features. Do you think we should drop driver unbind option, leaving only 
> > device unplug via sysfs for which we may have no good non-developer 
> > alternative?
> 
> Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> user action. That's why we care.
> 
> We didn't care for unbind (I wontfix closed all the bugs myself) while
> intel only created built-in gpus because it's indeed fairly pointless to
> unbind these.
> 
> Other bit I don't quite get: What's the difference between unbind and
> unplug?

I'm not sure what information you are missing.  What the test is doing is:

driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
vs.
virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove

Panic call traces look a bit different, you may want to compare the following two:
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html

rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
vs.
replug: echo 1 >/sys/bus/<bus>/rescan

With no panics accompanying driver unbind / device unplug under active spin 
workload on older hardware, the recovery phase is however still giving a 
different result for each of those two methods:
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html

BTW, trybot results confirm that module unload really doesn't help:
https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html

Thanks,
Janusz


> -Daniel
> 
> > 
> > Thanks,
> > Janusz
> > 
> > 
> > > -Daniel
> > > 
> > > > ---
> > > >  tests/Makefile.sources  |   1 +
> > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > >  tests/meson.build       |   1 +
> > > >  3 files changed, 410 insertions(+)
> > > >  create mode 100644 tests/core_hot_reload.c
> > > > 
> > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > index 7f921f6c..452d8ed7 100644
> > > > --- a/tests/Makefile.sources
> > > > +++ b/tests/Makefile.sources
> > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > >  	core_getclient \
> > > >  	core_getstats \
> > > >  	core_getversion \
> > > > +	core_hot_reload \
> > > >  	core_setmaster_vs_auth \
> > > >  	debugfs_test \
> > > >  	drm_import_export \
> > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > new file mode 100644
> > > > index 00000000..6673f55c
> > > > --- /dev/null
> > > > +++ b/tests/core_hot_reload.c
> > > > @@ -0,0 +1,408 @@
> > > > +/*
> > > > + * Copyright © 2019 Intel Corporation
> > > > + *
> > > > + * Permission is hereby granted, free of charge, to any person obtaining 
> > a
> > > > + * copy of this software and associated documentation files (the 
> > "Software"),
> > > > + * to deal in the Software without restriction, including without 
> > limitation
> > > > + * the rights to use, copy, modify, merge, publish, distribute, 
> > sublicense,
> > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > + * Software is furnished to do so, subject to the following conditions:
> > > > + *
> > > > + * The above copyright notice and this permission notice (including the 
> > next
> > > > + * paragraph) shall be included in all copies or substantial portions of 
> > the
> > > > + * Software.
> > > > + *
> > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
> > EXPRESS OR
> > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
> > MERCHANTABILITY,
> > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT 
> > SHALL
> > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR 
> > OTHER
> > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
> > ARISING
> > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
> > DEALINGS
> > > > + * IN THE SOFTWARE.
> > > > + */
> > > > +
> > > > +#include "igt.h"
> > > > +#include "igt_device.h"
> > > > +#include "igt_dummyload.h"
> > > > +#include "igt_kmod.h"
> > > > +#include "igt_sysfs.h"
> > > > +
> > > > +#include <getopt.h>
> > > > +#include <limits.h>
> > > > +#include <string.h>
> > > > +#include <unistd.h>
> > > > +
> > > > +#include <sys/types.h>
> > > > +#include <sys/wait.h>
> > > > +
> > > > +/**
> > > > + * A post-action device recovery function:
> > > > + * @priv: a pointer to private data required for device recovery
> > > > + *
> > > > + * Make the device re-appear
> > > > + */
> > > > +typedef void (*recover_t)(const void *priv);
> > > > +
> > > > +/**
> > > > + * A test action function:
> > > > + * @dir: file descriptor of an open device sysfs directory
> > > > + * @module: module name, non-NULL indicates post-action module unload 
> > requested
> > > > + * @recover: for returning a pointer to a post-action device recovery 
> > function
> > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > + *
> > > > + * Make the device disappear
> > > > + */
> > > > +typedef void (*action_t)(int device, const char *module,
> > > > +			 recover_t *recover, const void **priv);
> > > > +
> > > > +/**
> > > > + * A workload completion wait function:
> > > > + * @device: open device file descriptor
> > > > + * @priv: a pointer to private data required by the wait function
> > > > + *
> > > > + * Wait for completion of background workload
> > > > + */
> > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > +
> > > > +/**
> > > > + * A workload function:
> > > > + * @device: open device file descriptor
> > > > + * @arg: a optional string argument passed to the workload function
> > > > + * @workload_wait: for returning a pointer to workload completion wait 
> > function
> > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > + *
> > > > + * Put some long lasting load on the device
> > > > + */
> > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > +			   workload_wait_t *workload_wait, void 
> > **priv);
> > > > +
> > > > +/**
> > > > + * Pairs of test action / device recovery functions
> > > > + */
> > > > +
> > > > +/* Unbind / re-bind */
> > > > +
> > > > +struct rebind_data {
> > > > +	int driver;	/* open file descriptor of driver sysfs directory */
> > > > +	char *device;	/* bus specific device address as string */
> > > > +};
> > > > +
> > > > +/* Re-bind the driver to the device */
> > > > +static void driver_bind(const void *priv)
> > > > +{
> > > > +	const struct rebind_data *data = priv;
> > > > +
> > > > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > > > +	igt_sysfs_set(data->driver, "bind", data->device);
> > > > +
> > > > +	close(data->driver);
> > > > +}
> > > > +
> > > > +/* Unbind the driver from the device */
> > > > +static void driver_unbind(int device, const char *module,
> > > > +			  recover_t *recover, const void **priv)
> > > > +{
> > > > +	static char path[PATH_MAX];
> > > > +	static struct rebind_data data;
> > > > +	int len;
> > > > +
> > > > +	/* collect information required for driver bind/unbind */
> > > > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > +	igt_assert(data.driver >= 0);
> > > > +
> > > > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > +	path[len] = '\0';
> > > > +	data.device = strrchr(path, '/') + 1;
> > > > +
> > > > +	/* unbind the driver */
> > > > +	igt_set_timeout(60, "Driver unbind timeout!");
> > > > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > > > +
> > > > +	/* pass back info on how to recover the device */
> > > > +	if (module) {
> > > > +		/* don't try to rebind if module will be unloaded */
> > > > +		*recover = NULL;
> > > > +	} else {
> > > > +		*recover = driver_bind;
> > > > +		*priv = &data;
> > > > +	}
> > > > +}
> > > > +
> > > > +/* Unplug / re-plug */
> > > > +
> > > > +/* Re-discover the device by rescanning its bus */
> > > > +static void bus_rescan(const void *priv)
> > > > +{
> > > > +	const int *bus = priv;
> > > > +
> > > > +	igt_set_timeout(60, "Bus rescan timeout!");
> > > > +	igt_sysfs_set(*bus, "rescan", "1");
> > > > +
> > > > +	close(*bus);
> > > > +}
> > > > +
> > > > +/* Remove (virtually unplug) the device from its bus */
> > > > +static void device_unplug(int device, const char *module,
> > > > +			  recover_t *recover, const void **priv)
> > > > +{
> > > > +	static int bus;
> > > > +
> > > > +	/* collect information required for bus rescan */
> > > > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > +	igt_assert(bus >= 0);
> > > > +
> > > > +	/* remove the device */
> > > > +	igt_set_timeout(60, "Device unplug timeout!");
> > > > +	igt_sysfs_set(device, "device/remove", "1");
> > > > +
> > > > +	/* pass back info on how to recover the device */
> > > > +	*recover = bus_rescan;
> > > > +	*priv = &bus;
> > > > +}
> > > > +
> > > > +/* Each test action function must be registered in the following table */
> > > > +static const struct {
> > > > +	const char *name;	/* unique test action name used in test 
> > names */
> > > > +	action_t function;	/* test action function pointer */
> > > > +} actions[] = {
> > > > +	{ "unbind", driver_unbind, },
> > > > +	{ "unplug", device_unplug, },
> > > > +};
> > > > +
> > > > +/**
> > > > + * Pairs of workload / wait completion functions
> > > > + */
> > > > +
> > > > +/* A workload using igt_spin_run() */
> > > > +
> > > > +/* Wait for completaion of dummy load */
> > > > +static void dummy_wait(int device, void *priv)
> > > > +{
> > > > +	igt_spin_t *spin = priv;
> > > > +
> > > > +	/* wait until the spin no longer runs, don't fail on error */
> > > > +	if (gem_wait(device, spin->handle, NULL))
> > > > +		__gem_set_domain(device, spin->handle,
> > > > +				 I915_GEM_DOMAIN_GTT, 
> > I915_GEM_DOMAIN_GTT);
> > > > +}
> > > > +
> > > > +/* Run dummy load */
> > > > +static void dummy_load(int device, const char *arg,
> > > > +		       workload_wait_t *workload_wait, void **priv)
> > > > +{
> > > > +	igt_spin_t *spin;
> > > > +
> > > > +	/* submit a job */
> > > > +	spin = igt_spin_new(device);
> > > > +
> > > > +	*workload_wait = dummy_wait;
> > > > +	*priv = spin;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Each workload function must be registered in the following table.
> > > > + * A function may be registered more than once under different workload 
> > names,
> > > > + * that makes sense as long as a different argument is specified for each 
> > name.
> > > > + */
> > > > +static const struct {
> > > > +	const char *name;	/* unique workload name used in test names 
> > */
> > > > +	workload_t function;	/* workload function pointer */
> > > > +	const char *arg;	/* optional constant string argument */
> > > > +} workloads[] = {
> > > > +	{ "spin", dummy_load, NULL, },
> > > > +};
> > > > +
> > > > +/**
> > > > + * Framework
> > > > + */
> > > > +
> > > > +static void healthcheck(int chipset)
> > > > +{
> > > > +	int device;
> > > > +
> > > > +	device = __drm_open_driver(chipset);
> > > > +	igt_assert(device >= 0);
> > > > +
> > > > +	if (chipset == DRIVER_INTEL)
> > > > +		gem_test_engine(device, ALL_ENGINES);
> > > > +
> > > > +	close(device);
> > > > +}
> > > > +
> > > > +static void module_unload(int chipset, const char *module)
> > > > +{
> > > > +	if (chipset == DRIVER_INTEL)
> > > > +		igt_assert(igt_i915_driver_unload() == 
> > IGT_EXIT_SUCCESS);
> > > > +	else
> > > > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > +}
> > > > +
> > > > +static void run_action(int device, action_t action, const char *module,
> > > > +		      recover_t *recover, const void **priv)
> > > > +{
> > > > +	int dir;
> > > > +
> > > > +	dir = igt_sysfs_open(device);
> > > > +	igt_assert(dir >= 0);
> > > > +
> > > > +	action(dir, module, recover, priv);
> > > > +
> > > > +	close(dir);
> > > > +}
> > > > +
> > > > +static void wait_helper(int device, void *priv)
> > > > +{
> > > > +	struct igt_helper_process *proc = priv;
> > > > +
> > > > +	/* wait until the workload subprocess has completed */
> > > > +	igt_ignore_warn(igt_wait_helper(proc));
> > > > +}
> > > > +
> > > > +static void run_workload(int device, workload_t workload, const char 
> > *arg,
> > > > +			 const char *module, workload_wait_t 
> > *workload_wait,
> > > > +			 void **priv)
> > > > +{
> > > > +	if (module) {
> > > > +		/* run workload in a subprocess so the module is put on 
> > crash */
> > > > +		static struct igt_helper_process proc;
> > > > +		int wstatus, ret;
> > > > +
> > > > +		bzero(&proc, sizeof(proc));
> > > > +
> > > > +		igt_fork_helper(&proc) {
> > > > +			/* suppress igt_log messages */
> > > > +			igt_log_level = IGT_LOG_NONE;
> > > > +
> > > > +			/* intercept igt_fail/skip() long jumps */
> > > > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) 
> > {
> > > > +				workload(device, arg, 
> > workload_wait, priv);
> > > > +
> > > > +				(*workload_wait)(device, 
> > *priv);
> > > > +
> > > > +				/* success if not diverted by 
> > igt_fail/skip() */
> > > > +				igt_success();
> > > > +			}
> > > > +
> > > > +			/* pass exit code back to the caller */
> > > > +			igt_exit();
> > > > +		}
> > > > +		/* let the background process start doing its job or 
> > fail */
> > > > +		sleep(2);
> > > > +		/* fail or skip on workload premature completion */
> > > > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > +		if (ret < 0)
> > > > +			igt_fail(IGT_EXIT_INVALID);
> > > > +		if (ret) {
> > > > +			if (!WIFEXITED(wstatus))
> > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > +			if (WEXITSTATUS(wstatus) == 
> > IGT_EXIT_SUCCESS)
> > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > +				igt_skip(NULL);
> > > > +			igt_fail(WEXITSTATUS(wstatus));
> > > > +		}
> > > > +
> > > > +		/* pass back info on how to wait for helper completion 
> > */
> > > > +		*workload_wait = wait_helper;
> > > > +		*priv = &proc;
> > > > +	} else {
> > > > +		/* run the requested workload directly */
> > > > +		workload(device, arg, workload_wait, priv);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void run_subtest(int chipset, int workload, int action,
> > > > +			const char *module)
> > > > +{
> > > > +	workload_wait_t workload_wait;
> > > > +	void *workload_priv;
> > > > +	recover_t recover;
> > > > +	const void *recover_priv;
> > > > +	int device;
> > > > +
> > > > +	igt_subtest_f("%s-%s%s", workloads[workload].name, 
> > actions[action].name,
> > > > +		      module ? "-unload" : "") {
> > > > +		device = __drm_open_driver(chipset);
> > > > +		igt_assert(device >= 0);
> > > > +
> > > > +		/* spawn the requested workload */
> > > > +		igt_debug("spawning background workload\n");
> > > > +		run_workload(device, workloads[workload].function,
> > > > +			     workloads[workload].arg, module,
> > > > +			     &workload_wait, &workload_priv);
> > > > +
> > > > +		/* run the requested test action */
> > > > +		igt_debug("running test action\n");
> > > > +		run_action(device, actions[action].function, module,
> > > > +			   &recover, &recover_priv);
> > > > +
> > > > +		if (workload_wait) {
> > > > +			igt_debug("waiting for workload 
> > completion\n");
> > > > +			workload_wait(device, workload_priv);
> > > > +		}
> > > > +
> > > > +		close(device);
> > > > +
> > > > +		if (module) {
> > > > +			igt_debug("unloading %s\n", module);
> > > > +			module_unload(chipset, module);
> > > > +		}
> > > > +
> > > > +		if (recover) {
> > > > +			igt_debug("recovering device\n");
> > > > +			recover(recover_priv);
> > > > +			igt_reset_timeout();
> > > > +		}
> > > > +
> > > > +		igt_debug("running healthcheck\n");
> > > > +		healthcheck(chipset);
> > > > +	}
> > > > +}
> > > > +
> > > > +igt_main {
> > > > +	int device, chipset;
> > > > +	char *module;
> > > > +	int i, j;
> > > > +
> > > > +	igt_fixture {
> > > > +		char path[PATH_MAX];
> > > > +		int dir, len;
> > > > +
> > > > +		/**
> > > > +		 * Since some subtests depend on successful unload of a 
> > driver
> > > > +		 * module, don't use drm_open_driver() as it keeps a 
> > device file
> > > > +		 * descriptor open for exit handler use and that 
> > effectively
> > > > +		 * prevents the module from being unloaded.
> > > > +		 */
> > > > +		device = __drm_open_driver(DRIVER_ANY);
> > > > +		igt_assert(device >= 0);
> > > > +
> > > > +		if (is_i915_device(device)) {
> > > > +			chipset = DRIVER_INTEL;
> > > > +			module = strdup("i915");
> > > > +		} else {
> > > > +			chipset = DRIVER_ANY;
> > > > +
> > > > +			/* Capture module name to be unloaded */
> > > > +			dir = igt_sysfs_open(device);
> > > > +			len = readlinkat(dir, "device/driver/
> > module", path,
> > > > +					 sizeof(path) - 1);
> > > > +			close(dir);
> > > > +			path[len] = '\0';
> > > > +			module = strdup(strrchr(path, '/') + 1);
> > > > +		}
> > > > +		close(device);
> > > > +
> > > > +		igt_info("Running the test on driver \"%s\", chipset 
> > mask %#0x\n",
> > > > +			 module, chipset);
> > > > +	}
> > > > +
> > > > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
> > +) {
> > > > +			/* with module unload */
> > > > +			run_subtest(chipset, i, j, module);
> > > > +			/* without module unload */
> > > > +			run_subtest(chipset, i, j, NULL);
> > > > +		}
> > > > +	}
> > > > +}
> > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > index 711979b4..0d418035 100644
> > > > --- a/tests/meson.build
> > > > +++ b/tests/meson.build
> > > > @@ -3,6 +3,7 @@ test_progs = [
> > > >  	'core_getclient',
> > > >  	'core_getstats',
> > > >  	'core_getversion',
> > > > +	'core_hot_reload',
> > > >  	'core_setmaster_vs_auth',
> > > >  	'debugfs_test',
> > > >  	'drm_import_export',
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 




_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-07  6:24         ` Janusz Krzysztofik
@ 2019-05-07  9:14           ` Daniel Vetter
  2019-05-07 10:44             ` Janusz Krzysztofik
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2019-05-07  9:14 UTC (permalink / raw)
  To: Janusz Krzysztofik; +Cc: igt-dev, Petri Latvala, Daniel Vetter

On Tue, May 07, 2019 at 08:24:30AM +0200, Janusz Krzysztofik wrote:
> On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> > On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > > Hi Daniel,
> > > 
> > > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > > From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > > 
> > > > > Put some workload on a device, then try to either remove (unplug) the
> > > > > device from its bus, or unbind the device's driver from it, possibly
> > > > > followed by module unload, depending on which specific subtest has been
> > > > > selected.  If succeeded, rescan the device's bus if needed and perform
> > > > > health checks on the device with the driver possibly loaded back.
> > > > > 
> > > > > If module unload is requested, the workload is run in a sub-process,
> > > > > not directly from the test, as it is expected to crash while still
> > > > > keeping the device open for as long as its process has not exited.
> > > > > 
> > > > > The driver hot unbind / device hot unplug operation is expected to
> > > > > succeed and the background workload sub-process to crash in a
> > > > > reasonable time, however long timeouts are used to let kernel level
> > > > > timeouts pop up first if hit by a bug.
> > > > > 
> > > > > The driver is ready for extending it with an arbitrary workload
> > > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > > other hardware, unless they provide their implementation of
> > > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > > 
> > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > 
> > > > High level comments and apologies that I didn't look at v2-v7 in between.
> 
> v1-v3 were submitted internally, so you actually joined and commented first on 
> first public submission which I should have had marked as v4 (I hadn't, 
> sorry).
> 
> > > > 
> > > > This all seems extremely complex for a simple batch spinner subtest ...
> > > 
> > > My initial intention was to build a simple hot unplug/unbind only test. I 
> > > proposed to use an arbitrary external command as a workload.  Then, on 
> > > Antonio's advice, I switched to the spinner based internal workload and I 
> > > agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
> > > the scope of the test with device recovery and health checking.  Also, a few 
> > > people, including you, Daniel, requested availability of more workload type 
> > > options.  As a result, I've decided to build a *framework* for testing driver 
> > > unbind + rebind / device unplug + bus rescan behavior under different workload 
> > > types, easily extendable with more workload options as needed, with one 
> > > example workload type - dummy load or spin batch - initially implemented.  
> > > That was at least my intention for v6-8.  I wouldn't call it a simple batch 
> > > spinner subtest any longer.
> > 
> > Maybe my review got wrong, but I just meant that there's more tests to
> > write here.
> 
> That was clear for me, however I probably misunderstood your intentions in 
> regard to device/driver recovery after successful unplug/unbind.
> 
> > Generally I think having the framework/generic solution before
> > you have all the applications is the wrong way to build something. Usually
> > it results in something which is generic in all the wrong ways, but not in
> > the ones you will actually need. So complexity with no gain. Better to
> > - add a few tests first with copypasting/minimal changes
> > - refactor helpers once you see the real patterns
> > - no framework, that's the midlayer mistake, see
> >   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
> >   all the articles linked from there.
> 
> OK, thanks for your recommendations and the references.
> 
> > > > do we really need all that complexity with 2nd process 
> > > 
> > > If we drop module unload option then no, we don't need 2nd process.  
> > 
> > Why does module unload require a 2nd process? We don't need a 2nd process
> > in our other module unload tests either.
> 
> That's not longer the problem as we're going to drop the module unload step, 
> but just to provide you with an explanation of my approach:
> In case of the spin workload, references are held after the workload crashes 
> and it's not possible to unload the module unless we put them.   Since those 
> references are internal to IGT libraries and not exposed to a user, putting 
> them is only possible with functions provided by IGT.  Those functions are 
> full of checks affecting subtest results and using them to clean up resources 
> related to a no longer existing device would result in a subtest failure or 
> skip at least.  The most simple way to get rid of those issues is to enclose 
> those references in a subprocess and wait for their automatic release on its 
> completion.


Hm which references? Closing the file descriptors is all we should need to
be doing to make the module unloadable. I think an explicit helper
function to do that (exported from core lib) is much better than killing a
process (or waiting for that process to die). It's more explicit code at
least (and that's generally better for testcases).

> > > > and watchers 
> > > 
> > > That was primarily needed for successful module unload.  If we drop that 
> > > option and you think driver rebind / bus rescan operations can be performed 
> > > blindly, without checking for completion of background workload, then I can 
> > > drop the watchers.
> > 
> > Well we _have_ to do unbinds without checking the background workload has
> > completed. That's the entire point of testing hotunplug. 
> 
> I agree, and the test performs all unbinds that way, i.e., without checking 
> the background workload has completed.  Waiting for background workload 
> completion applies only to what I'm considering a device recovery phase, and 
> not to the "main" unbind/unplug test phase in any way.

Ah ok. At least for rescan I think would also make sense to not wait,
that's another interesting (and even more evil) testcase. This would check
for issues around assigning device node minor numbers. We'd only need one
such case, and all it needs to do really is keep the drm device fd open.

> > It's also why
> > there's lots of work to do here, because the kernel is totally not ready
> > for this.
> > 
> > First stopping everything and then unloading isn't an interesting test,
> 
> Since its introduction, the module unload step was intended as a part of a 
> post-subtest device recovery phase, not the subtest merit.  I added that step 
> because I thought that would be the most reliable way of satisfying the CI 
> requirement on restoring the device to the state ready for next tests without 
> reboot or real device power-on reset on real hardware bus replug.

Yes I understand that. But what are you trying to recover from with a
module reload? Just code sharing as you explain below, or other reasons?

> > that's more or less exactly what our various module unload tests are
> > doing already.
> 
> Yes, and in v5-v7 I was even using the existing i915_module_load test as an 
> external helper command performing device recovery and healthcheck phases in 
> order to avoid reimplementing its code here.
> 
> > > > and a bunch of callbacks and everything, just do to a hotremove testcase?
> > > 
> > > I can still drop the framework and switch back to the initial simple structure 
> > > with one or two fixed subtests if you don't like my structural approach.
> > 
> > See above for why, I think that will result in better code in the end.
> 
> OK.
> 
> > > > Very first patch looked much more reasonable, aside from that it broke CI
> > > > since it didn't rebind the driver. 
> > > 
> > > Sorry, my understanding of your and Petri's comments was a bit different, I 
> > > thought that by more than best effort you meant doing everything possible to 
> > > restore the device to be ready for next test without reboot, and module unload 
> > > and reload seemed the most reliable option to me.  Now I can see that there 
> > > were probably two different requirements.  You were considering the test 
> > > incomplete because it was performing only the unbind/unplug part and not 
> > > rebind/rescan, while Petri was probably interested mostly in the device being 
> > > ready for next tests without reboot, no matter which way.
> > 
> > Well it's the same request, and rebind/rescan /should/ result in a working
> > device again. If not, then I guess we also have a bug on our hotreplug
> > code. Which again is worth testing for.
> > 
> > > > We can always add complexity later on
> > > > once we have dma-buf/dma-fence/kms/whatever else substests here.
> > > 
> > > OK, as you wish.
> > > 
> > > > Also, I think we should have at least one hotremove-only-nothing-special
> > > > subtest here, i.e. without even the busy batch.
> > > 
> > > That seems trivial to adjust the framework so it accepted NULL workload, if 
> > > the framework survived. Anyway, I'll do that.  Should I put it in a separate 
> > > NULL workload subtest function to be called from igt_main?  Or add it to the 
> > > spin workload subtest function specifically as an option?
> > 
> > Separate test as the first subtest.
> 
> OK.
> 
> > Maybe even include the "shut
> > everything off first" logic from module unload,
> 
> Do you mean i915_module_load.c?

On 2nd thought, for a hotunplug we shouldn't need any of that. Those "shut
everything off first" steps are just to lower the module use count, so
we're allowed to unload the module. The unplug code we run should take
care of all that already for us for a (fake) hotunplug. So module reload
should just work. But then I still don't understand where you see the
benefit in unloading/reloading the module.

> > to have the most baseline test possible.
> > 
> > > > I'm also not sure why we also put module unload tests in there. 
> > > 
> > > As I tried to explain above, I introduced module unload in order to satisfy 
> > > the CI requirement on the device being ready for next test without reboot as 
> > > much as possible.
> > 
> > Hm, but why? What does module reload help in this regard that a rebind
> > can't do? Aside from testing module reload, which is a developer feature
> > and already tested elsewhere.
> 
> As I said, I decided to use module unload as I thought it would be the most 
> reliable way of device recovery from simulated unplug in case no real power-on 
> reset is performed.  I didn't insist on keeping it there at all, I only tried 
> to explain why I did that.  As that can't help in any way to recover the 
> device so it is ready for next tests as CI requested then I'll be happy to 
> remove it and stick to pure driver rebind / bus rescan operations.
> 
> > I'm also not seeing much interactions between hotunplug and module unload.
> > 
> > The one interesting testcase I see is trying to unload the module after we
> > hotunplugged, while the driver is still in use somewhere (open drm fd,
> > open dma-buf fd, open dma-fence fd). That should result in a failure, and
> > it's useful to validate that the kernel is handling the module refcounting
> > correctly in all these cases. But that's a specific negative testcase (and
> > actually being able to unload would be a failure and likely result in a
> > kernel oops), I'm not seeing the benefit of reloading the module.
> 
> That's perfectly clear for me, the optional module unload step will not be 
> there on next iteration.

That might be overshooting slightly. There is at least one interesting
hotunplug testcase involving module unload. But it's more a special case,
not something we need to do for all subtests.
> 
> > > > Compared
> > > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > > or whatever), which is something users can do, module unload is explicitly a 
> > > > developer only feature.
> > > 
> > > My approach was to be able to test driver behavior under any hot unload 
> > > operation available to a user, no matter if developer oriented or not, so we 
> > > can make the driver resistant to users performing potentially dangerous hot 
> > > unbind/unplug operations available to them, intentionally or not.
> > 
> > Yes I agree with that, we need to test hotunplug.
> > 
> > btw the real fun isn't the unbind in sysfs, but physically unplugging a
> > pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> > and the best way to test that hotunplug is through the sysfs unbind
> > support (it's not exactly the same since this way we'll never see failing
> > pci transactions, which are an entirely different kind of fun).
> 
> I fully agree, however please note that what I'm calling device hot unplug is 
> probably still an interesting sysfs option aside driver hot unbind.
> 
> > > > We do not expect module unload to work under all
> > > > possible conditions (it doesn't). 
> > > 
> > > Do you think that driver rebind operation has more chances to succeed, 
> > > especially on a device on which a bus unplug operation was not actually 
> > > performed but only simulated via sysfs, on a device which then has been left 
> > > in an unpredictable state and hasn't undergone a hardware power-on reset on 
> > > physical bus re-plug?
> > 
> > There's definitely potential for bugs, but I don't see how module reload
> > helps. Module reload is essentially:
> > 
> > - unbind devices
> > - unload module
> > - reload module
> > - rebind all devices
> > 
> > The only additional magic that module unload can paper over is that it's
> > disallowed while anyone is still using any devices (assuming the module
> > refcount code is correct). That's not the case for unbind/hotunplug. But
> > that's it, there's no additional magic code being run when you unload the
> > module. Hence why I don't understand why you want to do that.
> 
> Not any longer :-)
> 
> > > > I'd drop that part and focus completely
> > > > on the hotremove/unbind testcase here.
> > > 
> > > Driver unbind / device unplug via sysfs can also be considered developer only 
> > > features. Do you think we should drop driver unbind option, leaving only 
> > > device unplug via sysfs for which we may have no good non-developer 
> > > alternative?
> > 
> > Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> > user action. That's why we care.
> > 
> > We didn't care for unbind (I wontfix closed all the bugs myself) while
> > intel only created built-in gpus because it's indeed fairly pointless to
> > unbind these.
> > 
> > Other bit I don't quite get: What's the difference between unbind and
> > unplug?
> 
> I'm not sure what information you are missing.  What the test is doing is:
> 
> driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
> vs.
> virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove

I was missing the above I guess.

So looking at kernel code the difference is that when we unbind the entire
driver, we loop over all currrently bound devices and do the same as the
remove sysfs file. So kinda redundant, I'd drop the driver unbind
testcase.

Also, are you digging around in the kernel already and trying to
understand what's going on and how it all ties together? And have you
started to look at the bugs this uncovers in the kernel, or who's supposed
to work on that side of this effort?

Bonus points if we unbind the same device as we'd pick for the drm fd
(there's some selection logic, and if you go through /sys/class/drm you
should find the right device). This is relevant for discrete/multi-gpu
systems.

> Panic call traces look a bit different, you may want to compare the following two:
> https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
> https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html

Hm I think we also need a hotunplug testcase that does absolutely nothing
first, i.e. no spin batches, no open drm files, nothing else.

> rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
> vs.
> replug: echo 1 >/sys/bus/<bus>/rescan
> 
> With no panics accompanying driver unbind / device unplug under active spin 
> workload on older hardware, the recovery phase is however still giving a 
> different result for each of those two methods:
> https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
> https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html

Shouldn't really be a difference, but maybe there's timing changes that
slightly influence the outcome.

> BTW, trybot results confirm that module unload really doesn't help:
> https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html

Yeah worst case we have an additional module refcount bug and then module
unload will make things worse. I can't come up with a scenario where
module unload would help (there's reasons it's a developer-only thing,
it's really hard to get right).

Cheers, Daniel


> 
> Thanks,
> Janusz
> 
> 
> > -Daniel
> > 
> > > 
> > > Thanks,
> > > Janusz
> > > 
> > > 
> > > > -Daniel
> > > > 
> > > > > ---
> > > > >  tests/Makefile.sources  |   1 +
> > > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > > >  tests/meson.build       |   1 +
> > > > >  3 files changed, 410 insertions(+)
> > > > >  create mode 100644 tests/core_hot_reload.c
> > > > > 
> > > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > > index 7f921f6c..452d8ed7 100644
> > > > > --- a/tests/Makefile.sources
> > > > > +++ b/tests/Makefile.sources
> > > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > > >  	core_getclient \
> > > > >  	core_getstats \
> > > > >  	core_getversion \
> > > > > +	core_hot_reload \
> > > > >  	core_setmaster_vs_auth \
> > > > >  	debugfs_test \
> > > > >  	drm_import_export \
> > > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > > new file mode 100644
> > > > > index 00000000..6673f55c
> > > > > --- /dev/null
> > > > > +++ b/tests/core_hot_reload.c
> > > > > @@ -0,0 +1,408 @@
> > > > > +/*
> > > > > + * Copyright © 2019 Intel Corporation
> > > > > + *
> > > > > + * Permission is hereby granted, free of charge, to any person obtaining 
> > > a
> > > > > + * copy of this software and associated documentation files (the 
> > > "Software"),
> > > > > + * to deal in the Software without restriction, including without 
> > > limitation
> > > > > + * the rights to use, copy, modify, merge, publish, distribute, 
> > > sublicense,
> > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > + *
> > > > > + * The above copyright notice and this permission notice (including the 
> > > next
> > > > > + * paragraph) shall be included in all copies or substantial portions of 
> > > the
> > > > > + * Software.
> > > > > + *
> > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
> > > EXPRESS OR
> > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
> > > MERCHANTABILITY,
> > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT 
> > > SHALL
> > > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR 
> > > OTHER
> > > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
> > > ARISING
> > > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
> > > DEALINGS
> > > > > + * IN THE SOFTWARE.
> > > > > + */
> > > > > +
> > > > > +#include "igt.h"
> > > > > +#include "igt_device.h"
> > > > > +#include "igt_dummyload.h"
> > > > > +#include "igt_kmod.h"
> > > > > +#include "igt_sysfs.h"
> > > > > +
> > > > > +#include <getopt.h>
> > > > > +#include <limits.h>
> > > > > +#include <string.h>
> > > > > +#include <unistd.h>
> > > > > +
> > > > > +#include <sys/types.h>
> > > > > +#include <sys/wait.h>
> > > > > +
> > > > > +/**
> > > > > + * A post-action device recovery function:
> > > > > + * @priv: a pointer to private data required for device recovery
> > > > > + *
> > > > > + * Make the device re-appear
> > > > > + */
> > > > > +typedef void (*recover_t)(const void *priv);
> > > > > +
> > > > > +/**
> > > > > + * A test action function:
> > > > > + * @dir: file descriptor of an open device sysfs directory
> > > > > + * @module: module name, non-NULL indicates post-action module unload 
> > > requested
> > > > > + * @recover: for returning a pointer to a post-action device recovery 
> > > function
> > > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > > + *
> > > > > + * Make the device disappear
> > > > > + */
> > > > > +typedef void (*action_t)(int device, const char *module,
> > > > > +			 recover_t *recover, const void **priv);
> > > > > +
> > > > > +/**
> > > > > + * A workload completion wait function:
> > > > > + * @device: open device file descriptor
> > > > > + * @priv: a pointer to private data required by the wait function
> > > > > + *
> > > > > + * Wait for completion of background workload
> > > > > + */
> > > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > > +
> > > > > +/**
> > > > > + * A workload function:
> > > > > + * @device: open device file descriptor
> > > > > + * @arg: a optional string argument passed to the workload function
> > > > > + * @workload_wait: for returning a pointer to workload completion wait 
> > > function
> > > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > > + *
> > > > > + * Put some long lasting load on the device
> > > > > + */
> > > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > > +			   workload_wait_t *workload_wait, void 
> > > **priv);
> > > > > +
> > > > > +/**
> > > > > + * Pairs of test action / device recovery functions
> > > > > + */
> > > > > +
> > > > > +/* Unbind / re-bind */
> > > > > +
> > > > > +struct rebind_data {
> > > > > +	int driver;	/* open file descriptor of driver sysfs directory */
> > > > > +	char *device;	/* bus specific device address as string */
> > > > > +};
> > > > > +
> > > > > +/* Re-bind the driver to the device */
> > > > > +static void driver_bind(const void *priv)
> > > > > +{
> > > > > +	const struct rebind_data *data = priv;
> > > > > +
> > > > > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > > > > +	igt_sysfs_set(data->driver, "bind", data->device);
> > > > > +
> > > > > +	close(data->driver);
> > > > > +}
> > > > > +
> > > > > +/* Unbind the driver from the device */
> > > > > +static void driver_unbind(int device, const char *module,
> > > > > +			  recover_t *recover, const void **priv)
> > > > > +{
> > > > > +	static char path[PATH_MAX];
> > > > > +	static struct rebind_data data;
> > > > > +	int len;
> > > > > +
> > > > > +	/* collect information required for driver bind/unbind */
> > > > > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > > +	igt_assert(data.driver >= 0);
> > > > > +
> > > > > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > > +	path[len] = '\0';
> > > > > +	data.device = strrchr(path, '/') + 1;
> > > > > +
> > > > > +	/* unbind the driver */
> > > > > +	igt_set_timeout(60, "Driver unbind timeout!");
> > > > > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > > > > +
> > > > > +	/* pass back info on how to recover the device */
> > > > > +	if (module) {
> > > > > +		/* don't try to rebind if module will be unloaded */
> > > > > +		*recover = NULL;
> > > > > +	} else {
> > > > > +		*recover = driver_bind;
> > > > > +		*priv = &data;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +/* Unplug / re-plug */
> > > > > +
> > > > > +/* Re-discover the device by rescanning its bus */
> > > > > +static void bus_rescan(const void *priv)
> > > > > +{
> > > > > +	const int *bus = priv;
> > > > > +
> > > > > +	igt_set_timeout(60, "Bus rescan timeout!");
> > > > > +	igt_sysfs_set(*bus, "rescan", "1");
> > > > > +
> > > > > +	close(*bus);
> > > > > +}
> > > > > +
> > > > > +/* Remove (virtually unplug) the device from its bus */
> > > > > +static void device_unplug(int device, const char *module,
> > > > > +			  recover_t *recover, const void **priv)
> > > > > +{
> > > > > +	static int bus;
> > > > > +
> > > > > +	/* collect information required for bus rescan */
> > > > > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > > +	igt_assert(bus >= 0);
> > > > > +
> > > > > +	/* remove the device */
> > > > > +	igt_set_timeout(60, "Device unplug timeout!");
> > > > > +	igt_sysfs_set(device, "device/remove", "1");
> > > > > +
> > > > > +	/* pass back info on how to recover the device */
> > > > > +	*recover = bus_rescan;
> > > > > +	*priv = &bus;
> > > > > +}
> > > > > +
> > > > > +/* Each test action function must be registered in the following table */
> > > > > +static const struct {
> > > > > +	const char *name;	/* unique test action name used in test 
> > > names */
> > > > > +	action_t function;	/* test action function pointer */
> > > > > +} actions[] = {
> > > > > +	{ "unbind", driver_unbind, },
> > > > > +	{ "unplug", device_unplug, },
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * Pairs of workload / wait completion functions
> > > > > + */
> > > > > +
> > > > > +/* A workload using igt_spin_run() */
> > > > > +
> > > > > +/* Wait for completaion of dummy load */
> > > > > +static void dummy_wait(int device, void *priv)
> > > > > +{
> > > > > +	igt_spin_t *spin = priv;
> > > > > +
> > > > > +	/* wait until the spin no longer runs, don't fail on error */
> > > > > +	if (gem_wait(device, spin->handle, NULL))
> > > > > +		__gem_set_domain(device, spin->handle,
> > > > > +				 I915_GEM_DOMAIN_GTT, 
> > > I915_GEM_DOMAIN_GTT);
> > > > > +}
> > > > > +
> > > > > +/* Run dummy load */
> > > > > +static void dummy_load(int device, const char *arg,
> > > > > +		       workload_wait_t *workload_wait, void **priv)
> > > > > +{
> > > > > +	igt_spin_t *spin;
> > > > > +
> > > > > +	/* submit a job */
> > > > > +	spin = igt_spin_new(device);
> > > > > +
> > > > > +	*workload_wait = dummy_wait;
> > > > > +	*priv = spin;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Each workload function must be registered in the following table.
> > > > > + * A function may be registered more than once under different workload 
> > > names,
> > > > > + * that makes sense as long as a different argument is specified for each 
> > > name.
> > > > > + */
> > > > > +static const struct {
> > > > > +	const char *name;	/* unique workload name used in test names 
> > > */
> > > > > +	workload_t function;	/* workload function pointer */
> > > > > +	const char *arg;	/* optional constant string argument */
> > > > > +} workloads[] = {
> > > > > +	{ "spin", dummy_load, NULL, },
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * Framework
> > > > > + */
> > > > > +
> > > > > +static void healthcheck(int chipset)
> > > > > +{
> > > > > +	int device;
> > > > > +
> > > > > +	device = __drm_open_driver(chipset);
> > > > > +	igt_assert(device >= 0);
> > > > > +
> > > > > +	if (chipset == DRIVER_INTEL)
> > > > > +		gem_test_engine(device, ALL_ENGINES);
> > > > > +
> > > > > +	close(device);
> > > > > +}
> > > > > +
> > > > > +static void module_unload(int chipset, const char *module)
> > > > > +{
> > > > > +	if (chipset == DRIVER_INTEL)
> > > > > +		igt_assert(igt_i915_driver_unload() == 
> > > IGT_EXIT_SUCCESS);
> > > > > +	else
> > > > > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > > +}
> > > > > +
> > > > > +static void run_action(int device, action_t action, const char *module,
> > > > > +		      recover_t *recover, const void **priv)
> > > > > +{
> > > > > +	int dir;
> > > > > +
> > > > > +	dir = igt_sysfs_open(device);
> > > > > +	igt_assert(dir >= 0);
> > > > > +
> > > > > +	action(dir, module, recover, priv);
> > > > > +
> > > > > +	close(dir);
> > > > > +}
> > > > > +
> > > > > +static void wait_helper(int device, void *priv)
> > > > > +{
> > > > > +	struct igt_helper_process *proc = priv;
> > > > > +
> > > > > +	/* wait until the workload subprocess has completed */
> > > > > +	igt_ignore_warn(igt_wait_helper(proc));
> > > > > +}
> > > > > +
> > > > > +static void run_workload(int device, workload_t workload, const char 
> > > *arg,
> > > > > +			 const char *module, workload_wait_t 
> > > *workload_wait,
> > > > > +			 void **priv)
> > > > > +{
> > > > > +	if (module) {
> > > > > +		/* run workload in a subprocess so the module is put on 
> > > crash */
> > > > > +		static struct igt_helper_process proc;
> > > > > +		int wstatus, ret;
> > > > > +
> > > > > +		bzero(&proc, sizeof(proc));
> > > > > +
> > > > > +		igt_fork_helper(&proc) {
> > > > > +			/* suppress igt_log messages */
> > > > > +			igt_log_level = IGT_LOG_NONE;
> > > > > +
> > > > > +			/* intercept igt_fail/skip() long jumps */
> > > > > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) 
> > > {
> > > > > +				workload(device, arg, 
> > > workload_wait, priv);
> > > > > +
> > > > > +				(*workload_wait)(device, 
> > > *priv);
> > > > > +
> > > > > +				/* success if not diverted by 
> > > igt_fail/skip() */
> > > > > +				igt_success();
> > > > > +			}
> > > > > +
> > > > > +			/* pass exit code back to the caller */
> > > > > +			igt_exit();
> > > > > +		}
> > > > > +		/* let the background process start doing its job or 
> > > fail */
> > > > > +		sleep(2);
> > > > > +		/* fail or skip on workload premature completion */
> > > > > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > > +		if (ret < 0)
> > > > > +			igt_fail(IGT_EXIT_INVALID);
> > > > > +		if (ret) {
> > > > > +			if (!WIFEXITED(wstatus))
> > > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > > +			if (WEXITSTATUS(wstatus) == 
> > > IGT_EXIT_SUCCESS)
> > > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > > +				igt_skip(NULL);
> > > > > +			igt_fail(WEXITSTATUS(wstatus));
> > > > > +		}
> > > > > +
> > > > > +		/* pass back info on how to wait for helper completion 
> > > */
> > > > > +		*workload_wait = wait_helper;
> > > > > +		*priv = &proc;
> > > > > +	} else {
> > > > > +		/* run the requested workload directly */
> > > > > +		workload(device, arg, workload_wait, priv);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static void run_subtest(int chipset, int workload, int action,
> > > > > +			const char *module)
> > > > > +{
> > > > > +	workload_wait_t workload_wait;
> > > > > +	void *workload_priv;
> > > > > +	recover_t recover;
> > > > > +	const void *recover_priv;
> > > > > +	int device;
> > > > > +
> > > > > +	igt_subtest_f("%s-%s%s", workloads[workload].name, 
> > > actions[action].name,
> > > > > +		      module ? "-unload" : "") {
> > > > > +		device = __drm_open_driver(chipset);
> > > > > +		igt_assert(device >= 0);
> > > > > +
> > > > > +		/* spawn the requested workload */
> > > > > +		igt_debug("spawning background workload\n");
> > > > > +		run_workload(device, workloads[workload].function,
> > > > > +			     workloads[workload].arg, module,
> > > > > +			     &workload_wait, &workload_priv);
> > > > > +
> > > > > +		/* run the requested test action */
> > > > > +		igt_debug("running test action\n");
> > > > > +		run_action(device, actions[action].function, module,
> > > > > +			   &recover, &recover_priv);
> > > > > +
> > > > > +		if (workload_wait) {
> > > > > +			igt_debug("waiting for workload 
> > > completion\n");
> > > > > +			workload_wait(device, workload_priv);
> > > > > +		}
> > > > > +
> > > > > +		close(device);
> > > > > +
> > > > > +		if (module) {
> > > > > +			igt_debug("unloading %s\n", module);
> > > > > +			module_unload(chipset, module);
> > > > > +		}
> > > > > +
> > > > > +		if (recover) {
> > > > > +			igt_debug("recovering device\n");
> > > > > +			recover(recover_priv);
> > > > > +			igt_reset_timeout();
> > > > > +		}
> > > > > +
> > > > > +		igt_debug("running healthcheck\n");
> > > > > +		healthcheck(chipset);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +igt_main {
> > > > > +	int device, chipset;
> > > > > +	char *module;
> > > > > +	int i, j;
> > > > > +
> > > > > +	igt_fixture {
> > > > > +		char path[PATH_MAX];
> > > > > +		int dir, len;
> > > > > +
> > > > > +		/**
> > > > > +		 * Since some subtests depend on successful unload of a 
> > > driver
> > > > > +		 * module, don't use drm_open_driver() as it keeps a 
> > > device file
> > > > > +		 * descriptor open for exit handler use and that 
> > > effectively
> > > > > +		 * prevents the module from being unloaded.
> > > > > +		 */
> > > > > +		device = __drm_open_driver(DRIVER_ANY);
> > > > > +		igt_assert(device >= 0);
> > > > > +
> > > > > +		if (is_i915_device(device)) {
> > > > > +			chipset = DRIVER_INTEL;
> > > > > +			module = strdup("i915");
> > > > > +		} else {
> > > > > +			chipset = DRIVER_ANY;
> > > > > +
> > > > > +			/* Capture module name to be unloaded */
> > > > > +			dir = igt_sysfs_open(device);
> > > > > +			len = readlinkat(dir, "device/driver/
> > > module", path,
> > > > > +					 sizeof(path) - 1);
> > > > > +			close(dir);
> > > > > +			path[len] = '\0';
> > > > > +			module = strdup(strrchr(path, '/') + 1);
> > > > > +		}
> > > > > +		close(device);
> > > > > +
> > > > > +		igt_info("Running the test on driver \"%s\", chipset 
> > > mask %#0x\n",
> > > > > +			 module, chipset);
> > > > > +	}
> > > > > +
> > > > > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
> > > +) {
> > > > > +			/* with module unload */
> > > > > +			run_subtest(chipset, i, j, module);
> > > > > +			/* without module unload */
> > > > > +			run_subtest(chipset, i, j, NULL);
> > > > > +		}
> > > > > +	}
> > > > > +}
> > > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > > index 711979b4..0d418035 100644
> > > > > --- a/tests/meson.build
> > > > > +++ b/tests/meson.build
> > > > > @@ -3,6 +3,7 @@ test_progs = [
> > > > >  	'core_getclient',
> > > > >  	'core_getstats',
> > > > >  	'core_getversion',
> > > > > +	'core_hot_reload',
> > > > >  	'core_setmaster_vs_auth',
> > > > >  	'debugfs_test',
> > > > >  	'drm_import_export',
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> 
> 
> 
> 
> _______________________________________________
> igt-dev mailing list
> igt-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/igt-dev

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-07  9:14           ` Daniel Vetter
@ 2019-05-07 10:44             ` Janusz Krzysztofik
  2019-05-07 13:32               ` Daniel Vetter
  0 siblings, 1 reply; 13+ messages in thread
From: Janusz Krzysztofik @ 2019-05-07 10:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: igt-dev, Petri Latvala

On Tuesday, May 7, 2019 11:14:20 AM CEST Daniel Vetter wrote:
> On Tue, May 07, 2019 at 08:24:30AM +0200, Janusz Krzysztofik wrote:
> > On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> > > On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > > > Hi Daniel,
> > > > 
> > > > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > > > From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > > > 
> > > > > > Put some workload on a device, then try to either remove (unplug) the
> > > > > > device from its bus, or unbind the device's driver from it, possibly
> > > > > > followed by module unload, depending on which specific subtest has been
> > > > > > selected.  If succeeded, rescan the device's bus if needed and perform
> > > > > > health checks on the device with the driver possibly loaded back.
> > > > > > 
> > > > > > If module unload is requested, the workload is run in a sub-process,
> > > > > > not directly from the test, as it is expected to crash while still
> > > > > > keeping the device open for as long as its process has not exited.
> > > > > > 
> > > > > > The driver hot unbind / device hot unplug operation is expected to
> > > > > > succeed and the background workload sub-process to crash in a
> > > > > > reasonable time, however long timeouts are used to let kernel level
> > > > > > timeouts pop up first if hit by a bug.
> > > > > > 
> > > > > > The driver is ready for extending it with an arbitrary workload
> > > > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > > > other hardware, unless they provide their implementation of
> > > > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > > > 
> > > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > > 
> > > > > High level comments and apologies that I didn't look at v2-v7 in between.
> > 
> > v1-v3 were submitted internally, so you actually joined and commented first on 
> > first public submission which I should have had marked as v4 (I hadn't, 
> > sorry).
> > 
> > > > > 
> > > > > This all seems extremely complex for a simple batch spinner subtest ...
> > > > 
> > > > My initial intention was to build a simple hot unplug/unbind only test. I 
> > > > proposed to use an arbitrary external command as a workload.  Then, on 
> > > > Antonio's advice, I switched to the spinner based internal workload and I 
> > > > agree that was a good move.  Then, Petri and you, Daniel, requested to extend 
> > > > the scope of the test with device recovery and health checking.  Also, a few 
> > > > people, including you, Daniel, requested availability of more workload type 
> > > > options.  As a result, I've decided to build a *framework* for testing driver 
> > > > unbind + rebind / device unplug + bus rescan behavior under different workload 
> > > > types, easily extendable with more workload options as needed, with one 
> > > > example workload type - dummy load or spin batch - initially implemented.  
> > > > That was at least my intention for v6-8.  I wouldn't call it a simple batch 
> > > > spinner subtest any longer.
> > > 
> > > Maybe my review got wrong, but I just meant that there's more tests to
> > > write here.
> > 
> > That was clear for me, however I probably misunderstood your intentions in 
> > regard to device/driver recovery after successful unplug/unbind.
> > 
> > > Generally I think having the framework/generic solution before
> > > you have all the applications is the wrong way to build something. Usually
> > > it results in something which is generic in all the wrong ways, but not in
> > > the ones you will actually need. So complexity with no gain. Better to
> > > - add a few tests first with copypasting/minimal changes
> > > - refactor helpers once you see the real patterns
> > > - no framework, that's the midlayer mistake, see
> > >   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
> > >   all the articles linked from there.
> > 
> > OK, thanks for your recommendations and the references.
> > 
> > > > > do we really need all that complexity with 2nd process 
> > > > 
> > > > If we drop module unload option then no, we don't need 2nd process.  
> > > 
> > > Why does module unload require a 2nd process? We don't need a 2nd process
> > > in our other module unload tests either.
> > 
> > That's not longer the problem as we're going to drop the module unload step, 
> > but just to provide you with an explanation of my approach:
> > In case of the spin workload, references are held after the workload crashes 
> > and it's not possible to unload the module unless we put them.   Since those 
> > references are internal to IGT libraries and not exposed to a user, putting 
> > them is only possible with functions provided by IGT.  Those functions are 
> > full of checks affecting subtest results and using them to clean up resources 
> > related to a no longer existing device would result in a subtest failure or 
> > skip at least.  The most simple way to get rid of those issues is to enclose 
> > those references in a subprocess and wait for their automatic release on its 
> > completion.
> 
> 
> Hm which references? Closing the file descriptors is all we should need to
> be doing to make the module unloadable. 

That's exactly what I meant.  Unfortunately some of those file descriptors are 
private to IGT lib.

> I think an explicit helper
> function to do that (exported from core lib) is much better than killing a
> process (or waiting for that process to die). It's more explicit code at
> least (and that's generally better for testcases).

Do you think it's worth of effort to extend core lib with less assertive 
variants of existing functions, useful specifically for the hotunplug test and 
maybe no others?  I have identified quite a few such functions, however with 
the approach of not making to much cleanups before recovery you suggest I'm 
not sure if still needed, maybe only for a subtest with module unload.

> > > > > and watchers 
> > > > 
> > > > That was primarily needed for successful module unload.  If we drop that 
> > > > option and you think driver rebind / bus rescan operations can be performed 
> > > > blindly, without checking for completion of background workload, then I can 
> > > > drop the watchers.
> > > 
> > > Well we _have_ to do unbinds without checking the background workload has
> > > completed. That's the entire point of testing hotunplug. 
> > 
> > I agree, and the test performs all unbinds that way, i.e., without checking 
> > the background workload has completed.  Waiting for background workload 
> > completion applies only to what I'm considering a device recovery phase, and 
> > not to the "main" unbind/unplug test phase in any way.
> 
> Ah ok. At least for rescan I think would also make sense to not wait,
> that's another interesting (and even more evil) testcase. This would check
> for issues around assigning device node minor numbers. We'd only need one
> such case, and all it needs to do really is keep the drm device fd open.

OK.

> > > It's also why
> > > there's lots of work to do here, because the kernel is totally not ready
> > > for this.
> > > 
> > > First stopping everything and then unloading isn't an interesting test,
> > 
> > Since its introduction, the module unload step was intended as a part of a 
> > post-subtest device recovery phase, not the subtest merit.  I added that step 
> > because I thought that would be the most reliable way of satisfying the CI 
> > requirement on restoring the device to the state ready for next tests without 
> > reboot or real device power-on reset on real hardware bus replug.
> 
> Yes I understand that. But what are you trying to recover from with a
> module reload? Just code sharing as you explain below, or other reasons?

Nothing specific.  Oriented on successful recovery of the device so it's ready 
for next tests without reboot, I just intuitively tried to avoid rediscovering 
it, possibly in a completely unpredictable state after the fake unplug, and 
that intuition, probably mixed with my ignorance, suggested me to use module 
unload before bus rescan.

> > > that's more or less exactly what our various module unload tests are
> > > doing already.
> > 
> > Yes, and in v5-v7 I was even using the existing i915_module_load test as an 
> > external helper command performing device recovery and healthcheck phases in 
> > order to avoid reimplementing its code here.
> > 
> > > > > and a bunch of callbacks and everything, just do to a hotremove testcase?
> > > > 
> > > > I can still drop the framework and switch back to the initial simple structure 
> > > > with one or two fixed subtests if you don't like my structural approach.
> > > 
> > > See above for why, I think that will result in better code in the end.
> > 
> > OK.
> > 
> > > > > Very first patch looked much more reasonable, aside from that it broke CI
> > > > > since it didn't rebind the driver. 
> > > > 
> > > > Sorry, my understanding of your and Petri's comments was a bit different, I 
> > > > thought that by more than best effort you meant doing everything possible to 
> > > > restore the device to be ready for next test without reboot, and module unload 
> > > > and reload seemed the most reliable option to me.  Now I can see that there 
> > > > were probably two different requirements.  You were considering the test 
> > > > incomplete because it was performing only the unbind/unplug part and not 
> > > > rebind/rescan, while Petri was probably interested mostly in the device being 
> > > > ready for next tests without reboot, no matter which way.
> > > 
> > > Well it's the same request, and rebind/rescan /should/ result in a working
> > > device again. If not, then I guess we also have a bug on our hotreplug
> > > code. Which again is worth testing for.
> > > 
> > > > > We can always add complexity later on
> > > > > once we have dma-buf/dma-fence/kms/whatever else substests here.
> > > > 
> > > > OK, as you wish.
> > > > 
> > > > > Also, I think we should have at least one hotremove-only-nothing-special
> > > > > subtest here, i.e. without even the busy batch.
> > > > 
> > > > That seems trivial to adjust the framework so it accepted NULL workload, if 
> > > > the framework survived. Anyway, I'll do that.  Should I put it in a separate 
> > > > NULL workload subtest function to be called from igt_main?  Or add it to the 
> > > > spin workload subtest function specifically as an option?
> > > 
> > > Separate test as the first subtest.
> > 
> > OK.
> > 
> > > Maybe even include the "shut
> > > everything off first" logic from module unload,
> > 
> > Do you mean i915_module_load.c?
> 
> On 2nd thought, for a hotunplug we shouldn't need any of that. Those "shut
> everything off first" steps are just to lower the module use count, so
> we're allowed to unload the module. The unplug code we run should take
> care of all that already for us for a (fake) hotunplug. So module reload
> should just work. But then I still don't understand where you see the
> benefit in unloading/reloading the module.

Not any longer ;-)

> > > to have the most baseline test possible.
> > > 
> > > > > I'm also not sure why we also put module unload tests in there. 
> > > > 
> > > > As I tried to explain above, I introduced module unload in order to satisfy 
> > > > the CI requirement on the device being ready for next test without reboot as 
> > > > much as possible.
> > > 
> > > Hm, but why? What does module reload help in this regard that a rebind
> > > can't do? Aside from testing module reload, which is a developer feature
> > > and already tested elsewhere.
> > 
> > As I said, I decided to use module unload as I thought it would be the most 
> > reliable way of device recovery from simulated unplug in case no real power-on 
> > reset is performed.  I didn't insist on keeping it there at all, I only tried 
> > to explain why I did that.  As that can't help in any way to recover the 
> > device so it is ready for next tests as CI requested then I'll be happy to 
> > remove it and stick to pure driver rebind / bus rescan operations.
> > 
> > > I'm also not seeing much interactions between hotunplug and module unload.
> > > 
> > > The one interesting testcase I see is trying to unload the module after we
> > > hotunplugged, while the driver is still in use somewhere (open drm fd,
> > > open dma-buf fd, open dma-fence fd). That should result in a failure, and
> > > it's useful to validate that the kernel is handling the module refcounting
> > > correctly in all these cases. But that's a specific negative testcase (and
> > > actually being able to unload would be a failure and likely result in a
> > > kernel oops), I'm not seeing the benefit of reloading the module.
> > 
> > That's perfectly clear for me, the optional module unload step will not be 
> > there on next iteration.
> 
> That might be overshooting slightly. There is at least one interesting
> hotunplug testcase involving module unload. But it's more a special case,
> not something we need to do for all subtests.

OK, I can try to implement it if I'm sure what you think it should do exactly.

> > > > > Compared
> > > > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > > > or whatever), which is something users can do, module unload is explicitly a 
> > > > > developer only feature.
> > > > 
> > > > My approach was to be able to test driver behavior under any hot unload 
> > > > operation available to a user, no matter if developer oriented or not, so we 
> > > > can make the driver resistant to users performing potentially dangerous hot 
> > > > unbind/unplug operations available to them, intentionally or not.
> > > 
> > > Yes I agree with that, we need to test hotunplug.
> > > 
> > > btw the real fun isn't the unbind in sysfs, but physically unplugging a
> > > pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> > > and the best way to test that hotunplug is through the sysfs unbind
> > > support (it's not exactly the same since this way we'll never see failing
> > > pci transactions, which are an entirely different kind of fun).
> > 
> > I fully agree, however please note that what I'm calling device hot unplug is 
> > probably still an interesting sysfs option aside driver hot unbind.
> > 
> > > > > We do not expect module unload to work under all
> > > > > possible conditions (it doesn't). 
> > > > 
> > > > Do you think that driver rebind operation has more chances to succeed, 
> > > > especially on a device on which a bus unplug operation was not actually 
> > > > performed but only simulated via sysfs, on a device which then has been left 
> > > > in an unpredictable state and hasn't undergone a hardware power-on reset on 
> > > > physical bus re-plug?
> > > 
> > > There's definitely potential for bugs, but I don't see how module reload
> > > helps. Module reload is essentially:
> > > 
> > > - unbind devices
> > > - unload module
> > > - reload module
> > > - rebind all devices
> > > 
> > > The only additional magic that module unload can paper over is that it's
> > > disallowed while anyone is still using any devices (assuming the module
> > > refcount code is correct). That's not the case for unbind/hotunplug. But
> > > that's it, there's no additional magic code being run when you unload the
> > > module. Hence why I don't understand why you want to do that.
> > 
> > Not any longer :-)
> > 
> > > > > I'd drop that part and focus completely
> > > > > on the hotremove/unbind testcase here.
> > > > 
> > > > Driver unbind / device unplug via sysfs can also be considered developer only 
> > > > features. Do you think we should drop driver unbind option, leaving only 
> > > > device unplug via sysfs for which we may have no good non-developer 
> > > > alternative?
> > > 
> > > Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> > > user action. That's why we care.
> > > 
> > > We didn't care for unbind (I wontfix closed all the bugs myself) while
> > > intel only created built-in gpus because it's indeed fairly pointless to
> > > unbind these.
> > > 
> > > Other bit I don't quite get: What's the difference between unbind and
> > > unplug?
> > 
> > I'm not sure what information you are missing.  What the test is doing is:
> > 
> > driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
> > vs.
> > virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove
> 
> I was missing the above I guess.
> 
> So looking at kernel code the difference is that when we unbind the entire
> driver, we loop over all currrently bound devices and do the same as the
> remove sysfs file. So kinda redundant, I'd drop the driver unbind
> testcase.

OK.

> Also, are you digging around in the kernel already and trying to
> understand what's going on and how it all ties together? And have you
> started to look at the bugs this uncovers in the kernel, or who's supposed
> to work on that side of this effort?

I'm trying to.  A first step was:
https://github.com/freedesktop/drm-intel/commit/d69990e0c399e4f7f9b50505d3285e5de991148a

I've already tested two other patches:
https://patchwork.freedesktop.org/series/60053/
https://patchwork.freedesktop.org/series/60051/

Now I'm trying to resolve the GEM_BUG_ON(vma->obj != obj) issue which popped 
up with both above patches applied. I'm not yet sure to what extent it has 
been simply uncovered vs. just triggered by my second patch. 

> Bonus points if we unbind the same device as we'd pick for the drm fd
> (there's some selection logic, and if you go through /sys/class/drm you
> should find the right device). This is relevant for discrete/multi-gpu
> systems.

I think I took that into account seriously enough while planing the subtest 
actions.  Sysfs operations are performed on nodes resolved from the device 
file descriptor.

> > Panic call traces look a bit different, you may want to compare the following two:
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html
> 
> Hm I think we also need a hotunplug testcase that does absolutely nothing
> first, i.e. no spin batches, no open drm files, nothing else.

OK.

> > rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
> > vs.
> > replug: echo 1 >/sys/bus/<bus>/rescan
> > 
> > With no panics accompanying driver unbind / device unplug under active spin 
> > workload on older hardware, the recovery phase is however still giving a 
> > different result for each of those two methods:
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html
> 
> Shouldn't really be a difference, but maybe there's timing changes that
> slightly influence the outcome.

I rather thought of bus operations still being available for the driver to 
shut down the device (more) cleanly on unbind, but again, that's an intuitive 
guess rather than real knowledge.

Thanks,
Janusz


> > BTW, trybot results confirm that module unload really doesn't help:
> > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html
> 
> Yeah worst case we have an additional module refcount bug and then module
> unload will make things worse. I can't come up with a scenario where
> module unload would help (there's reasons it's a developer-only thing,
> it's really hard to get right).
> 
> Cheers, Daniel
> 
> 
> > 
> > Thanks,
> > Janusz
> > 
> > 
> > > -Daniel
> > > 
> > > > 
> > > > Thanks,
> > > > Janusz
> > > > 
> > > > 
> > > > > -Daniel
> > > > > 
> > > > > > ---
> > > > > >  tests/Makefile.sources  |   1 +
> > > > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > > > >  tests/meson.build       |   1 +
> > > > > >  3 files changed, 410 insertions(+)
> > > > > >  create mode 100644 tests/core_hot_reload.c
> > > > > > 
> > > > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > > > index 7f921f6c..452d8ed7 100644
> > > > > > --- a/tests/Makefile.sources
> > > > > > +++ b/tests/Makefile.sources
> > > > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > > > >  	core_getclient \
> > > > > >  	core_getstats \
> > > > > >  	core_getversion \
> > > > > > +	core_hot_reload \
> > > > > >  	core_setmaster_vs_auth \
> > > > > >  	debugfs_test \
> > > > > >  	drm_import_export \
> > > > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > > > new file mode 100644
> > > > > > index 00000000..6673f55c
> > > > > > --- /dev/null
> > > > > > +++ b/tests/core_hot_reload.c
> > > > > > @@ -0,0 +1,408 @@
> > > > > > +/*
> > > > > > + * Copyright © 2019 Intel Corporation
> > > > > > + *
> > > > > > + * Permission is hereby granted, free of charge, to any person obtaining 
> > > > a
> > > > > > + * copy of this software and associated documentation files (the 
> > > > "Software"),
> > > > > > + * to deal in the Software without restriction, including without 
> > > > limitation
> > > > > > + * the rights to use, copy, modify, merge, publish, distribute, 
> > > > sublicense,
> > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > + *
> > > > > > + * The above copyright notice and this permission notice (including the 
> > > > next
> > > > > > + * paragraph) shall be included in all copies or substantial portions of 
> > > > the
> > > > > > + * Software.
> > > > > > + *
> > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
> > > > EXPRESS OR
> > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
> > > > MERCHANTABILITY,
> > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT 
> > > > SHALL
> > > > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR 
> > > > OTHER
> > > > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 
> > > > ARISING
> > > > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
> > > > DEALINGS
> > > > > > + * IN THE SOFTWARE.
> > > > > > + */
> > > > > > +
> > > > > > +#include "igt.h"
> > > > > > +#include "igt_device.h"
> > > > > > +#include "igt_dummyload.h"
> > > > > > +#include "igt_kmod.h"
> > > > > > +#include "igt_sysfs.h"
> > > > > > +
> > > > > > +#include <getopt.h>
> > > > > > +#include <limits.h>
> > > > > > +#include <string.h>
> > > > > > +#include <unistd.h>
> > > > > > +
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/wait.h>
> > > > > > +
> > > > > > +/**
> > > > > > + * A post-action device recovery function:
> > > > > > + * @priv: a pointer to private data required for device recovery
> > > > > > + *
> > > > > > + * Make the device re-appear
> > > > > > + */
> > > > > > +typedef void (*recover_t)(const void *priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * A test action function:
> > > > > > + * @dir: file descriptor of an open device sysfs directory
> > > > > > + * @module: module name, non-NULL indicates post-action module unload 
> > > > requested
> > > > > > + * @recover: for returning a pointer to a post-action device recovery 
> > > > function
> > > > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > > > + *
> > > > > > + * Make the device disappear
> > > > > > + */
> > > > > > +typedef void (*action_t)(int device, const char *module,
> > > > > > +			 recover_t *recover, const void **priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * A workload completion wait function:
> > > > > > + * @device: open device file descriptor
> > > > > > + * @priv: a pointer to private data required by the wait function
> > > > > > + *
> > > > > > + * Wait for completion of background workload
> > > > > > + */
> > > > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * A workload function:
> > > > > > + * @device: open device file descriptor
> > > > > > + * @arg: a optional string argument passed to the workload function
> > > > > > + * @workload_wait: for returning a pointer to workload completion wait 
> > > > function
> > > > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > > > + *
> > > > > > + * Put some long lasting load on the device
> > > > > > + */
> > > > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > > > +			   workload_wait_t *workload_wait, void 
> > > > **priv);
> > > > > > +
> > > > > > +/**
> > > > > > + * Pairs of test action / device recovery functions
> > > > > > + */
> > > > > > +
> > > > > > +/* Unbind / re-bind */
> > > > > > +
> > > > > > +struct rebind_data {
> > > > > > +	int driver;	/* open file descriptor of driver sysfs directory */
> > > > > > +	char *device;	/* bus specific device address as string */
> > > > > > +};
> > > > > > +
> > > > > > +/* Re-bind the driver to the device */
> > > > > > +static void driver_bind(const void *priv)
> > > > > > +{
> > > > > > +	const struct rebind_data *data = priv;
> > > > > > +
> > > > > > +	igt_set_timeout(60, "Driver re-bind timeout!");
> > > > > > +	igt_sysfs_set(data->driver, "bind", data->device);
> > > > > > +
> > > > > > +	close(data->driver);
> > > > > > +}
> > > > > > +
> > > > > > +/* Unbind the driver from the device */
> > > > > > +static void driver_unbind(int device, const char *module,
> > > > > > +			  recover_t *recover, const void **priv)
> > > > > > +{
> > > > > > +	static char path[PATH_MAX];
> > > > > > +	static struct rebind_data data;
> > > > > > +	int len;
> > > > > > +
> > > > > > +	/* collect information required for driver bind/unbind */
> > > > > > +	data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > > > +	igt_assert(data.driver >= 0);
> > > > > > +
> > > > > > +	len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > > > +	path[len] = '\0';
> > > > > > +	data.device = strrchr(path, '/') + 1;
> > > > > > +
> > > > > > +	/* unbind the driver */
> > > > > > +	igt_set_timeout(60, "Driver unbind timeout!");
> > > > > > +	igt_sysfs_set(data.driver, "unbind", data.device);
> > > > > > +
> > > > > > +	/* pass back info on how to recover the device */
> > > > > > +	if (module) {
> > > > > > +		/* don't try to rebind if module will be unloaded */
> > > > > > +		*recover = NULL;
> > > > > > +	} else {
> > > > > > +		*recover = driver_bind;
> > > > > > +		*priv = &data;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +/* Unplug / re-plug */
> > > > > > +
> > > > > > +/* Re-discover the device by rescanning its bus */
> > > > > > +static void bus_rescan(const void *priv)
> > > > > > +{
> > > > > > +	const int *bus = priv;
> > > > > > +
> > > > > > +	igt_set_timeout(60, "Bus rescan timeout!");
> > > > > > +	igt_sysfs_set(*bus, "rescan", "1");
> > > > > > +
> > > > > > +	close(*bus);
> > > > > > +}
> > > > > > +
> > > > > > +/* Remove (virtually unplug) the device from its bus */
> > > > > > +static void device_unplug(int device, const char *module,
> > > > > > +			  recover_t *recover, const void **priv)
> > > > > > +{
> > > > > > +	static int bus;
> > > > > > +
> > > > > > +	/* collect information required for bus rescan */
> > > > > > +	bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > > > +	igt_assert(bus >= 0);
> > > > > > +
> > > > > > +	/* remove the device */
> > > > > > +	igt_set_timeout(60, "Device unplug timeout!");
> > > > > > +	igt_sysfs_set(device, "device/remove", "1");
> > > > > > +
> > > > > > +	/* pass back info on how to recover the device */
> > > > > > +	*recover = bus_rescan;
> > > > > > +	*priv = &bus;
> > > > > > +}
> > > > > > +
> > > > > > +/* Each test action function must be registered in the following table */
> > > > > > +static const struct {
> > > > > > +	const char *name;	/* unique test action name used in test 
> > > > names */
> > > > > > +	action_t function;	/* test action function pointer */
> > > > > > +} actions[] = {
> > > > > > +	{ "unbind", driver_unbind, },
> > > > > > +	{ "unplug", device_unplug, },
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * Pairs of workload / wait completion functions
> > > > > > + */
> > > > > > +
> > > > > > +/* A workload using igt_spin_run() */
> > > > > > +
> > > > > > +/* Wait for completaion of dummy load */
> > > > > > +static void dummy_wait(int device, void *priv)
> > > > > > +{
> > > > > > +	igt_spin_t *spin = priv;
> > > > > > +
> > > > > > +	/* wait until the spin no longer runs, don't fail on error */
> > > > > > +	if (gem_wait(device, spin->handle, NULL))
> > > > > > +		__gem_set_domain(device, spin->handle,
> > > > > > +				 I915_GEM_DOMAIN_GTT, 
> > > > I915_GEM_DOMAIN_GTT);
> > > > > > +}
> > > > > > +
> > > > > > +/* Run dummy load */
> > > > > > +static void dummy_load(int device, const char *arg,
> > > > > > +		       workload_wait_t *workload_wait, void **priv)
> > > > > > +{
> > > > > > +	igt_spin_t *spin;
> > > > > > +
> > > > > > +	/* submit a job */
> > > > > > +	spin = igt_spin_new(device);
> > > > > > +
> > > > > > +	*workload_wait = dummy_wait;
> > > > > > +	*priv = spin;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Each workload function must be registered in the following table.
> > > > > > + * A function may be registered more than once under different workload 
> > > > names,
> > > > > > + * that makes sense as long as a different argument is specified for each 
> > > > name.
> > > > > > + */
> > > > > > +static const struct {
> > > > > > +	const char *name;	/* unique workload name used in test names 
> > > > */
> > > > > > +	workload_t function;	/* workload function pointer */
> > > > > > +	const char *arg;	/* optional constant string argument */
> > > > > > +} workloads[] = {
> > > > > > +	{ "spin", dummy_load, NULL, },
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * Framework
> > > > > > + */
> > > > > > +
> > > > > > +static void healthcheck(int chipset)
> > > > > > +{
> > > > > > +	int device;
> > > > > > +
> > > > > > +	device = __drm_open_driver(chipset);
> > > > > > +	igt_assert(device >= 0);
> > > > > > +
> > > > > > +	if (chipset == DRIVER_INTEL)
> > > > > > +		gem_test_engine(device, ALL_ENGINES);
> > > > > > +
> > > > > > +	close(device);
> > > > > > +}
> > > > > > +
> > > > > > +static void module_unload(int chipset, const char *module)
> > > > > > +{
> > > > > > +	if (chipset == DRIVER_INTEL)
> > > > > > +		igt_assert(igt_i915_driver_unload() == 
> > > > IGT_EXIT_SUCCESS);
> > > > > > +	else
> > > > > > +		igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > > > +}
> > > > > > +
> > > > > > +static void run_action(int device, action_t action, const char *module,
> > > > > > +		      recover_t *recover, const void **priv)
> > > > > > +{
> > > > > > +	int dir;
> > > > > > +
> > > > > > +	dir = igt_sysfs_open(device);
> > > > > > +	igt_assert(dir >= 0);
> > > > > > +
> > > > > > +	action(dir, module, recover, priv);
> > > > > > +
> > > > > > +	close(dir);
> > > > > > +}
> > > > > > +
> > > > > > +static void wait_helper(int device, void *priv)
> > > > > > +{
> > > > > > +	struct igt_helper_process *proc = priv;
> > > > > > +
> > > > > > +	/* wait until the workload subprocess has completed */
> > > > > > +	igt_ignore_warn(igt_wait_helper(proc));
> > > > > > +}
> > > > > > +
> > > > > > +static void run_workload(int device, workload_t workload, const char 
> > > > *arg,
> > > > > > +			 const char *module, workload_wait_t 
> > > > *workload_wait,
> > > > > > +			 void **priv)
> > > > > > +{
> > > > > > +	if (module) {
> > > > > > +		/* run workload in a subprocess so the module is put on 
> > > > crash */
> > > > > > +		static struct igt_helper_process proc;
> > > > > > +		int wstatus, ret;
> > > > > > +
> > > > > > +		bzero(&proc, sizeof(proc));
> > > > > > +
> > > > > > +		igt_fork_helper(&proc) {
> > > > > > +			/* suppress igt_log messages */
> > > > > > +			igt_log_level = IGT_LOG_NONE;
> > > > > > +
> > > > > > +			/* intercept igt_fail/skip() long jumps */
> > > > > > +			if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0) 
> > > > {
> > > > > > +				workload(device, arg, 
> > > > workload_wait, priv);
> > > > > > +
> > > > > > +				(*workload_wait)(device, 
> > > > *priv);
> > > > > > +
> > > > > > +				/* success if not diverted by 
> > > > igt_fail/skip() */
> > > > > > +				igt_success();
> > > > > > +			}
> > > > > > +
> > > > > > +			/* pass exit code back to the caller */
> > > > > > +			igt_exit();
> > > > > > +		}
> > > > > > +		/* let the background process start doing its job or 
> > > > fail */
> > > > > > +		sleep(2);
> > > > > > +		/* fail or skip on workload premature completion */
> > > > > > +		ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > > > +		if (ret < 0)
> > > > > > +			igt_fail(IGT_EXIT_INVALID);
> > > > > > +		if (ret) {
> > > > > > +			if (!WIFEXITED(wstatus))
> > > > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > > > +			if (WEXITSTATUS(wstatus) == 
> > > > IGT_EXIT_SUCCESS)
> > > > > > +				igt_fail(IGT_EXIT_INVALID);
> > > > > > +			if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > > > +				igt_skip(NULL);
> > > > > > +			igt_fail(WEXITSTATUS(wstatus));
> > > > > > +		}
> > > > > > +
> > > > > > +		/* pass back info on how to wait for helper completion 
> > > > */
> > > > > > +		*workload_wait = wait_helper;
> > > > > > +		*priv = &proc;
> > > > > > +	} else {
> > > > > > +		/* run the requested workload directly */
> > > > > > +		workload(device, arg, workload_wait, priv);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +static void run_subtest(int chipset, int workload, int action,
> > > > > > +			const char *module)
> > > > > > +{
> > > > > > +	workload_wait_t workload_wait;
> > > > > > +	void *workload_priv;
> > > > > > +	recover_t recover;
> > > > > > +	const void *recover_priv;
> > > > > > +	int device;
> > > > > > +
> > > > > > +	igt_subtest_f("%s-%s%s", workloads[workload].name, 
> > > > actions[action].name,
> > > > > > +		      module ? "-unload" : "") {
> > > > > > +		device = __drm_open_driver(chipset);
> > > > > > +		igt_assert(device >= 0);
> > > > > > +
> > > > > > +		/* spawn the requested workload */
> > > > > > +		igt_debug("spawning background workload\n");
> > > > > > +		run_workload(device, workloads[workload].function,
> > > > > > +			     workloads[workload].arg, module,
> > > > > > +			     &workload_wait, &workload_priv);
> > > > > > +
> > > > > > +		/* run the requested test action */
> > > > > > +		igt_debug("running test action\n");
> > > > > > +		run_action(device, actions[action].function, module,
> > > > > > +			   &recover, &recover_priv);
> > > > > > +
> > > > > > +		if (workload_wait) {
> > > > > > +			igt_debug("waiting for workload 
> > > > completion\n");
> > > > > > +			workload_wait(device, workload_priv);
> > > > > > +		}
> > > > > > +
> > > > > > +		close(device);
> > > > > > +
> > > > > > +		if (module) {
> > > > > > +			igt_debug("unloading %s\n", module);
> > > > > > +			module_unload(chipset, module);
> > > > > > +		}
> > > > > > +
> > > > > > +		if (recover) {
> > > > > > +			igt_debug("recovering device\n");
> > > > > > +			recover(recover_priv);
> > > > > > +			igt_reset_timeout();
> > > > > > +		}
> > > > > > +
> > > > > > +		igt_debug("running healthcheck\n");
> > > > > > +		healthcheck(chipset);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > +igt_main {
> > > > > > +	int device, chipset;
> > > > > > +	char *module;
> > > > > > +	int i, j;
> > > > > > +
> > > > > > +	igt_fixture {
> > > > > > +		char path[PATH_MAX];
> > > > > > +		int dir, len;
> > > > > > +
> > > > > > +		/**
> > > > > > +		 * Since some subtests depend on successful unload of a 
> > > > driver
> > > > > > +		 * module, don't use drm_open_driver() as it keeps a 
> > > > device file
> > > > > > +		 * descriptor open for exit handler use and that 
> > > > effectively
> > > > > > +		 * prevents the module from being unloaded.
> > > > > > +		 */
> > > > > > +		device = __drm_open_driver(DRIVER_ANY);
> > > > > > +		igt_assert(device >= 0);
> > > > > > +
> > > > > > +		if (is_i915_device(device)) {
> > > > > > +			chipset = DRIVER_INTEL;
> > > > > > +			module = strdup("i915");
> > > > > > +		} else {
> > > > > > +			chipset = DRIVER_ANY;
> > > > > > +
> > > > > > +			/* Capture module name to be unloaded */
> > > > > > +			dir = igt_sysfs_open(device);
> > > > > > +			len = readlinkat(dir, "device/driver/
> > > > module", path,
> > > > > > +					 sizeof(path) - 1);
> > > > > > +			close(dir);
> > > > > > +			path[len] = '\0';
> > > > > > +			module = strdup(strrchr(path, '/') + 1);
> > > > > > +		}
> > > > > > +		close(device);
> > > > > > +
> > > > > > +		igt_info("Running the test on driver \"%s\", chipset 
> > > > mask %#0x\n",
> > > > > > +			 module, chipset);
> > > > > > +	}
> > > > > > +
> > > > > > +	for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > > > +		for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
> > > > +) {
> > > > > > +			/* with module unload */
> > > > > > +			run_subtest(chipset, i, j, module);
> > > > > > +			/* without module unload */
> > > > > > +			run_subtest(chipset, i, j, NULL);
> > > > > > +		}
> > > > > > +	}
> > > > > > +}
> > > > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > > > index 711979b4..0d418035 100644
> > > > > > --- a/tests/meson.build
> > > > > > +++ b/tests/meson.build
> > > > > > @@ -3,6 +3,7 @@ test_progs = [
> > > > > >  	'core_getclient',
> > > > > >  	'core_getstats',
> > > > > >  	'core_getversion',
> > > > > > +	'core_hot_reload',
> > > > > >  	'core_setmaster_vs_auth',
> > > > > >  	'debugfs_test',
> > > > > >  	'drm_import_export',
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> > _______________________________________________
> > igt-dev mailing list
> > igt-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/igt-dev
> 
> 




_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-07 10:44             ` Janusz Krzysztofik
@ 2019-05-07 13:32               ` Daniel Vetter
  2019-05-13  6:45                 ` Janusz Krzysztofik
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2019-05-07 13:32 UTC (permalink / raw)
  To: Janusz Krzysztofik; +Cc: IGT development, Petri Latvala

On Tue, May 7, 2019 at 12:44 PM Janusz Krzysztofik
<janusz.krzysztofik@linux.intel.com> wrote:
>
> On Tuesday, May 7, 2019 11:14:20 AM CEST Daniel Vetter wrote:
> > On Tue, May 07, 2019 at 08:24:30AM +0200, Janusz Krzysztofik wrote:
> > > On Monday, May 6, 2019 11:21:58 AM CEST Daniel Vetter wrote:
> > > > On Mon, May 06, 2019 at 10:44:11AM +0200, Janusz Krzysztofik wrote:
> > > > > Hi Daniel,
> > > > >
> > > > > On Tuesday, April 30, 2019 5:05:48 PM CEST Daniel Vetter wrote:
> > > > > > On Tue, Apr 30, 2019 at 01:29:15PM +0200, Janusz Krzysztofik wrote:
> > > > > > > From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > > > >
> > > > > > > Put some workload on a device, then try to either remove (unplug) the
> > > > > > > device from its bus, or unbind the device's driver from it, possibly
> > > > > > > followed by module unload, depending on which specific subtest has been
> > > > > > > selected.  If succeeded, rescan the device's bus if needed and perform
> > > > > > > health checks on the device with the driver possibly loaded back.
> > > > > > >
> > > > > > > If module unload is requested, the workload is run in a sub-process,
> > > > > > > not directly from the test, as it is expected to crash while still
> > > > > > > keeping the device open for as long as its process has not exited.
> > > > > > >
> > > > > > > The driver hot unbind / device hot unplug operation is expected to
> > > > > > > succeed and the background workload sub-process to crash in a
> > > > > > > reasonable time, however long timeouts are used to let kernel level
> > > > > > > timeouts pop up first if hit by a bug.
> > > > > > >
> > > > > > > The driver is ready for extending it with an arbitrary workload
> > > > > > > functions as needed.  For now, a workload based on igt_dummyload is
> > > > > > > implemented, hence subtests work only on i915 driver and are skipped on
> > > > > > > other hardware, unless they provide their implementation of
> > > > > > > igt_spin_new() and friends, or other workloads are implemented.
> > > > > > >
> > > > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
> > > > > >
> > > > > > High level comments and apologies that I didn't look at v2-v7 in between.
> > >
> > > v1-v3 were submitted internally, so you actually joined and commented first on
> > > first public submission which I should have had marked as v4 (I hadn't,
> > > sorry).
> > >
> > > > > >
> > > > > > This all seems extremely complex for a simple batch spinner subtest ...
> > > > >
> > > > > My initial intention was to build a simple hot unplug/unbind only test. I
> > > > > proposed to use an arbitrary external command as a workload.  Then, on
> > > > > Antonio's advice, I switched to the spinner based internal workload and I
> > > > > agree that was a good move.  Then, Petri and you, Daniel, requested to extend
> > > > > the scope of the test with device recovery and health checking.  Also, a few
> > > > > people, including you, Daniel, requested availability of more workload type
> > > > > options.  As a result, I've decided to build a *framework* for testing driver
> > > > > unbind + rebind / device unplug + bus rescan behavior under different workload
> > > > > types, easily extendable with more workload options as needed, with one
> > > > > example workload type - dummy load or spin batch - initially implemented.
> > > > > That was at least my intention for v6-8.  I wouldn't call it a simple batch
> > > > > spinner subtest any longer.
> > > >
> > > > Maybe my review got wrong, but I just meant that there's more tests to
> > > > write here.
> > >
> > > That was clear for me, however I probably misunderstood your intentions in
> > > regard to device/driver recovery after successful unplug/unbind.
> > >
> > > > Generally I think having the framework/generic solution before
> > > > you have all the applications is the wrong way to build something. Usually
> > > > it results in something which is generic in all the wrong ways, but not in
> > > > the ones you will actually need. So complexity with no gain. Better to
> > > > - add a few tests first with copypasting/minimal changes
> > > > - refactor helpers once you see the real patterns
> > > > - no framework, that's the midlayer mistake, see
> > > >   https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html and
> > > >   all the articles linked from there.
> > >
> > > OK, thanks for your recommendations and the references.
> > >
> > > > > > do we really need all that complexity with 2nd process
> > > > >
> > > > > If we drop module unload option then no, we don't need 2nd process.
> > > >
> > > > Why does module unload require a 2nd process? We don't need a 2nd process
> > > > in our other module unload tests either.
> > >
> > > That's not longer the problem as we're going to drop the module unload step,
> > > but just to provide you with an explanation of my approach:
> > > In case of the spin workload, references are held after the workload crashes
> > > and it's not possible to unload the module unless we put them.   Since those
> > > references are internal to IGT libraries and not exposed to a user, putting
> > > them is only possible with functions provided by IGT.  Those functions are
> > > full of checks affecting subtest results and using them to clean up resources
> > > related to a no longer existing device would result in a subtest failure or
> > > skip at least.  The most simple way to get rid of those issues is to enclose
> > > those references in a subprocess and wait for their automatic release on its
> > > completion.
> >
> >
> > Hm which references? Closing the file descriptors is all we should need to
> > be doing to make the module unloadable.
>
> That's exactly what I meant.  Unfortunately some of those file descriptors are
> private to IGT lib.
>
> > I think an explicit helper
> > function to do that (exported from core lib) is much better than killing a
> > process (or waiting for that process to die). It's more explicit code at
> > least (and that's generally better for testcases).
>
> Do you think it's worth of effort to extend core lib with less assertive
> variants of existing functions, useful specifically for the hotunplug test and
> maybe no others?  I have identified quite a few such functions, however with
> the approach of not making to much cleanups before recovery you suggest I'm
> not sure if still needed, maybe only for a subtest with module unload.

Yes, we have lots of these already. The usual approach is that the
normal function has all the checks, and the one with a __ prefix has a
return value indicating whether things worked or not, leaving checking
to the caller. Especially with low-level ioctls wrappers this is a
very common pattern.

> > > > > > and watchers
> > > > >
> > > > > That was primarily needed for successful module unload.  If we drop that
> > > > > option and you think driver rebind / bus rescan operations can be performed
> > > > > blindly, without checking for completion of background workload, then I can
> > > > > drop the watchers.
> > > >
> > > > Well we _have_ to do unbinds without checking the background workload has
> > > > completed. That's the entire point of testing hotunplug.
> > >
> > > I agree, and the test performs all unbinds that way, i.e., without checking
> > > the background workload has completed.  Waiting for background workload
> > > completion applies only to what I'm considering a device recovery phase, and
> > > not to the "main" unbind/unplug test phase in any way.
> >
> > Ah ok. At least for rescan I think would also make sense to not wait,
> > that's another interesting (and even more evil) testcase. This would check
> > for issues around assigning device node minor numbers. We'd only need one
> > such case, and all it needs to do really is keep the drm device fd open.
>
> OK.
>
> > > > It's also why
> > > > there's lots of work to do here, because the kernel is totally not ready
> > > > for this.
> > > >
> > > > First stopping everything and then unloading isn't an interesting test,
> > >
> > > Since its introduction, the module unload step was intended as a part of a
> > > post-subtest device recovery phase, not the subtest merit.  I added that step
> > > because I thought that would be the most reliable way of satisfying the CI
> > > requirement on restoring the device to the state ready for next tests without
> > > reboot or real device power-on reset on real hardware bus replug.
> >
> > Yes I understand that. But what are you trying to recover from with a
> > module reload? Just code sharing as you explain below, or other reasons?
>
> Nothing specific.  Oriented on successful recovery of the device so it's ready
> for next tests without reboot, I just intuitively tried to avoid rediscovering
> it, possibly in a completely unpredictable state after the fake unplug, and
> that intuition, probably mixed with my ignorance, suggested me to use module
> unload before bus rescan.
>
> > > > that's more or less exactly what our various module unload tests are
> > > > doing already.
> > >
> > > Yes, and in v5-v7 I was even using the existing i915_module_load test as an
> > > external helper command performing device recovery and healthcheck phases in
> > > order to avoid reimplementing its code here.
> > >
> > > > > > and a bunch of callbacks and everything, just do to a hotremove testcase?
> > > > >
> > > > > I can still drop the framework and switch back to the initial simple structure
> > > > > with one or two fixed subtests if you don't like my structural approach.
> > > >
> > > > See above for why, I think that will result in better code in the end.
> > >
> > > OK.
> > >
> > > > > > Very first patch looked much more reasonable, aside from that it broke CI
> > > > > > since it didn't rebind the driver.
> > > > >
> > > > > Sorry, my understanding of your and Petri's comments was a bit different, I
> > > > > thought that by more than best effort you meant doing everything possible to
> > > > > restore the device to be ready for next test without reboot, and module unload
> > > > > and reload seemed the most reliable option to me.  Now I can see that there
> > > > > were probably two different requirements.  You were considering the test
> > > > > incomplete because it was performing only the unbind/unplug part and not
> > > > > rebind/rescan, while Petri was probably interested mostly in the device being
> > > > > ready for next tests without reboot, no matter which way.
> > > >
> > > > Well it's the same request, and rebind/rescan /should/ result in a working
> > > > device again. If not, then I guess we also have a bug on our hotreplug
> > > > code. Which again is worth testing for.
> > > >
> > > > > > We can always add complexity later on
> > > > > > once we have dma-buf/dma-fence/kms/whatever else substests here.
> > > > >
> > > > > OK, as you wish.
> > > > >
> > > > > > Also, I think we should have at least one hotremove-only-nothing-special
> > > > > > subtest here, i.e. without even the busy batch.
> > > > >
> > > > > That seems trivial to adjust the framework so it accepted NULL workload, if
> > > > > the framework survived. Anyway, I'll do that.  Should I put it in a separate
> > > > > NULL workload subtest function to be called from igt_main?  Or add it to the
> > > > > spin workload subtest function specifically as an option?
> > > >
> > > > Separate test as the first subtest.
> > >
> > > OK.
> > >
> > > > Maybe even include the "shut
> > > > everything off first" logic from module unload,
> > >
> > > Do you mean i915_module_load.c?
> >
> > On 2nd thought, for a hotunplug we shouldn't need any of that. Those "shut
> > everything off first" steps are just to lower the module use count, so
> > we're allowed to unload the module. The unplug code we run should take
> > care of all that already for us for a (fake) hotunplug. So module reload
> > should just work. But then I still don't understand where you see the
> > benefit in unloading/reloading the module.
>
> Not any longer ;-)
>
> > > > to have the most baseline test possible.
> > > >
> > > > > > I'm also not sure why we also put module unload tests in there.
> > > > >
> > > > > As I tried to explain above, I introduced module unload in order to satisfy
> > > > > the CI requirement on the device being ready for next test without reboot as
> > > > > much as possible.
> > > >
> > > > Hm, but why? What does module reload help in this regard that a rebind
> > > > can't do? Aside from testing module reload, which is a developer feature
> > > > and already tested elsewhere.
> > >
> > > As I said, I decided to use module unload as I thought it would be the most
> > > reliable way of device recovery from simulated unplug in case no real power-on
> > > reset is performed.  I didn't insist on keeping it there at all, I only tried
> > > to explain why I did that.  As that can't help in any way to recover the
> > > device so it is ready for next tests as CI requested then I'll be happy to
> > > remove it and stick to pure driver rebind / bus rescan operations.
> > >
> > > > I'm also not seeing much interactions between hotunplug and module unload.
> > > >
> > > > The one interesting testcase I see is trying to unload the module after we
> > > > hotunplugged, while the driver is still in use somewhere (open drm fd,
> > > > open dma-buf fd, open dma-fence fd). That should result in a failure, and
> > > > it's useful to validate that the kernel is handling the module refcounting
> > > > correctly in all these cases. But that's a specific negative testcase (and
> > > > actually being able to unload would be a failure and likely result in a
> > > > kernel oops), I'm not seeing the benefit of reloading the module.
> > >
> > > That's perfectly clear for me, the optional module unload step will not be
> > > there on next iteration.
> >
> > That might be overshooting slightly. There is at least one interesting
> > hotunplug testcase involving module unload. But it's more a special case,
> > not something we need to do for all subtests.
>
> OK, I can try to implement it if I'm sure what you think it should do exactly.

I forgot to explain what it was :-) Or at least I can't find it
anymore in this huge thread.

- hotunplug device
- keep one userspace reference alive (with open drm device fd or similar)
- try to unload, check that this fails
- drop the reference (by closing the fd)
- make sure unload/reload suceed now

Very basic, but makes sure that dangling references keep the module
locked - if that's not the case then userspace could issue an IOCTL,
which would call into that unloaded module and result in a kernel
Oops.

There's more varianst we could do here, but since module unload is a
developer feature I think there's better places to invest all that
effort. But one basic sanity check like this can't hurt.

> > > > > > Compared
> > > > > > to hotunplug of a discrete gfx card (external one over usb or thunderbolt
> > > > > > or whatever), which is something users can do, module unload is explicitly a
> > > > > > developer only feature.
> > > > >
> > > > > My approach was to be able to test driver behavior under any hot unload
> > > > > operation available to a user, no matter if developer oriented or not, so we
> > > > > can make the driver resistant to users performing potentially dangerous hot
> > > > > unbind/unplug operations available to them, intentionally or not.
> > > >
> > > > Yes I agree with that, we need to test hotunplug.
> > > >
> > > > btw the real fun isn't the unbind in sysfs, but physically unplugging a
> > > > pci-e or thunderbolt/usb-c gfx card. Imo that's why we need to have this,
> > > > and the best way to test that hotunplug is through the sysfs unbind
> > > > support (it's not exactly the same since this way we'll never see failing
> > > > pci transactions, which are an entirely different kind of fun).
> > >
> > > I fully agree, however please note that what I'm calling device hot unplug is
> > > probably still an interesting sysfs option aside driver hot unbind.
> > >
> > > > > > We do not expect module unload to work under all
> > > > > > possible conditions (it doesn't).
> > > > >
> > > > > Do you think that driver rebind operation has more chances to succeed,
> > > > > especially on a device on which a bus unplug operation was not actually
> > > > > performed but only simulated via sysfs, on a device which then has been left
> > > > > in an unpredictable state and hasn't undergone a hardware power-on reset on
> > > > > physical bus re-plug?
> > > >
> > > > There's definitely potential for bugs, but I don't see how module reload
> > > > helps. Module reload is essentially:
> > > >
> > > > - unbind devices
> > > > - unload module
> > > > - reload module
> > > > - rebind all devices
> > > >
> > > > The only additional magic that module unload can paper over is that it's
> > > > disallowed while anyone is still using any devices (assuming the module
> > > > refcount code is correct). That's not the case for unbind/hotunplug. But
> > > > that's it, there's no additional magic code being run when you unload the
> > > > module. Hence why I don't understand why you want to do that.
> > >
> > > Not any longer :-)
> > >
> > > > > > I'd drop that part and focus completely
> > > > > > on the hotremove/unbind testcase here.
> > > > >
> > > > > Driver unbind / device unplug via sysfs can also be considered developer only
> > > > > features. Do you think we should drop driver unbind option, leaving only
> > > > > device unplug via sysfs for which we may have no good non-developer
> > > > > alternative?
> > > >
> > > > Yanking the cable for e.g. usb-c/thunderbolt external gpu is very much a
> > > > user action. That's why we care.
> > > >
> > > > We didn't care for unbind (I wontfix closed all the bugs myself) while
> > > > intel only created built-in gpus because it's indeed fairly pointless to
> > > > unbind these.
> > > >
> > > > Other bit I don't quite get: What's the difference between unbind and
> > > > unplug?
> > >
> > > I'm not sure what information you are missing.  What the test is doing is:
> > >
> > > driver unbind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/unbind
> > > vs.
> > > virtual device unplug: echo 1 >/sys/bus/<bus>/devices/<device>/remove
> >
> > I was missing the above I guess.
> >
> > So looking at kernel code the difference is that when we unbind the entire
> > driver, we loop over all currrently bound devices and do the same as the
> > remove sysfs file. So kinda redundant, I'd drop the driver unbind
> > testcase.
>
> OK.
>
> > Also, are you digging around in the kernel already and trying to
> > understand what's going on and how it all ties together? And have you
> > started to look at the bugs this uncovers in the kernel, or who's supposed
> > to work on that side of this effort?
>
> I'm trying to.  A first step was:
> https://github.com/freedesktop/drm-intel/commit/d69990e0c399e4f7f9b50505d3285e5de991148a

Yup, that's about 1/n patches we need.

> I've already tested two other patches:
> https://patchwork.freedesktop.org/series/60053/
> https://patchwork.freedesktop.org/series/60051/

Usually we solve this stuff by reference counting structures, to make
sure they don't disappear before the last user is finished cleaning
stuff up. Forced cleanup tends to be really hard to understand,
validate and often is impossible to implement because of locking
inversions.

> Now I'm trying to resolve the GEM_BUG_ON(vma->obj != obj) issue which popped
> up with both above patches applied. I'm not yet sure to what extent it has
> been simply uncovered vs. just triggered by my second patch.

No idea, could very well just be that you now blow up later with the
next oops. Hotunplug is so untested I'd be surprised if it's not full
of bugs.

Another option would be to first have a more minimal hotunplug test
without any busy batches pending, and make sure that works reliable
(i.e. no buffer, no context, no nothing, no open fd, absolutely
nothing really allocated from userspace). Then extend from there
step-by-step, e.g. first just open a drm device fd, then allocate a
buffer, then mmap, then maybe try to write to that buffer after
unplug, try to do an ioctl after unplug (go through all of them).

Only once you've gone through all the more basic things go to the much
more evil stuff like having a batch in-flight, or a kms request
in-flight. That way it should be easier to make forward progress on
these bugs without risking the world. Plus you can start merging the
igt side already, to make sure no one regresses anymore.

> > Bonus points if we unbind the same device as we'd pick for the drm fd
> > (there's some selection logic, and if you go through /sys/class/drm you
> > should find the right device). This is relevant for discrete/multi-gpu
> > systems.
>
> I think I took that into account seriously enough while planing the subtest
> actions.  Sysfs operations are performed on nodes resolved from the device
> file descriptor.

Ah great, tbh I didn't check that in your patch.

> > > Panic call traces look a bit different, you may want to compare the following two:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb1/igt@core_hot_reload@spin-unbind.html
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-iclb4/igt@core_hot_reload@spin-unplug.html
> >
> > Hm I think we also need a hotunplug testcase that does absolutely nothing
> > first, i.e. no spin batches, no open drm files, nothing else.
>
> OK.
>
> > > rebind: echo "<device bus' address>" >/sys/bus/<bus>/drivers/<driver>/bind
> > > vs.
> > > replug: echo 1 >/sys/bus/<bus>/rescan
> > >
> > > With no panics accompanying driver unbind / device unplug under active spin
> > > workload on older hardware, the recovery phase is however still giving a
> > > different result for each of those two methods:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw1/igt@core_hot_reload@spin-unbind.html
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw5/igt@core_hot_reload@spin-unplug.html
> >
> > Shouldn't really be a difference, but maybe there's timing changes that
> > slightly influence the outcome.
>
> I rather thought of bus operations still being available for the driver to
> shut down the device (more) cleanly on unbind, but again, that's an intuitive
> guess rather than real knowledge.

All driver unbind does is also remove the driver from the linux device
model. All that does is:
- force unbind on all devices
- prevent binding to new devices.

Not really providing anything beyond unbind/rebind to a specific
device (or what we test already with module reload). The actual
unbinding from a specific device is exactly the same logic (plus/minus
bugs in driver core sysfs files, I just recently fixed one of these,
but that's not our primary concern with validating i915 code).

Cheers, Daniel
>
> Thanks,
> Janusz
>
>
> > > BTW, trybot results confirm that module unload really doesn't help:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/TrybotIGT_14/shard-hsw8/igt@core_hot_reload@spin-unplug-unload.html
> >
> > Yeah worst case we have an additional module refcount bug and then module
> > unload will make things worse. I can't come up with a scenario where
> > module unload would help (there's reasons it's a developer-only thing,
> > it's really hard to get right).
> >
> > Cheers, Daniel
> >
> >
> > >
> > > Thanks,
> > > Janusz
> > >
> > >
> > > > -Daniel
> > > >
> > > > >
> > > > > Thanks,
> > > > > Janusz
> > > > >
> > > > >
> > > > > > -Daniel
> > > > > >
> > > > > > > ---
> > > > > > >  tests/Makefile.sources  |   1 +
> > > > > > >  tests/core_hot_reload.c | 408 ++++++++++++++++++++++++++++++++++++++++
> > > > > > >  tests/meson.build       |   1 +
> > > > > > >  3 files changed, 410 insertions(+)
> > > > > > >  create mode 100644 tests/core_hot_reload.c
> > > > > > >
> > > > > > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > > > > > index 7f921f6c..452d8ed7 100644
> > > > > > > --- a/tests/Makefile.sources
> > > > > > > +++ b/tests/Makefile.sources
> > > > > > > @@ -16,6 +16,7 @@ TESTS_progs = \
> > > > > > >     core_getclient \
> > > > > > >     core_getstats \
> > > > > > >     core_getversion \
> > > > > > > +   core_hot_reload \
> > > > > > >     core_setmaster_vs_auth \
> > > > > > >     debugfs_test \
> > > > > > >     drm_import_export \
> > > > > > > diff --git a/tests/core_hot_reload.c b/tests/core_hot_reload.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000..6673f55c
> > > > > > > --- /dev/null
> > > > > > > +++ b/tests/core_hot_reload.c
> > > > > > > @@ -0,0 +1,408 @@
> > > > > > > +/*
> > > > > > > + * Copyright © 2019 Intel Corporation
> > > > > > > + *
> > > > > > > + * Permission is hereby granted, free of charge, to any person obtaining
> > > > > a
> > > > > > > + * copy of this software and associated documentation files (the
> > > > > "Software"),
> > > > > > > + * to deal in the Software without restriction, including without
> > > > > limitation
> > > > > > > + * the rights to use, copy, modify, merge, publish, distribute,
> > > > > sublicense,
> > > > > > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > > > > > + * Software is furnished to do so, subject to the following conditions:
> > > > > > > + *
> > > > > > > + * The above copyright notice and this permission notice (including the
> > > > > next
> > > > > > > + * paragraph) shall be included in all copies or substantial portions of
> > > > > the
> > > > > > > + * Software.
> > > > > > > + *
> > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > > > > EXPRESS OR
> > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > > > > MERCHANTABILITY,
> > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT
> > > > > SHALL
> > > > > > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> > > > > OTHER
> > > > > > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > > > > ARISING
> > > > > > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> > > > > DEALINGS
> > > > > > > + * IN THE SOFTWARE.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include "igt.h"
> > > > > > > +#include "igt_device.h"
> > > > > > > +#include "igt_dummyload.h"
> > > > > > > +#include "igt_kmod.h"
> > > > > > > +#include "igt_sysfs.h"
> > > > > > > +
> > > > > > > +#include <getopt.h>
> > > > > > > +#include <limits.h>
> > > > > > > +#include <string.h>
> > > > > > > +#include <unistd.h>
> > > > > > > +
> > > > > > > +#include <sys/types.h>
> > > > > > > +#include <sys/wait.h>
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A post-action device recovery function:
> > > > > > > + * @priv: a pointer to private data required for device recovery
> > > > > > > + *
> > > > > > > + * Make the device re-appear
> > > > > > > + */
> > > > > > > +typedef void (*recover_t)(const void *priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A test action function:
> > > > > > > + * @dir: file descriptor of an open device sysfs directory
> > > > > > > + * @module: module name, non-NULL indicates post-action module unload
> > > > > requested
> > > > > > > + * @recover: for returning a pointer to a post-action device recovery
> > > > > function
> > > > > > > + * @priv: for returning a pointer to data to be passed to @recover
> > > > > > > + *
> > > > > > > + * Make the device disappear
> > > > > > > + */
> > > > > > > +typedef void (*action_t)(int device, const char *module,
> > > > > > > +                    recover_t *recover, const void **priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A workload completion wait function:
> > > > > > > + * @device: open device file descriptor
> > > > > > > + * @priv: a pointer to private data required by the wait function
> > > > > > > + *
> > > > > > > + * Wait for completion of background workload
> > > > > > > + */
> > > > > > > +typedef void (*workload_wait_t)(int device, void *priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * A workload function:
> > > > > > > + * @device: open device file descriptor
> > > > > > > + * @arg: a optional string argument passed to the workload function
> > > > > > > + * @workload_wait: for returning a pointer to workload completion wait
> > > > > function
> > > > > > > + * @priv: for returning a pointer to data to be passed to @workload_wait
> > > > > > > + *
> > > > > > > + * Put some long lasting load on the device
> > > > > > > + */
> > > > > > > +typedef void (*workload_t)(int device, const char *arg,
> > > > > > > +                      workload_wait_t *workload_wait, void
> > > > > **priv);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Pairs of test action / device recovery functions
> > > > > > > + */
> > > > > > > +
> > > > > > > +/* Unbind / re-bind */
> > > > > > > +
> > > > > > > +struct rebind_data {
> > > > > > > +   int driver;     /* open file descriptor of driver sysfs directory */
> > > > > > > +   char *device;   /* bus specific device address as string */
> > > > > > > +};
> > > > > > > +
> > > > > > > +/* Re-bind the driver to the device */
> > > > > > > +static void driver_bind(const void *priv)
> > > > > > > +{
> > > > > > > +   const struct rebind_data *data = priv;
> > > > > > > +
> > > > > > > +   igt_set_timeout(60, "Driver re-bind timeout!");
> > > > > > > +   igt_sysfs_set(data->driver, "bind", data->device);
> > > > > > > +
> > > > > > > +   close(data->driver);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Unbind the driver from the device */
> > > > > > > +static void driver_unbind(int device, const char *module,
> > > > > > > +                     recover_t *recover, const void **priv)
> > > > > > > +{
> > > > > > > +   static char path[PATH_MAX];
> > > > > > > +   static struct rebind_data data;
> > > > > > > +   int len;
> > > > > > > +
> > > > > > > +   /* collect information required for driver bind/unbind */
> > > > > > > +   data.driver = openat(device, "device/driver", O_DIRECTORY);
> > > > > > > +   igt_assert(data.driver >= 0);
> > > > > > > +
> > > > > > > +   len = readlinkat(device, "device", path, sizeof(path) - 1);
> > > > > > > +   path[len] = '\0';
> > > > > > > +   data.device = strrchr(path, '/') + 1;
> > > > > > > +
> > > > > > > +   /* unbind the driver */
> > > > > > > +   igt_set_timeout(60, "Driver unbind timeout!");
> > > > > > > +   igt_sysfs_set(data.driver, "unbind", data.device);
> > > > > > > +
> > > > > > > +   /* pass back info on how to recover the device */
> > > > > > > +   if (module) {
> > > > > > > +           /* don't try to rebind if module will be unloaded */
> > > > > > > +           *recover = NULL;
> > > > > > > +   } else {
> > > > > > > +           *recover = driver_bind;
> > > > > > > +           *priv = &data;
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Unplug / re-plug */
> > > > > > > +
> > > > > > > +/* Re-discover the device by rescanning its bus */
> > > > > > > +static void bus_rescan(const void *priv)
> > > > > > > +{
> > > > > > > +   const int *bus = priv;
> > > > > > > +
> > > > > > > +   igt_set_timeout(60, "Bus rescan timeout!");
> > > > > > > +   igt_sysfs_set(*bus, "rescan", "1");
> > > > > > > +
> > > > > > > +   close(*bus);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Remove (virtually unplug) the device from its bus */
> > > > > > > +static void device_unplug(int device, const char *module,
> > > > > > > +                     recover_t *recover, const void **priv)
> > > > > > > +{
> > > > > > > +   static int bus;
> > > > > > > +
> > > > > > > +   /* collect information required for bus rescan */
> > > > > > > +   bus = openat(device, "device/subsystem", O_DIRECTORY);
> > > > > > > +   igt_assert(bus >= 0);
> > > > > > > +
> > > > > > > +   /* remove the device */
> > > > > > > +   igt_set_timeout(60, "Device unplug timeout!");
> > > > > > > +   igt_sysfs_set(device, "device/remove", "1");
> > > > > > > +
> > > > > > > +   /* pass back info on how to recover the device */
> > > > > > > +   *recover = bus_rescan;
> > > > > > > +   *priv = &bus;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Each test action function must be registered in the following table */
> > > > > > > +static const struct {
> > > > > > > +   const char *name;       /* unique test action name used in test
> > > > > names */
> > > > > > > +   action_t function;      /* test action function pointer */
> > > > > > > +} actions[] = {
> > > > > > > +   { "unbind", driver_unbind, },
> > > > > > > +   { "unplug", device_unplug, },
> > > > > > > +};
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Pairs of workload / wait completion functions
> > > > > > > + */
> > > > > > > +
> > > > > > > +/* A workload using igt_spin_run() */
> > > > > > > +
> > > > > > > +/* Wait for completaion of dummy load */
> > > > > > > +static void dummy_wait(int device, void *priv)
> > > > > > > +{
> > > > > > > +   igt_spin_t *spin = priv;
> > > > > > > +
> > > > > > > +   /* wait until the spin no longer runs, don't fail on error */
> > > > > > > +   if (gem_wait(device, spin->handle, NULL))
> > > > > > > +           __gem_set_domain(device, spin->handle,
> > > > > > > +                            I915_GEM_DOMAIN_GTT,
> > > > > I915_GEM_DOMAIN_GTT);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Run dummy load */
> > > > > > > +static void dummy_load(int device, const char *arg,
> > > > > > > +                  workload_wait_t *workload_wait, void **priv)
> > > > > > > +{
> > > > > > > +   igt_spin_t *spin;
> > > > > > > +
> > > > > > > +   /* submit a job */
> > > > > > > +   spin = igt_spin_new(device);
> > > > > > > +
> > > > > > > +   *workload_wait = dummy_wait;
> > > > > > > +   *priv = spin;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Each workload function must be registered in the following table.
> > > > > > > + * A function may be registered more than once under different workload
> > > > > names,
> > > > > > > + * that makes sense as long as a different argument is specified for each
> > > > > name.
> > > > > > > + */
> > > > > > > +static const struct {
> > > > > > > +   const char *name;       /* unique workload name used in test names
> > > > > */
> > > > > > > +   workload_t function;    /* workload function pointer */
> > > > > > > +   const char *arg;        /* optional constant string argument */
> > > > > > > +} workloads[] = {
> > > > > > > +   { "spin", dummy_load, NULL, },
> > > > > > > +};
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Framework
> > > > > > > + */
> > > > > > > +
> > > > > > > +static void healthcheck(int chipset)
> > > > > > > +{
> > > > > > > +   int device;
> > > > > > > +
> > > > > > > +   device = __drm_open_driver(chipset);
> > > > > > > +   igt_assert(device >= 0);
> > > > > > > +
> > > > > > > +   if (chipset == DRIVER_INTEL)
> > > > > > > +           gem_test_engine(device, ALL_ENGINES);
> > > > > > > +
> > > > > > > +   close(device);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void module_unload(int chipset, const char *module)
> > > > > > > +{
> > > > > > > +   if (chipset == DRIVER_INTEL)
> > > > > > > +           igt_assert(igt_i915_driver_unload() ==
> > > > > IGT_EXIT_SUCCESS);
> > > > > > > +   else
> > > > > > > +           igt_assert(igt_kmod_unload(module, 0) == 0);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void run_action(int device, action_t action, const char *module,
> > > > > > > +                 recover_t *recover, const void **priv)
> > > > > > > +{
> > > > > > > +   int dir;
> > > > > > > +
> > > > > > > +   dir = igt_sysfs_open(device);
> > > > > > > +   igt_assert(dir >= 0);
> > > > > > > +
> > > > > > > +   action(dir, module, recover, priv);
> > > > > > > +
> > > > > > > +   close(dir);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void wait_helper(int device, void *priv)
> > > > > > > +{
> > > > > > > +   struct igt_helper_process *proc = priv;
> > > > > > > +
> > > > > > > +   /* wait until the workload subprocess has completed */
> > > > > > > +   igt_ignore_warn(igt_wait_helper(proc));
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void run_workload(int device, workload_t workload, const char
> > > > > *arg,
> > > > > > > +                    const char *module, workload_wait_t
> > > > > *workload_wait,
> > > > > > > +                    void **priv)
> > > > > > > +{
> > > > > > > +   if (module) {
> > > > > > > +           /* run workload in a subprocess so the module is put on
> > > > > crash */
> > > > > > > +           static struct igt_helper_process proc;
> > > > > > > +           int wstatus, ret;
> > > > > > > +
> > > > > > > +           bzero(&proc, sizeof(proc));
> > > > > > > +
> > > > > > > +           igt_fork_helper(&proc) {
> > > > > > > +                   /* suppress igt_log messages */
> > > > > > > +                   igt_log_level = IGT_LOG_NONE;
> > > > > > > +
> > > > > > > +                   /* intercept igt_fail/skip() long jumps */
> > > > > > > +                   if (sigsetjmp(igt_subtest_jmpbuf, 1) == 0)
> > > > > {
> > > > > > > +                           workload(device, arg,
> > > > > workload_wait, priv);
> > > > > > > +
> > > > > > > +                           (*workload_wait)(device,
> > > > > *priv);
> > > > > > > +
> > > > > > > +                           /* success if not diverted by
> > > > > igt_fail/skip() */
> > > > > > > +                           igt_success();
> > > > > > > +                   }
> > > > > > > +
> > > > > > > +                   /* pass exit code back to the caller */
> > > > > > > +                   igt_exit();
> > > > > > > +           }
> > > > > > > +           /* let the background process start doing its job or
> > > > > fail */
> > > > > > > +           sleep(2);
> > > > > > > +           /* fail or skip on workload premature completion */
> > > > > > > +           ret = waitpid(proc.pid, &wstatus, WNOHANG);
> > > > > > > +           if (ret < 0)
> > > > > > > +                   igt_fail(IGT_EXIT_INVALID);
> > > > > > > +           if (ret) {
> > > > > > > +                   if (!WIFEXITED(wstatus))
> > > > > > > +                           igt_fail(IGT_EXIT_INVALID);
> > > > > > > +                   if (WEXITSTATUS(wstatus) ==
> > > > > IGT_EXIT_SUCCESS)
> > > > > > > +                           igt_fail(IGT_EXIT_INVALID);
> > > > > > > +                   if (WEXITSTATUS(wstatus) == IGT_EXIT_SKIP)
> > > > > > > +                           igt_skip(NULL);
> > > > > > > +                   igt_fail(WEXITSTATUS(wstatus));
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           /* pass back info on how to wait for helper completion
> > > > > */
> > > > > > > +           *workload_wait = wait_helper;
> > > > > > > +           *priv = &proc;
> > > > > > > +   } else {
> > > > > > > +           /* run the requested workload directly */
> > > > > > > +           workload(device, arg, workload_wait, priv);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void run_subtest(int chipset, int workload, int action,
> > > > > > > +                   const char *module)
> > > > > > > +{
> > > > > > > +   workload_wait_t workload_wait;
> > > > > > > +   void *workload_priv;
> > > > > > > +   recover_t recover;
> > > > > > > +   const void *recover_priv;
> > > > > > > +   int device;
> > > > > > > +
> > > > > > > +   igt_subtest_f("%s-%s%s", workloads[workload].name,
> > > > > actions[action].name,
> > > > > > > +                 module ? "-unload" : "") {
> > > > > > > +           device = __drm_open_driver(chipset);
> > > > > > > +           igt_assert(device >= 0);
> > > > > > > +
> > > > > > > +           /* spawn the requested workload */
> > > > > > > +           igt_debug("spawning background workload\n");
> > > > > > > +           run_workload(device, workloads[workload].function,
> > > > > > > +                        workloads[workload].arg, module,
> > > > > > > +                        &workload_wait, &workload_priv);
> > > > > > > +
> > > > > > > +           /* run the requested test action */
> > > > > > > +           igt_debug("running test action\n");
> > > > > > > +           run_action(device, actions[action].function, module,
> > > > > > > +                      &recover, &recover_priv);
> > > > > > > +
> > > > > > > +           if (workload_wait) {
> > > > > > > +                   igt_debug("waiting for workload
> > > > > completion\n");
> > > > > > > +                   workload_wait(device, workload_priv);
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           close(device);
> > > > > > > +
> > > > > > > +           if (module) {
> > > > > > > +                   igt_debug("unloading %s\n", module);
> > > > > > > +                   module_unload(chipset, module);
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           if (recover) {
> > > > > > > +                   igt_debug("recovering device\n");
> > > > > > > +                   recover(recover_priv);
> > > > > > > +                   igt_reset_timeout();
> > > > > > > +           }
> > > > > > > +
> > > > > > > +           igt_debug("running healthcheck\n");
> > > > > > > +           healthcheck(chipset);
> > > > > > > +   }
> > > > > > > +}
> > > > > > > +
> > > > > > > +igt_main {
> > > > > > > +   int device, chipset;
> > > > > > > +   char *module;
> > > > > > > +   int i, j;
> > > > > > > +
> > > > > > > +   igt_fixture {
> > > > > > > +           char path[PATH_MAX];
> > > > > > > +           int dir, len;
> > > > > > > +
> > > > > > > +           /**
> > > > > > > +            * Since some subtests depend on successful unload of a
> > > > > driver
> > > > > > > +            * module, don't use drm_open_driver() as it keeps a
> > > > > device file
> > > > > > > +            * descriptor open for exit handler use and that
> > > > > effectively
> > > > > > > +            * prevents the module from being unloaded.
> > > > > > > +            */
> > > > > > > +           device = __drm_open_driver(DRIVER_ANY);
> > > > > > > +           igt_assert(device >= 0);
> > > > > > > +
> > > > > > > +           if (is_i915_device(device)) {
> > > > > > > +                   chipset = DRIVER_INTEL;
> > > > > > > +                   module = strdup("i915");
> > > > > > > +           } else {
> > > > > > > +                   chipset = DRIVER_ANY;
> > > > > > > +
> > > > > > > +                   /* Capture module name to be unloaded */
> > > > > > > +                   dir = igt_sysfs_open(device);
> > > > > > > +                   len = readlinkat(dir, "device/driver/
> > > > > module", path,
> > > > > > > +                                    sizeof(path) - 1);
> > > > > > > +                   close(dir);
> > > > > > > +                   path[len] = '\0';
> > > > > > > +                   module = strdup(strrchr(path, '/') + 1);
> > > > > > > +           }
> > > > > > > +           close(device);
> > > > > > > +
> > > > > > > +           igt_info("Running the test on driver \"%s\", chipset
> > > > > mask %#0x\n",
> > > > > > > +                    module, chipset);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   for (i = 0; i < sizeof(workloads) / sizeof(*workloads); i++) {
> > > > > > > +           for (j = 0; j < sizeof(actions) / sizeof(*actions); j+
> > > > > +) {
> > > > > > > +                   /* with module unload */
> > > > > > > +                   run_subtest(chipset, i, j, module);
> > > > > > > +                   /* without module unload */
> > > > > > > +                   run_subtest(chipset, i, j, NULL);
> > > > > > > +           }
> > > > > > > +   }
> > > > > > > +}
> > > > > > > diff --git a/tests/meson.build b/tests/meson.build
> > > > > > > index 711979b4..0d418035 100644
> > > > > > > --- a/tests/meson.build
> > > > > > > +++ b/tests/meson.build
> > > > > > > @@ -3,6 +3,7 @@ test_progs = [
> > > > > > >     'core_getclient',
> > > > > > >     'core_getstats',
> > > > > > >     'core_getversion',
> > > > > > > +   'core_hot_reload',
> > > > > > >     'core_setmaster_vs_auth',
> > > > > > >     'debugfs_test',
> > > > > > >     'drm_import_export',
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > igt-dev mailing list
> > > igt-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/igt-dev
> >
> >
>
>
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-07 13:32               ` Daniel Vetter
@ 2019-05-13  6:45                 ` Janusz Krzysztofik
  2019-05-13  8:30                   ` Daniel Vetter
  0 siblings, 1 reply; 13+ messages in thread
From: Janusz Krzysztofik @ 2019-05-13  6:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: IGT development, Petri Latvala

Hi Daniel,

On Tuesday, May 7, 2019 3:32:10 PM CEST Daniel Vetter wrote:
> ...
> Another option would be to first have a more minimal hotunplug test
> without any busy batches pending, and make sure that works reliable
> (i.e. no buffer, no context, no nothing, no open fd, absolutely
> nothing really allocated from userspace). Then extend from there
> step-by-step, e.g. first just open a drm device fd, then allocate a
> buffer, then mmap, then maybe try to write to that buffer after
> unplug, try to do an ioctl after unplug (go through all of them).
> 
> Only once you've gone through all the more basic things go to the much
> more evil stuff like having a batch in-flight, or a kms request
> in-flight. That way it should be easier to make forward progress on
> these bugs without risking the world. Plus you can start merging the
> igt side already, to make sure no one regresses anymore.

It looks like I might have misunderstood your idea.  Did you mean I should 
start with a patch that adds only the minimal hotunplug test and nothing more? 
Then extend it with more advanced subtests one by one consecutively but only 
when driver issues revealed by those subtests already in place are resolved?

If that's the case, I should probably submit v10 limited to the minimal test 
since v9 I submitted last week was not going to be taken under consideration.

Thanks,
Janusz



_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [igt-dev] [PATCH v8 1/1 i-g-t] tests: Add a new test for driver/device hot reload
  2019-05-13  6:45                 ` Janusz Krzysztofik
@ 2019-05-13  8:30                   ` Daniel Vetter
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Vetter @ 2019-05-13  8:30 UTC (permalink / raw)
  To: Janusz Krzysztofik; +Cc: IGT development, Petri Latvala, Daniel Vetter

On Mon, May 13, 2019 at 08:45:57AM +0200, Janusz Krzysztofik wrote:
> Hi Daniel,
> 
> On Tuesday, May 7, 2019 3:32:10 PM CEST Daniel Vetter wrote:
> > ...
> > Another option would be to first have a more minimal hotunplug test
> > without any busy batches pending, and make sure that works reliable
> > (i.e. no buffer, no context, no nothing, no open fd, absolutely
> > nothing really allocated from userspace). Then extend from there
> > step-by-step, e.g. first just open a drm device fd, then allocate a
> > buffer, then mmap, then maybe try to write to that buffer after
> > unplug, try to do an ioctl after unplug (go through all of them).
> > 
> > Only once you've gone through all the more basic things go to the much
> > more evil stuff like having a batch in-flight, or a kms request
> > in-flight. That way it should be easier to make forward progress on
> > these bugs without risking the world. Plus you can start merging the
> > igt side already, to make sure no one regresses anymore.
> 
> It looks like I might have misunderstood your idea.  Did you mean I should 
> start with a patch that adds only the minimal hotunplug test and nothing more? 
> Then extend it with more advanced subtests one by one consecutively but only 
> when driver issues revealed by those subtests already in place are resolved?

Yes that's the idea, we can't merge testcase that break the driver badly
(because that will result in other tests not being run, which isn't good).

> If that's the case, I should probably submit v10 limited to the minimal test 
> since v9 I submitted last week was not going to be taken under consideration.

I'll look at v9 still for some feedback on all the subtests, but yes need
to split it up and merge as we fix up the kernel side. So always submit
kernel and igt patches, pre-merge test them together (with the Test-with:
tag). Then first merge kernel fixes, then igt, and move on to the next set
of subtests.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-05-13  8:31 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-30 11:29 [igt-dev] [PATCH v8 0/1 i-g-t] tests: Add a new test for driver/device hot reload Janusz Krzysztofik
2019-04-30 11:29 ` [igt-dev] [PATCH v8 1/1 " Janusz Krzysztofik
2019-04-30 15:05   ` Daniel Vetter
2019-05-06  8:44     ` Janusz Krzysztofik
2019-05-06  9:21       ` Daniel Vetter
2019-05-07  6:24         ` Janusz Krzysztofik
2019-05-07  9:14           ` Daniel Vetter
2019-05-07 10:44             ` Janusz Krzysztofik
2019-05-07 13:32               ` Daniel Vetter
2019-05-13  6:45                 ` Janusz Krzysztofik
2019-05-13  8:30                   ` Daniel Vetter
2019-04-30 12:19 ` [igt-dev] ✓ Fi.CI.BAT: success for tests: Add a new test for driver/device hot reload (rev2) Patchwork
2019-05-01  0:26 ` [igt-dev] ✗ Fi.CI.IGT: failure " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.