All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] Fix the debugfs splat from mock selftests
@ 2021-07-19 18:30 ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Daniel Vetter, Jason Ekstrand

This patch series fixes a miscellaneous collection of bugs that all add up
to all our mock selftests throwing dmesg warnings in CI.  As can be seen
from "drm/i915: Always call i915_globals_exit() from i915_exit()", it's
especially fun since those warnings don't always show up in the selftests
but can show up in other random IGTs depending on test execution order.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Jason Ekstrand (6):
  drm/i915: Call i915_globals_exit() after i915_pmu_exit()
  drm/i915: Call i915_globals_exit() if pci_register_device() fails
  drm/i915: Always call i915_globals_exit() from i915_exit()
  drm/ttm: Force re-init if ttm_global_init() fails
  drm/ttm: Initialize debugfs from ttm_global_init()
  drm/i915: Make the kmem slab for i915_buddy_block a global

 drivers/gpu/drm/i915/i915_buddy.c   | 44 ++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_buddy.h   |  3 +-
 drivers/gpu/drm/i915/i915_globals.c |  6 ++--
 drivers/gpu/drm/i915/i915_pci.c     | 33 +++++++++++++++++-----
 drivers/gpu/drm/ttm/ttm_device.c    | 14 +++++++++
 drivers/gpu/drm/ttm/ttm_module.c    |  4 ---
 6 files changed, 80 insertions(+), 24 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 0/6] Fix the debugfs splat from mock selftests
@ 2021-07-19 18:30 ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Daniel Vetter

This patch series fixes a miscellaneous collection of bugs that all add up
to all our mock selftests throwing dmesg warnings in CI.  As can be seen
from "drm/i915: Always call i915_globals_exit() from i915_exit()", it's
especially fun since those warnings don't always show up in the selftests
but can show up in other random IGTs depending on test execution order.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Jason Ekstrand (6):
  drm/i915: Call i915_globals_exit() after i915_pmu_exit()
  drm/i915: Call i915_globals_exit() if pci_register_device() fails
  drm/i915: Always call i915_globals_exit() from i915_exit()
  drm/ttm: Force re-init if ttm_global_init() fails
  drm/ttm: Initialize debugfs from ttm_global_init()
  drm/i915: Make the kmem slab for i915_buddy_block a global

 drivers/gpu/drm/i915/i915_buddy.c   | 44 ++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_buddy.h   |  3 +-
 drivers/gpu/drm/i915/i915_globals.c |  6 ++--
 drivers/gpu/drm/i915/i915_pci.c     | 33 +++++++++++++++++-----
 drivers/gpu/drm/ttm/ttm_device.c    | 14 +++++++++
 drivers/gpu/drm/ttm/ttm_module.c    |  4 ---
 6 files changed, 80 insertions(+), 24 deletions(-)

-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/6] drm/i915: Call i915_globals_exit() after i915_pmu_exit()
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 18:30   ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Daniel Vetter, Jason Ekstrand, Tvrtko Ursulin

We should tear down in the opposite order we set up.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 67696d7522718..50ed93b03e582 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1244,8 +1244,8 @@ static void __exit i915_exit(void)
 
 	i915_perf_sysctl_unregister();
 	pci_unregister_driver(&i915_pci_driver);
-	i915_globals_exit();
 	i915_pmu_exit();
+	i915_globals_exit();
 }
 
 module_init(i915_init);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 1/6] drm/i915: Call i915_globals_exit() after i915_pmu_exit()
@ 2021-07-19 18:30   ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Daniel Vetter

We should tear down in the opposite order we set up.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 67696d7522718..50ed93b03e582 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1244,8 +1244,8 @@ static void __exit i915_exit(void)
 
 	i915_perf_sysctl_unregister();
 	pci_unregister_driver(&i915_pci_driver);
-	i915_globals_exit();
 	i915_pmu_exit();
+	i915_globals_exit();
 }
 
 module_init(i915_init);
-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 2/6] drm/i915: Call i915_globals_exit() if pci_register_device() fails
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 18:30   ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Jason Ekstrand

In the unlikely event that pci_register_device() fails, we were tearing
down our PMU setup but not globals.  This leaves a bunch of memory slabs
lying around.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_globals.c | 4 ++--
 drivers/gpu/drm/i915/i915_pci.c     | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
index 77f1911c463b8..87267e1d2ad92 100644
--- a/drivers/gpu/drm/i915/i915_globals.c
+++ b/drivers/gpu/drm/i915/i915_globals.c
@@ -138,7 +138,7 @@ void i915_globals_unpark(void)
 	atomic_inc(&active);
 }
 
-static void __exit __i915_globals_flush(void)
+static void __i915_globals_flush(void)
 {
 	atomic_inc(&active); /* skip shrinking */
 
@@ -148,7 +148,7 @@ static void __exit __i915_globals_flush(void)
 	atomic_dec(&active);
 }
 
-void __exit i915_globals_exit(void)
+void i915_globals_exit(void)
 {
 	GEM_BUG_ON(atomic_read(&active));
 
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 50ed93b03e582..4e627b57d31a2 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1230,6 +1230,7 @@ static int __init i915_init(void)
 	err = pci_register_driver(&i915_pci_driver);
 	if (err) {
 		i915_pmu_exit();
+		i915_globals_exit();
 		return err;
 	}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 2/6] drm/i915: Call i915_globals_exit() if pci_register_device() fails
@ 2021-07-19 18:30   ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel

In the unlikely event that pci_register_device() fails, we were tearing
down our PMU setup but not globals.  This leaves a bunch of memory slabs
lying around.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_globals.c | 4 ++--
 drivers/gpu/drm/i915/i915_pci.c     | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
index 77f1911c463b8..87267e1d2ad92 100644
--- a/drivers/gpu/drm/i915/i915_globals.c
+++ b/drivers/gpu/drm/i915/i915_globals.c
@@ -138,7 +138,7 @@ void i915_globals_unpark(void)
 	atomic_inc(&active);
 }
 
-static void __exit __i915_globals_flush(void)
+static void __i915_globals_flush(void)
 {
 	atomic_inc(&active); /* skip shrinking */
 
@@ -148,7 +148,7 @@ static void __exit __i915_globals_flush(void)
 	atomic_dec(&active);
 }
 
-void __exit i915_globals_exit(void)
+void i915_globals_exit(void)
 {
 	GEM_BUG_ON(atomic_read(&active));
 
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 50ed93b03e582..4e627b57d31a2 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1230,6 +1230,7 @@ static int __init i915_init(void)
 	err = pci_register_driver(&i915_pci_driver);
 	if (err) {
 		i915_pmu_exit();
+		i915_globals_exit();
 		return err;
 	}
 
-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 18:30   ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Jason Ekstrand

If the driver was not fully loaded, we may still have globals lying
around.  If we don't tear those down in i915_exit(), we'll leak a bunch
of memory slabs.  This can happen two ways: use_kms = false and if we've
run mock selftests.  In either case, we have an early exit from
i915_init which happens after i915_globals_init() and we need to clean
up those globals.  While we're here, add an explicit boolean instead of
using a random field from i915_pci_device to detect partial loads.

The mock selftests case gets especially sticky.  The load isn't entirely
a no-op.  We actually do quite a bit inside those selftests including
allocating a bunch of mock objects and running tests on them.  Once all
those tests are complete, we exit early from i915_init().  Perviously,
i915_init() would return a non-zero error code on failure and a zero
error code on success.  In the success case, we would get to i915_exit()
and check i915_pci_driver.driver.owner to detect if i915_init exited early
and do nothing.  In the failure case, we would fail i915_init() but
there would be no opportunity to clean up globals.

The most annoying part is that you don't actually notice the failure as
part of the self-tests since leaking a bit of memory, while bad, doesn't
result in anything observable from userspace.  Instead, the next time we
load the driver (usually for next IGT test), i915_globals_init() gets
invoked again, we go to allocate a bunch of new memory slabs, those
implicitly create debugfs entries, and debugfs warns that we're trying
to create directories and files that already exist.  Since this all
happens as part of the next driver load, it shows up in the dmesg-warn
of whatever IGT test ran after the mock selftests.

While the obvious thing to do here might be to call i915_globals_exit()
after selftests, that's not actually safe.  The dma-buf selftests call
i915_gem_prime_export which creates a file.  We call dma_buf_put() on
the resulting dmabuf which calls fput() on the file.  However, fput()
isn't immediate and gets flushed right before syscall returns.  This
means that all the fput()s from the selftests don't happen until right
before the module load syscall used to fire off the selftests returns
which is after i915_init().  If we call i915_globals_exit() in
i915_init() after selftests, we end up freeing slabs out from under
objects which won't get released until fput() is flushed at the end of
the module load.

The solution here is to let i915_init() return success early and detect
the early success in i915_exit() and only tear down globals and nothing
else.  This way the module loads successfully, regardless of the success
or failure of the tests.  Because we've not enumerated any PCI devices,
no device nodes are created and it's entirely useless from userspace.
The only thing the module does at that point is hold on to a bit of
memory until we unload it and i915_exit() is called.  Importantly, this
means that everything from our selftests has the ability to properly
flush out between i915_init() and i915_exit() because there are a couple
syscall boundaries in between.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 4e627b57d31a2..24e4e54516936 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
 	.driver.pm = &i915_pm_ops,
 };
 
+static bool i915_fully_loaded = false;
+
 static int __init i915_init(void)
 {
 	bool use_kms = true;
 	int err;
 
+	i915_fully_loaded = false;
+
 	err = i915_globals_init();
 	if (err)
 		return err;
 
+	/* i915_mock_selftests() only returns zero if no mock subtests were
+	 * run.  If we get any non-zero error code, we return early here.
+	 * We always return success because selftests may have allocated
+	 * objects from slabs which will get cleaned up by i915_exit().  We
+	 * could attempt to clean up immediately and fail module load but,
+	 * thanks to interactions with other parts of the kernel (struct
+	 * file, in particular), it's safer to let the module fully load
+	 * and then clean up on unload.
+	 */
 	err = i915_mock_selftests();
 	if (err)
-		return err > 0 ? 0 : err;
+		return 0;
 
 	/*
 	 * Enable KMS by default, unless explicitly overriden by
@@ -1225,6 +1238,12 @@ static int __init i915_init(void)
 		return 0;
 	}
 
+	/* After this point, i915_init() must either fully succeed or
+	 * properly tear everything down and fail.  We don't have separate
+	 * flags for each set-up bit.
+	 */
+	i915_fully_loaded = true;
+
 	i915_pmu_init();
 
 	err = pci_register_driver(&i915_pci_driver);
@@ -1240,12 +1259,11 @@ static int __init i915_init(void)
 
 static void __exit i915_exit(void)
 {
-	if (!i915_pci_driver.driver.owner)
-		return;
-
-	i915_perf_sysctl_unregister();
-	pci_unregister_driver(&i915_pci_driver);
-	i915_pmu_exit();
+	if (i915_fully_loaded) {
+		i915_perf_sysctl_unregister();
+		pci_unregister_driver(&i915_pci_driver);
+		i915_pmu_exit();
+	}
 	i915_globals_exit();
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-19 18:30   ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel

If the driver was not fully loaded, we may still have globals lying
around.  If we don't tear those down in i915_exit(), we'll leak a bunch
of memory slabs.  This can happen two ways: use_kms = false and if we've
run mock selftests.  In either case, we have an early exit from
i915_init which happens after i915_globals_init() and we need to clean
up those globals.  While we're here, add an explicit boolean instead of
using a random field from i915_pci_device to detect partial loads.

The mock selftests case gets especially sticky.  The load isn't entirely
a no-op.  We actually do quite a bit inside those selftests including
allocating a bunch of mock objects and running tests on them.  Once all
those tests are complete, we exit early from i915_init().  Perviously,
i915_init() would return a non-zero error code on failure and a zero
error code on success.  In the success case, we would get to i915_exit()
and check i915_pci_driver.driver.owner to detect if i915_init exited early
and do nothing.  In the failure case, we would fail i915_init() but
there would be no opportunity to clean up globals.

The most annoying part is that you don't actually notice the failure as
part of the self-tests since leaking a bit of memory, while bad, doesn't
result in anything observable from userspace.  Instead, the next time we
load the driver (usually for next IGT test), i915_globals_init() gets
invoked again, we go to allocate a bunch of new memory slabs, those
implicitly create debugfs entries, and debugfs warns that we're trying
to create directories and files that already exist.  Since this all
happens as part of the next driver load, it shows up in the dmesg-warn
of whatever IGT test ran after the mock selftests.

While the obvious thing to do here might be to call i915_globals_exit()
after selftests, that's not actually safe.  The dma-buf selftests call
i915_gem_prime_export which creates a file.  We call dma_buf_put() on
the resulting dmabuf which calls fput() on the file.  However, fput()
isn't immediate and gets flushed right before syscall returns.  This
means that all the fput()s from the selftests don't happen until right
before the module load syscall used to fire off the selftests returns
which is after i915_init().  If we call i915_globals_exit() in
i915_init() after selftests, we end up freeing slabs out from under
objects which won't get released until fput() is flushed at the end of
the module load.

The solution here is to let i915_init() return success early and detect
the early success in i915_exit() and only tear down globals and nothing
else.  This way the module loads successfully, regardless of the success
or failure of the tests.  Because we've not enumerated any PCI devices,
no device nodes are created and it's entirely useless from userspace.
The only thing the module does at that point is hold on to a bit of
memory until we unload it and i915_exit() is called.  Importantly, this
means that everything from our selftests has the ability to properly
flush out between i915_init() and i915_exit() because there are a couple
syscall boundaries in between.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 4e627b57d31a2..24e4e54516936 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
 	.driver.pm = &i915_pm_ops,
 };
 
+static bool i915_fully_loaded = false;
+
 static int __init i915_init(void)
 {
 	bool use_kms = true;
 	int err;
 
+	i915_fully_loaded = false;
+
 	err = i915_globals_init();
 	if (err)
 		return err;
 
+	/* i915_mock_selftests() only returns zero if no mock subtests were
+	 * run.  If we get any non-zero error code, we return early here.
+	 * We always return success because selftests may have allocated
+	 * objects from slabs which will get cleaned up by i915_exit().  We
+	 * could attempt to clean up immediately and fail module load but,
+	 * thanks to interactions with other parts of the kernel (struct
+	 * file, in particular), it's safer to let the module fully load
+	 * and then clean up on unload.
+	 */
 	err = i915_mock_selftests();
 	if (err)
-		return err > 0 ? 0 : err;
+		return 0;
 
 	/*
 	 * Enable KMS by default, unless explicitly overriden by
@@ -1225,6 +1238,12 @@ static int __init i915_init(void)
 		return 0;
 	}
 
+	/* After this point, i915_init() must either fully succeed or
+	 * properly tear everything down and fail.  We don't have separate
+	 * flags for each set-up bit.
+	 */
+	i915_fully_loaded = true;
+
 	i915_pmu_init();
 
 	err = pci_register_driver(&i915_pci_driver);
@@ -1240,12 +1259,11 @@ static int __init i915_init(void)
 
 static void __exit i915_exit(void)
 {
-	if (!i915_pci_driver.driver.owner)
-		return;
-
-	i915_perf_sysctl_unregister();
-	pci_unregister_driver(&i915_pci_driver);
-	i915_pmu_exit();
+	if (i915_fully_loaded) {
+		i915_perf_sysctl_unregister();
+		pci_unregister_driver(&i915_pci_driver);
+		i915_pmu_exit();
+	}
 	i915_globals_exit();
 }
 
-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 4/6] drm/ttm: Force re-init if ttm_global_init() fails
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 18:30   ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Christian König, Jason Ekstrand

If we have a failure, decrement the reference count so that the next
call to ttm_global_init() will actually do something instead of assume
everything is all set up.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 62b53b37e4b1 ("drm/ttm: use a static ttm_bo_global instance")
Cc: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 5f31acec3ad76..519deea8e39b7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -100,6 +100,8 @@ static int ttm_global_init(void)
 	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
 				&glob->bo_count);
 out:
+	if (ret)
+		--ttm_glob_use_count;
 	mutex_unlock(&ttm_global_mutex);
 	return ret;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 4/6] drm/ttm: Force re-init if ttm_global_init() fails
@ 2021-07-19 18:30   ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Christian König

If we have a failure, decrement the reference count so that the next
call to ttm_global_init() will actually do something instead of assume
everything is all set up.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 62b53b37e4b1 ("drm/ttm: use a static ttm_bo_global instance")
Cc: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 5f31acec3ad76..519deea8e39b7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -100,6 +100,8 @@ static int ttm_global_init(void)
 	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
 				&glob->bo_count);
 out:
+	if (ret)
+		--ttm_glob_use_count;
 	mutex_unlock(&ttm_global_mutex);
 	return ret;
 }
-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 5/6] drm/ttm: Initialize debugfs from ttm_global_init()
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 18:30   ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Jason Ekstrand

We create a bunch of debugfs entries as a side-effect of
ttm_global_init() and then never clean them up.  This isn't usually a
problem because we free the whole debugfs directory on module unload.
However, if the global reference count ever goes to zero and then
ttm_global_init() is called again, we'll re-create those debugfs entries
and debugfs will complain in dmesg that we're creating entries that
already exist.  This patch fixes this problem by changing the lifetime
of the whole TTM debugfs directory to match that of the TTM global
state.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/gpu/drm/ttm/ttm_device.c | 12 ++++++++++++
 drivers/gpu/drm/ttm/ttm_module.c |  4 ----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 519deea8e39b7..74e3b460132b3 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -44,6 +44,8 @@ static unsigned ttm_glob_use_count;
 struct ttm_global ttm_glob;
 EXPORT_SYMBOL(ttm_glob);
 
+struct dentry *ttm_debugfs_root;
+
 static void ttm_global_release(void)
 {
 	struct ttm_global *glob = &ttm_glob;
@@ -53,6 +55,7 @@ static void ttm_global_release(void)
 		goto out;
 
 	ttm_pool_mgr_fini();
+	debugfs_remove(ttm_debugfs_root);
 
 	__free_page(glob->dummy_read_page);
 	memset(glob, 0, sizeof(*glob));
@@ -73,6 +76,13 @@ static int ttm_global_init(void)
 
 	si_meminfo(&si);
 
+	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
+	if (IS_ERR(ttm_debugfs_root)) {
+		ret = PTR_ERR(ttm_debugfs_root);
+		ttm_debugfs_root = NULL;
+		goto out;
+	}
+
 	/* Limit the number of pages in the pool to about 50% of the total
 	 * system memory.
 	 */
@@ -100,6 +110,8 @@ static int ttm_global_init(void)
 	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
 				&glob->bo_count);
 out:
+	if (ret && ttm_debugfs_root)
+		debugfs_remove(ttm_debugfs_root);
 	if (ret)
 		--ttm_glob_use_count;
 	mutex_unlock(&ttm_global_mutex);
diff --git a/drivers/gpu/drm/ttm/ttm_module.c b/drivers/gpu/drm/ttm/ttm_module.c
index 997c458f68a9a..88554f2db11fe 100644
--- a/drivers/gpu/drm/ttm/ttm_module.c
+++ b/drivers/gpu/drm/ttm/ttm_module.c
@@ -72,17 +72,13 @@ pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp)
 	return tmp;
 }
 
-struct dentry *ttm_debugfs_root;
-
 static int __init ttm_init(void)
 {
-	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
 	return 0;
 }
 
 static void __exit ttm_exit(void)
 {
-	debugfs_remove(ttm_debugfs_root);
 }
 
 module_init(ttm_init);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 5/6] drm/ttm: Initialize debugfs from ttm_global_init()
@ 2021-07-19 18:30   ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel

We create a bunch of debugfs entries as a side-effect of
ttm_global_init() and then never clean them up.  This isn't usually a
problem because we free the whole debugfs directory on module unload.
However, if the global reference count ever goes to zero and then
ttm_global_init() is called again, we'll re-create those debugfs entries
and debugfs will complain in dmesg that we're creating entries that
already exist.  This patch fixes this problem by changing the lifetime
of the whole TTM debugfs directory to match that of the TTM global
state.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
---
 drivers/gpu/drm/ttm/ttm_device.c | 12 ++++++++++++
 drivers/gpu/drm/ttm/ttm_module.c |  4 ----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 519deea8e39b7..74e3b460132b3 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -44,6 +44,8 @@ static unsigned ttm_glob_use_count;
 struct ttm_global ttm_glob;
 EXPORT_SYMBOL(ttm_glob);
 
+struct dentry *ttm_debugfs_root;
+
 static void ttm_global_release(void)
 {
 	struct ttm_global *glob = &ttm_glob;
@@ -53,6 +55,7 @@ static void ttm_global_release(void)
 		goto out;
 
 	ttm_pool_mgr_fini();
+	debugfs_remove(ttm_debugfs_root);
 
 	__free_page(glob->dummy_read_page);
 	memset(glob, 0, sizeof(*glob));
@@ -73,6 +76,13 @@ static int ttm_global_init(void)
 
 	si_meminfo(&si);
 
+	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
+	if (IS_ERR(ttm_debugfs_root)) {
+		ret = PTR_ERR(ttm_debugfs_root);
+		ttm_debugfs_root = NULL;
+		goto out;
+	}
+
 	/* Limit the number of pages in the pool to about 50% of the total
 	 * system memory.
 	 */
@@ -100,6 +110,8 @@ static int ttm_global_init(void)
 	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
 				&glob->bo_count);
 out:
+	if (ret && ttm_debugfs_root)
+		debugfs_remove(ttm_debugfs_root);
 	if (ret)
 		--ttm_glob_use_count;
 	mutex_unlock(&ttm_global_mutex);
diff --git a/drivers/gpu/drm/ttm/ttm_module.c b/drivers/gpu/drm/ttm/ttm_module.c
index 997c458f68a9a..88554f2db11fe 100644
--- a/drivers/gpu/drm/ttm/ttm_module.c
+++ b/drivers/gpu/drm/ttm/ttm_module.c
@@ -72,17 +72,13 @@ pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp)
 	return tmp;
 }
 
-struct dentry *ttm_debugfs_root;
-
 static int __init ttm_init(void)
 {
-	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
 	return 0;
 }
 
 static void __exit ttm_exit(void)
 {
-	debugfs_remove(ttm_debugfs_root);
 }
 
 module_init(ttm_init);
-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 6/6] drm/i915: Make the kmem slab for i915_buddy_block a global
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 18:30   ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Matthew Auld, Jason Ekstrand

There's no reason that I can tell why this should be per-i915_buddy_mm
and doing so causes KMEM_CACHE to throw dmesg warnings because it tries
to create a debugfs entry with the name i915_buddy_block multiple times.
We could handle this by carefully giving each slab its own name but that
brings its own pain because then we have to store that string somewhere
and manage the lifetimes of the different slabs.  The most likely
outcome would be a global atomic which we increment to get a new name or
something like that.

The much easier solution is to use the i915_globals system like we do
for every other slab in i915.  This ensures that we have exactly one of
them for each i915 driver load and it gets neatly created on module load
and destroyed on module unload.  Using the globals system also means
that its now tied into the shrink handler so we can properly respond to
low-memory situations.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 88be9a0a06b7 ("drm/i915/ttm: add ttm_buddy_man")
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_buddy.c   | 44 ++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_buddy.h   |  3 +-
 drivers/gpu/drm/i915/i915_globals.c |  2 ++
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_buddy.c b/drivers/gpu/drm/i915/i915_buddy.c
index 29dd7d0310c1f..911feedad4513 100644
--- a/drivers/gpu/drm/i915/i915_buddy.c
+++ b/drivers/gpu/drm/i915/i915_buddy.c
@@ -8,8 +8,14 @@
 #include "i915_buddy.h"
 
 #include "i915_gem.h"
+#include "i915_globals.h"
 #include "i915_utils.h"
 
+static struct i915_global_buddy {
+	struct i915_global base;
+	struct kmem_cache *slab_blocks;
+} global;
+
 static struct i915_buddy_block *i915_block_alloc(struct i915_buddy_mm *mm,
 						 struct i915_buddy_block *parent,
 						 unsigned int order,
@@ -19,7 +25,7 @@ static struct i915_buddy_block *i915_block_alloc(struct i915_buddy_mm *mm,
 
 	GEM_BUG_ON(order > I915_BUDDY_MAX_ORDER);
 
-	block = kmem_cache_zalloc(mm->slab_blocks, GFP_KERNEL);
+	block = kmem_cache_zalloc(global.slab_blocks, GFP_KERNEL);
 	if (!block)
 		return NULL;
 
@@ -34,7 +40,7 @@ static struct i915_buddy_block *i915_block_alloc(struct i915_buddy_mm *mm,
 static void i915_block_free(struct i915_buddy_mm *mm,
 			    struct i915_buddy_block *block)
 {
-	kmem_cache_free(mm->slab_blocks, block);
+	kmem_cache_free(global.slab_blocks, block);
 }
 
 static void mark_allocated(struct i915_buddy_block *block)
@@ -85,15 +91,11 @@ int i915_buddy_init(struct i915_buddy_mm *mm, u64 size, u64 chunk_size)
 
 	GEM_BUG_ON(mm->max_order > I915_BUDDY_MAX_ORDER);
 
-	mm->slab_blocks = KMEM_CACHE(i915_buddy_block, SLAB_HWCACHE_ALIGN);
-	if (!mm->slab_blocks)
-		return -ENOMEM;
-
 	mm->free_list = kmalloc_array(mm->max_order + 1,
 				      sizeof(struct list_head),
 				      GFP_KERNEL);
 	if (!mm->free_list)
-		goto out_destroy_slab;
+		return -ENOMEM;
 
 	for (i = 0; i <= mm->max_order; ++i)
 		INIT_LIST_HEAD(&mm->free_list[i]);
@@ -145,8 +147,6 @@ int i915_buddy_init(struct i915_buddy_mm *mm, u64 size, u64 chunk_size)
 	kfree(mm->roots);
 out_free_list:
 	kfree(mm->free_list);
-out_destroy_slab:
-	kmem_cache_destroy(mm->slab_blocks);
 	return -ENOMEM;
 }
 
@@ -161,7 +161,6 @@ void i915_buddy_fini(struct i915_buddy_mm *mm)
 
 	kfree(mm->roots);
 	kfree(mm->free_list);
-	kmem_cache_destroy(mm->slab_blocks);
 }
 
 static int split_block(struct i915_buddy_mm *mm,
@@ -410,3 +409,28 @@ int i915_buddy_alloc_range(struct i915_buddy_mm *mm,
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftests/i915_buddy.c"
 #endif
+
+static void i915_global_buddy_shrink(void)
+{
+	kmem_cache_shrink(global.slab_blocks);
+}
+
+static void i915_global_buddy_exit(void)
+{
+	kmem_cache_destroy(global.slab_blocks);
+}
+
+static struct i915_global_buddy global = { {
+	.shrink = i915_global_buddy_shrink,
+	.exit = i915_global_buddy_exit,
+} };
+
+int __init i915_global_buddy_init(void)
+{
+	global.slab_blocks = KMEM_CACHE(i915_buddy_block, 0);
+	if (!global.slab_blocks)
+		return -ENOMEM;
+
+	i915_global_register(&global.base);
+	return 0;
+}
diff --git a/drivers/gpu/drm/i915/i915_buddy.h b/drivers/gpu/drm/i915/i915_buddy.h
index 37f8c42071d12..d8f26706de52f 100644
--- a/drivers/gpu/drm/i915/i915_buddy.h
+++ b/drivers/gpu/drm/i915/i915_buddy.h
@@ -47,7 +47,6 @@ struct i915_buddy_block {
  * i915_buddy_alloc* and i915_buddy_free* should suffice.
  */
 struct i915_buddy_mm {
-	struct kmem_cache *slab_blocks;
 	/* Maintain a free list for each order. */
 	struct list_head *free_list;
 
@@ -130,4 +129,6 @@ void i915_buddy_free(struct i915_buddy_mm *mm, struct i915_buddy_block *block);
 
 void i915_buddy_free_list(struct i915_buddy_mm *mm, struct list_head *objects);
 
+int i915_global_buddy_init(void);
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
index 87267e1d2ad92..e57102a4c8d16 100644
--- a/drivers/gpu/drm/i915/i915_globals.c
+++ b/drivers/gpu/drm/i915/i915_globals.c
@@ -8,6 +8,7 @@
 #include <linux/workqueue.h>
 
 #include "i915_active.h"
+#include "i915_buddy.h"
 #include "gem/i915_gem_context.h"
 #include "gem/i915_gem_object.h"
 #include "i915_globals.h"
@@ -87,6 +88,7 @@ static void __i915_globals_cleanup(void)
 
 static __initconst int (* const initfn[])(void) = {
 	i915_global_active_init,
+	i915_global_buddy_init,
 	i915_global_context_init,
 	i915_global_gem_context_init,
 	i915_global_objects_init,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] [PATCH 6/6] drm/i915: Make the kmem slab for i915_buddy_block a global
@ 2021-07-19 18:30   ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-19 18:30 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Matthew Auld

There's no reason that I can tell why this should be per-i915_buddy_mm
and doing so causes KMEM_CACHE to throw dmesg warnings because it tries
to create a debugfs entry with the name i915_buddy_block multiple times.
We could handle this by carefully giving each slab its own name but that
brings its own pain because then we have to store that string somewhere
and manage the lifetimes of the different slabs.  The most likely
outcome would be a global atomic which we increment to get a new name or
something like that.

The much easier solution is to use the i915_globals system like we do
for every other slab in i915.  This ensures that we have exactly one of
them for each i915 driver load and it gets neatly created on module load
and destroyed on module unload.  Using the globals system also means
that its now tied into the shrink handler so we can properly respond to
low-memory situations.

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Fixes: 88be9a0a06b7 ("drm/i915/ttm: add ttm_buddy_man")
Cc: Matthew Auld <matthew.auld@intel.com>
---
 drivers/gpu/drm/i915/i915_buddy.c   | 44 ++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_buddy.h   |  3 +-
 drivers/gpu/drm/i915/i915_globals.c |  2 ++
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_buddy.c b/drivers/gpu/drm/i915/i915_buddy.c
index 29dd7d0310c1f..911feedad4513 100644
--- a/drivers/gpu/drm/i915/i915_buddy.c
+++ b/drivers/gpu/drm/i915/i915_buddy.c
@@ -8,8 +8,14 @@
 #include "i915_buddy.h"
 
 #include "i915_gem.h"
+#include "i915_globals.h"
 #include "i915_utils.h"
 
+static struct i915_global_buddy {
+	struct i915_global base;
+	struct kmem_cache *slab_blocks;
+} global;
+
 static struct i915_buddy_block *i915_block_alloc(struct i915_buddy_mm *mm,
 						 struct i915_buddy_block *parent,
 						 unsigned int order,
@@ -19,7 +25,7 @@ static struct i915_buddy_block *i915_block_alloc(struct i915_buddy_mm *mm,
 
 	GEM_BUG_ON(order > I915_BUDDY_MAX_ORDER);
 
-	block = kmem_cache_zalloc(mm->slab_blocks, GFP_KERNEL);
+	block = kmem_cache_zalloc(global.slab_blocks, GFP_KERNEL);
 	if (!block)
 		return NULL;
 
@@ -34,7 +40,7 @@ static struct i915_buddy_block *i915_block_alloc(struct i915_buddy_mm *mm,
 static void i915_block_free(struct i915_buddy_mm *mm,
 			    struct i915_buddy_block *block)
 {
-	kmem_cache_free(mm->slab_blocks, block);
+	kmem_cache_free(global.slab_blocks, block);
 }
 
 static void mark_allocated(struct i915_buddy_block *block)
@@ -85,15 +91,11 @@ int i915_buddy_init(struct i915_buddy_mm *mm, u64 size, u64 chunk_size)
 
 	GEM_BUG_ON(mm->max_order > I915_BUDDY_MAX_ORDER);
 
-	mm->slab_blocks = KMEM_CACHE(i915_buddy_block, SLAB_HWCACHE_ALIGN);
-	if (!mm->slab_blocks)
-		return -ENOMEM;
-
 	mm->free_list = kmalloc_array(mm->max_order + 1,
 				      sizeof(struct list_head),
 				      GFP_KERNEL);
 	if (!mm->free_list)
-		goto out_destroy_slab;
+		return -ENOMEM;
 
 	for (i = 0; i <= mm->max_order; ++i)
 		INIT_LIST_HEAD(&mm->free_list[i]);
@@ -145,8 +147,6 @@ int i915_buddy_init(struct i915_buddy_mm *mm, u64 size, u64 chunk_size)
 	kfree(mm->roots);
 out_free_list:
 	kfree(mm->free_list);
-out_destroy_slab:
-	kmem_cache_destroy(mm->slab_blocks);
 	return -ENOMEM;
 }
 
@@ -161,7 +161,6 @@ void i915_buddy_fini(struct i915_buddy_mm *mm)
 
 	kfree(mm->roots);
 	kfree(mm->free_list);
-	kmem_cache_destroy(mm->slab_blocks);
 }
 
 static int split_block(struct i915_buddy_mm *mm,
@@ -410,3 +409,28 @@ int i915_buddy_alloc_range(struct i915_buddy_mm *mm,
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftests/i915_buddy.c"
 #endif
+
+static void i915_global_buddy_shrink(void)
+{
+	kmem_cache_shrink(global.slab_blocks);
+}
+
+static void i915_global_buddy_exit(void)
+{
+	kmem_cache_destroy(global.slab_blocks);
+}
+
+static struct i915_global_buddy global = { {
+	.shrink = i915_global_buddy_shrink,
+	.exit = i915_global_buddy_exit,
+} };
+
+int __init i915_global_buddy_init(void)
+{
+	global.slab_blocks = KMEM_CACHE(i915_buddy_block, 0);
+	if (!global.slab_blocks)
+		return -ENOMEM;
+
+	i915_global_register(&global.base);
+	return 0;
+}
diff --git a/drivers/gpu/drm/i915/i915_buddy.h b/drivers/gpu/drm/i915/i915_buddy.h
index 37f8c42071d12..d8f26706de52f 100644
--- a/drivers/gpu/drm/i915/i915_buddy.h
+++ b/drivers/gpu/drm/i915/i915_buddy.h
@@ -47,7 +47,6 @@ struct i915_buddy_block {
  * i915_buddy_alloc* and i915_buddy_free* should suffice.
  */
 struct i915_buddy_mm {
-	struct kmem_cache *slab_blocks;
 	/* Maintain a free list for each order. */
 	struct list_head *free_list;
 
@@ -130,4 +129,6 @@ void i915_buddy_free(struct i915_buddy_mm *mm, struct i915_buddy_block *block);
 
 void i915_buddy_free_list(struct i915_buddy_mm *mm, struct list_head *objects);
 
+int i915_global_buddy_init(void);
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
index 87267e1d2ad92..e57102a4c8d16 100644
--- a/drivers/gpu/drm/i915/i915_globals.c
+++ b/drivers/gpu/drm/i915/i915_globals.c
@@ -8,6 +8,7 @@
 #include <linux/workqueue.h>
 
 #include "i915_active.h"
+#include "i915_buddy.h"
 #include "gem/i915_gem_context.h"
 #include "gem/i915_gem_object.h"
 #include "i915_globals.h"
@@ -87,6 +88,7 @@ static void __i915_globals_cleanup(void)
 
 static __initconst int (* const initfn[])(void) = {
 	i915_global_active_init,
+	i915_global_buddy_init,
 	i915_global_context_init,
 	i915_global_gem_context_init,
 	i915_global_objects_init,
-- 
2.31.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Fix the debugfs splat from mock selftests
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
                   ` (6 preceding siblings ...)
  (?)
@ 2021-07-19 18:38 ` Patchwork
  -1 siblings, 0 replies; 45+ messages in thread
From: Patchwork @ 2021-07-19 18:38 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx

== Series Details ==

Series: Fix the debugfs splat from mock selftests
URL   : https://patchwork.freedesktop.org/series/92729/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
ee28cc1d9eee drm/i915: Call i915_globals_exit() after i915_pmu_exit()
175239d5619b drm/i915: Call i915_globals_exit() if pci_register_device() fails
1f5efaa53ad2 drm/i915: Always call i915_globals_exit() from i915_exit()
-:69: ERROR:INITIALISED_STATIC: do not initialise statics to false
#69: FILE: drivers/gpu/drm/i915/i915_pci.c:1197:
+static bool i915_fully_loaded = false;

total: 1 errors, 0 warnings, 0 checks, 61 lines checked
920bf5ea1d38 drm/ttm: Force re-init if ttm_global_init() fails
a959d8c1714d drm/ttm: Initialize debugfs from ttm_global_init()
825df92b5659 drm/i915: Make the kmem slab for i915_buddy_block a global


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Fix the debugfs splat from mock selftests
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
                   ` (7 preceding siblings ...)
  (?)
@ 2021-07-19 18:38 ` Patchwork
  -1 siblings, 0 replies; 45+ messages in thread
From: Patchwork @ 2021-07-19 18:38 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx

== Series Details ==

Series: Fix the debugfs splat from mock selftests
URL   : https://patchwork.freedesktop.org/series/92729/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/ttm/ttm_device.c:144:5: warning: context imbalance in 'ttm_device_swapout' - wrong count at exit


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Intel-gfx] ✗ Fi.CI.DOCS: warning for Fix the debugfs splat from mock selftests
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
                   ` (8 preceding siblings ...)
  (?)
@ 2021-07-19 18:42 ` Patchwork
  -1 siblings, 0 replies; 45+ messages in thread
From: Patchwork @ 2021-07-19 18:42 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx

== Series Details ==

Series: Fix the debugfs splat from mock selftests
URL   : https://patchwork.freedesktop.org/series/92729/
State : warning

== Summary ==

$ make htmldocs 2>&1 > /dev/null | grep i915
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Excess function parameter 'jump_whitelist' description in 'intel_engine_cmd_parser'
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Excess function parameter 'shadow_map' description in 'intel_engine_cmd_parser'
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Excess function parameter 'batch_map' description in 'intel_engine_cmd_parser'
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Function parameter or member 'trampoline' not described in 'intel_engine_cmd_parser'
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Excess function parameter 'jump_whitelist' description in 'intel_engine_cmd_parser'
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Excess function parameter 'shadow_map' description in 'intel_engine_cmd_parser'
./drivers/gpu/drm/i915/i915_cmd_parser.c:1436: warning: Excess function parameter 'batch_map' description in 'intel_engine_cmd_parser'


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Fix the debugfs splat from mock selftests
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
                   ` (9 preceding siblings ...)
  (?)
@ 2021-07-19 19:05 ` Patchwork
  -1 siblings, 0 replies; 45+ messages in thread
From: Patchwork @ 2021-07-19 19:05 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 4272 bytes --]

== Series Details ==

Series: Fix the debugfs splat from mock selftests
URL   : https://patchwork.freedesktop.org/series/92729/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10352 -> Patchwork_20646
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/index.html

Known issues
------------

  Here are the changes found in Patchwork_20646 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@cs-gfx:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][1] ([fdo#109271]) +17 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-kbl-soraka/igt@amdgpu/amd_basic@cs-gfx.html

  * igt@i915_selftest@live@execlists:
    - fi-bsw-kefka:       [PASS][2] -> [INCOMPLETE][3] ([i915#2782] / [i915#2940])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/fi-bsw-kefka/igt@i915_selftest@live@execlists.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-bsw-kefka/igt@i915_selftest@live@execlists.html

  * igt@i915_selftest@live@late_gt_pm:
    - fi-bsw-nick:        [PASS][4] -> [DMESG-FAIL][5] ([i915#2927])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/fi-bsw-nick/igt@i915_selftest@live@late_gt_pm.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-bsw-nick/igt@i915_selftest@live@late_gt_pm.html

  * igt@runner@aborted:
    - fi-bsw-kefka:       NOTRUN -> [FAIL][6] ([fdo#109271] / [i915#1436])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-bsw-kefka/igt@runner@aborted.html
    - fi-bsw-nick:        NOTRUN -> [FAIL][7] ([fdo#109271] / [i915#1436])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-bsw-nick/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@gt_heartbeat:
    - {fi-tgl-dsi}:       [DMESG-FAIL][8] ([i915#541]) -> [PASS][9]
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/fi-tgl-dsi/igt@i915_selftest@live@gt_heartbeat.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-tgl-dsi/igt@i915_selftest@live@gt_heartbeat.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-kbl-7500u:       [FAIL][10] ([i915#1372]) -> [PASS][11]
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/fi-kbl-7500u/igt@kms_chamelium@dp-crc-fast.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/fi-kbl-7500u/igt@kms_chamelium@dp-crc-fast.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [i915#1372]: https://gitlab.freedesktop.org/drm/intel/issues/1372
  [i915#1436]: https://gitlab.freedesktop.org/drm/intel/issues/1436
  [i915#2782]: https://gitlab.freedesktop.org/drm/intel/issues/2782
  [i915#2927]: https://gitlab.freedesktop.org/drm/intel/issues/2927
  [i915#2940]: https://gitlab.freedesktop.org/drm/intel/issues/2940
  [i915#541]: https://gitlab.freedesktop.org/drm/intel/issues/541


Participating hosts (40 -> 36)
------------------------------

  Missing    (4): fi-ilk-m540 fi-bsw-cyan fi-bdw-samus fi-hsw-4200u 


Build changes
-------------

  * Linux: CI_DRM_10352 -> Patchwork_20646

  CI-20190529: 20190529
  CI_DRM_10352: 030b944a4af3452eb766bfa3519a495462e1dc8d @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6146: 6caef22e4aafed275771f564d4ea4cab09896ebc @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_20646: 825df92b565927f8e4cfad68f5d8150b6d6ac89e @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

825df92b5659 drm/i915: Make the kmem slab for i915_buddy_block a global
a959d8c1714d drm/ttm: Initialize debugfs from ttm_global_init()
920bf5ea1d38 drm/ttm: Force re-init if ttm_global_init() fails
1f5efaa53ad2 drm/i915: Always call i915_globals_exit() from i915_exit()
175239d5619b drm/i915: Call i915_globals_exit() if pci_register_device() fails
ee28cc1d9eee drm/i915: Call i915_globals_exit() after i915_pmu_exit()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/index.html

[-- Attachment #1.2: Type: text/html, Size: 5273 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/6] drm/ttm: Force re-init if ttm_global_init() fails
  2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-19 19:21     ` Christian König
  -1 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2021-07-19 19:21 UTC (permalink / raw)
  To: Jason Ekstrand, intel-gfx, dri-devel

Am 19.07.21 um 20:30 schrieb Jason Ekstrand:
> If we have a failure, decrement the reference count so that the next
> call to ttm_global_init() will actually do something instead of assume
> everything is all set up.
>
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 62b53b37e4b1 ("drm/ttm: use a static ttm_bo_global instance")
> Cc: Christian König <christian.koenig@amd.com>

Good catch, path is Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/ttm/ttm_device.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 5f31acec3ad76..519deea8e39b7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -100,6 +100,8 @@ static int ttm_global_init(void)
>   	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
>   				&glob->bo_count);
>   out:
> +	if (ret)
> +		--ttm_glob_use_count;
>   	mutex_unlock(&ttm_global_mutex);
>   	return ret;
>   }


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 4/6] drm/ttm: Force re-init if ttm_global_init() fails
@ 2021-07-19 19:21     ` Christian König
  0 siblings, 0 replies; 45+ messages in thread
From: Christian König @ 2021-07-19 19:21 UTC (permalink / raw)
  To: Jason Ekstrand, intel-gfx, dri-devel

Am 19.07.21 um 20:30 schrieb Jason Ekstrand:
> If we have a failure, decrement the reference count so that the next
> call to ttm_global_init() will actually do something instead of assume
> everything is all set up.
>
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 62b53b37e4b1 ("drm/ttm: use a static ttm_bo_global instance")
> Cc: Christian König <christian.koenig@amd.com>

Good catch, path is Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/ttm/ttm_device.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 5f31acec3ad76..519deea8e39b7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -100,6 +100,8 @@ static int ttm_global_init(void)
>   	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
>   				&glob->bo_count);
>   out:
> +	if (ret)
> +		--ttm_glob_use_count;
>   	mutex_unlock(&ttm_global_mutex);
>   	return ret;
>   }

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Intel-gfx] ✓ Fi.CI.IGT: success for Fix the debugfs splat from mock selftests
  2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
                   ` (10 preceding siblings ...)
  (?)
@ 2021-07-19 23:32 ` Patchwork
  -1 siblings, 0 replies; 45+ messages in thread
From: Patchwork @ 2021-07-19 23:32 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 30264 bytes --]

== Series Details ==

Series: Fix the debugfs splat from mock selftests
URL   : https://patchwork.freedesktop.org/series/92729/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10352_full -> Patchwork_20646_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Known issues
------------

  Here are the changes found in Patchwork_20646_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@drm_import_export@flink:
    - shard-kbl:          [PASS][1] -> [INCOMPLETE][2] ([i915#2369])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-kbl3/igt@drm_import_export@flink.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl3/igt@drm_import_export@flink.html

  * igt@feature_discovery@display-4x:
    - shard-tglb:         NOTRUN -> [SKIP][3] ([i915#1839])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb7/igt@feature_discovery@display-4x.html
    - shard-iclb:         NOTRUN -> [SKIP][4] ([i915#1839])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb5/igt@feature_discovery@display-4x.html

  * igt@gem_create@create-massive:
    - shard-snb:          NOTRUN -> [DMESG-WARN][5] ([i915#3002])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-snb5/igt@gem_create@create-massive.html
    - shard-apl:          NOTRUN -> [DMESG-WARN][6] ([i915#3002])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl3/igt@gem_create@create-massive.html

  * igt@gem_ctx_persistence@engines-hostile@bcs0:
    - shard-tglb:         [PASS][7] -> [FAIL][8] ([i915#2410])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-tglb3/igt@gem_ctx_persistence@engines-hostile@bcs0.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb1/igt@gem_ctx_persistence@engines-hostile@bcs0.html

  * igt@gem_ctx_persistence@legacy-engines-mixed-process:
    - shard-snb:          NOTRUN -> [SKIP][9] ([fdo#109271] / [i915#1099]) +3 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-snb6/igt@gem_ctx_persistence@legacy-engines-mixed-process.html

  * igt@gem_eio@unwedge-stress:
    - shard-snb:          NOTRUN -> [FAIL][10] ([i915#3354])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-snb6/igt@gem_eio@unwedge-stress.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-tglb:         [PASS][11] -> [FAIL][12] ([i915#2842]) +1 similar issue
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-tglb3/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb1/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace-solo@rcs0:
    - shard-glk:          [PASS][13] -> [FAIL][14] ([i915#2842]) +1 similar issue
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-glk9/igt@gem_exec_fair@basic-pace-solo@rcs0.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-glk2/igt@gem_exec_fair@basic-pace-solo@rcs0.html

  * igt@gem_exec_fair@basic-pace@vcs1:
    - shard-iclb:         NOTRUN -> [FAIL][15] ([i915#2842])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb4/igt@gem_exec_fair@basic-pace@vcs1.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [PASS][16] -> [FAIL][17] ([i915#2842]) +1 similar issue
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-kbl3/igt@gem_exec_fair@basic-pace@vecs0.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl6/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_exec_whisper@basic-queues-all:
    - shard-glk:          [PASS][18] -> [DMESG-WARN][19] ([i915#118] / [i915#95])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-glk5/igt@gem_exec_whisper@basic-queues-all.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-glk3/igt@gem_exec_whisper@basic-queues-all.html

  * igt@gem_mmap_gtt@cpuset-medium-copy-odd:
    - shard-iclb:         [PASS][20] -> [FAIL][21] ([i915#2428])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb4/igt@gem_mmap_gtt@cpuset-medium-copy-odd.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb6/igt@gem_mmap_gtt@cpuset-medium-copy-odd.html

  * igt@gem_mmap_gtt@cpuset-medium-copy-xy:
    - shard-iclb:         [PASS][22] -> [FAIL][23] ([i915#307])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb7/igt@gem_mmap_gtt@cpuset-medium-copy-xy.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb4/igt@gem_mmap_gtt@cpuset-medium-copy-xy.html

  * igt@gem_pread@exhaustion:
    - shard-iclb:         NOTRUN -> [WARN][24] ([i915#2658])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@gem_pread@exhaustion.html

  * igt@gem_render_copy@y-tiled-to-vebox-linear:
    - shard-iclb:         NOTRUN -> [SKIP][25] ([i915#768])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb6/igt@gem_render_copy@y-tiled-to-vebox-linear.html

  * igt@gem_softpin@noreloc-s3:
    - shard-apl:          [PASS][26] -> [DMESG-WARN][27] ([i915#180]) +3 similar issues
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-apl2/igt@gem_softpin@noreloc-s3.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl6/igt@gem_softpin@noreloc-s3.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-tglb:         NOTRUN -> [FAIL][28] ([i915#3318])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb6/igt@gem_userptr_blits@vma-merge.html
    - shard-skl:          NOTRUN -> [FAIL][29] ([i915#3318])
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl5/igt@gem_userptr_blits@vma-merge.html

  * igt@i915_module_load@reload-with-fault-injection:
    - shard-skl:          [PASS][30] -> [DMESG-WARN][31] ([i915#1982]) +1 similar issue
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl3/igt@i915_module_load@reload-with-fault-injection.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl3/igt@i915_module_load@reload-with-fault-injection.html

  * igt@i915_suspend@forcewake:
    - shard-apl:          NOTRUN -> [DMESG-WARN][32] ([i915#180])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl7/igt@i915_suspend@forcewake.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0-hflip:
    - shard-skl:          NOTRUN -> [SKIP][33] ([fdo#109271] / [i915#3777])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-0-hflip.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-180-hflip:
    - shard-kbl:          NOTRUN -> [SKIP][34] ([fdo#109271] / [i915#3777])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl7/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-180-hflip.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-0-hflip:
    - shard-apl:          NOTRUN -> [SKIP][35] ([fdo#109271] / [i915#3777]) +2 similar issues
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl7/igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-0-hflip.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-180-async-flip:
    - shard-tglb:         NOTRUN -> [SKIP][36] ([fdo#111615])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb6/igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-180-async-flip.html

  * igt@kms_ccs@pipe-d-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc:
    - shard-kbl:          NOTRUN -> [SKIP][37] ([fdo#109271]) +78 similar issues
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl7/igt@kms_ccs@pipe-d-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_chamelium@vga-hpd-after-suspend:
    - shard-iclb:         NOTRUN -> [SKIP][38] ([fdo#109284] / [fdo#111827]) +2 similar issues
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@kms_chamelium@vga-hpd-after-suspend.html

  * igt@kms_chamelium@vga-hpd-for-each-pipe:
    - shard-kbl:          NOTRUN -> [SKIP][39] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl2/igt@kms_chamelium@vga-hpd-for-each-pipe.html

  * igt@kms_chamelium@vga-hpd-without-ddc:
    - shard-snb:          NOTRUN -> [SKIP][40] ([fdo#109271] / [fdo#111827]) +15 similar issues
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-snb7/igt@kms_chamelium@vga-hpd-without-ddc.html

  * igt@kms_color_chamelium@pipe-d-ctm-0-25:
    - shard-tglb:         NOTRUN -> [SKIP][41] ([fdo#109284] / [fdo#111827])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb7/igt@kms_color_chamelium@pipe-d-ctm-0-25.html
    - shard-iclb:         NOTRUN -> [SKIP][42] ([fdo#109278] / [fdo#109284] / [fdo#111827])
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb5/igt@kms_color_chamelium@pipe-d-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-d-ctm-0-5:
    - shard-skl:          NOTRUN -> [SKIP][43] ([fdo#109271] / [fdo#111827]) +3 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@kms_color_chamelium@pipe-d-ctm-0-5.html

  * igt@kms_color_chamelium@pipe-invalid-degamma-lut-sizes:
    - shard-apl:          NOTRUN -> [SKIP][44] ([fdo#109271] / [fdo#111827]) +26 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl1/igt@kms_color_chamelium@pipe-invalid-degamma-lut-sizes.html

  * igt@kms_content_protection@atomic-dpms:
    - shard-apl:          NOTRUN -> [TIMEOUT][45] ([i915#1319]) +2 similar issues
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl7/igt@kms_content_protection@atomic-dpms.html

  * igt@kms_content_protection@srm:
    - shard-kbl:          NOTRUN -> [TIMEOUT][46] ([i915#1319])
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl2/igt@kms_content_protection@srm.html

  * igt@kms_cursor_crc@pipe-a-cursor-32x10-sliding:
    - shard-tglb:         NOTRUN -> [SKIP][47] ([i915#3359])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb6/igt@kms_cursor_crc@pipe-a-cursor-32x10-sliding.html

  * igt@kms_cursor_crc@pipe-a-cursor-suspend:
    - shard-kbl:          [PASS][48] -> [DMESG-WARN][49] ([i915#180]) +4 similar issues
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-kbl6/igt@kms_cursor_crc@pipe-a-cursor-suspend.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl7/igt@kms_cursor_crc@pipe-a-cursor-suspend.html

  * igt@kms_cursor_crc@pipe-d-cursor-64x64-random:
    - shard-iclb:         NOTRUN -> [SKIP][50] ([fdo#109278]) +14 similar issues
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@kms_cursor_crc@pipe-d-cursor-64x64-random.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size:
    - shard-iclb:         NOTRUN -> [SKIP][51] ([fdo#109274] / [fdo#109278])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@kms_cursor_legacy@cursora-vs-flipb-atomic-transitions-varying-size.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size:
    - shard-skl:          [PASS][52] -> [FAIL][53] ([i915#2346] / [i915#533])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl4/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl4/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions-varying-size.html

  * igt@kms_flip@2x-flip-vs-modeset-vs-hang:
    - shard-iclb:         NOTRUN -> [SKIP][54] ([fdo#109274])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb6/igt@kms_flip@2x-flip-vs-modeset-vs-hang.html

  * igt@kms_flip@flip-vs-expired-vblank@a-edp1:
    - shard-skl:          [PASS][55] -> [DMESG-FAIL][56] ([i915#1982])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl1/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl5/igt@kms_flip@flip-vs-expired-vblank@a-edp1.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-move:
    - shard-iclb:         NOTRUN -> [SKIP][57] ([fdo#109280]) +4 similar issues
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb6/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-move.html

  * igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-msflip-blt:
    - shard-tglb:         NOTRUN -> [SKIP][58] ([fdo#111825]) +6 similar issues
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb6/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-shrfb-msflip-blt.html

  * igt@kms_hdr@static-toggle-suspend:
    - shard-iclb:         NOTRUN -> [SKIP][59] ([i915#1187])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@kms_hdr@static-toggle-suspend.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-d-frame-sequence:
    - shard-skl:          NOTRUN -> [SKIP][60] ([fdo#109271] / [i915#533]) +1 similar issue
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@kms_pipe_crc_basic@read-crc-pipe-d-frame-sequence.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d:
    - shard-apl:          NOTRUN -> [SKIP][61] ([fdo#109271] / [i915#533])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl7/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-d.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-basic:
    - shard-apl:          NOTRUN -> [FAIL][62] ([fdo#108145] / [i915#265]) +2 similar issues
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl1/igt@kms_plane_alpha_blend@pipe-a-alpha-basic.html

  * igt@kms_plane_alpha_blend@pipe-b-alpha-transparent-fb:
    - shard-apl:          NOTRUN -> [FAIL][63] ([i915#265])
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl3/igt@kms_plane_alpha_blend@pipe-b-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-c-alpha-7efc:
    - shard-kbl:          NOTRUN -> [FAIL][64] ([fdo#108145] / [i915#265]) +1 similar issue
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl7/igt@kms_plane_alpha_blend@pipe-c-alpha-7efc.html

  * igt@kms_plane_alpha_blend@pipe-c-alpha-basic:
    - shard-skl:          NOTRUN -> [FAIL][65] ([fdo#108145] / [i915#265])
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@kms_plane_alpha_blend@pipe-c-alpha-basic.html

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          [PASS][66] -> [FAIL][67] ([fdo#108145] / [i915#265]) +1 similar issue
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl1/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl5/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html

  * igt@kms_plane_lowres@pipe-b-tiling-none:
    - shard-iclb:         NOTRUN -> [SKIP][68] ([i915#3536])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@kms_plane_lowres@pipe-b-tiling-none.html

  * igt@kms_plane_scaling@scaler-with-clipping-clamping@pipe-c-scaler-with-clipping-clamping:
    - shard-apl:          NOTRUN -> [SKIP][69] ([fdo#109271] / [i915#2733])
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl7/igt@kms_plane_scaling@scaler-with-clipping-clamping@pipe-c-scaler-with-clipping-clamping.html

  * igt@kms_psr2_sf@cursor-plane-update-sf:
    - shard-skl:          NOTRUN -> [SKIP][70] ([fdo#109271] / [i915#658])
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl9/igt@kms_psr2_sf@cursor-plane-update-sf.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-2:
    - shard-apl:          NOTRUN -> [SKIP][71] ([fdo#109271] / [i915#658]) +5 similar issues
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl2/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-2.html

  * igt@kms_psr2_sf@plane-move-sf-dmg-area-3:
    - shard-kbl:          NOTRUN -> [SKIP][72] ([fdo#109271] / [i915#658]) +1 similar issue
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl6/igt@kms_psr2_sf@plane-move-sf-dmg-area-3.html

  * igt@kms_psr@psr2_primary_mmap_gtt:
    - shard-iclb:         NOTRUN -> [SKIP][73] ([fdo#109441])
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@kms_psr@psr2_primary_mmap_gtt.html

  * igt@kms_psr@psr2_sprite_blt:
    - shard-tglb:         NOTRUN -> [FAIL][74] ([i915#132] / [i915#3467])
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb6/igt@kms_psr@psr2_sprite_blt.html

  * igt@kms_universal_plane@disable-primary-vs-flip-pipe-d:
    - shard-snb:          NOTRUN -> [SKIP][75] ([fdo#109271]) +309 similar issues
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-snb6/igt@kms_universal_plane@disable-primary-vs-flip-pipe-d.html

  * igt@kms_vblank@pipe-c-accuracy-idle:
    - shard-skl:          [PASS][76] -> [FAIL][77] ([i915#43])
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl6/igt@kms_vblank@pipe-c-accuracy-idle.html
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl5/igt@kms_vblank@pipe-c-accuracy-idle.html

  * igt@kms_vblank@pipe-d-ts-continuation-idle:
    - shard-apl:          NOTRUN -> [SKIP][78] ([fdo#109271]) +292 similar issues
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl7/igt@kms_vblank@pipe-d-ts-continuation-idle.html

  * igt@kms_writeback@writeback-check-output:
    - shard-apl:          NOTRUN -> [SKIP][79] ([fdo#109271] / [i915#2437]) +1 similar issue
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl3/igt@kms_writeback@writeback-check-output.html

  * igt@kms_writeback@writeback-fb-id:
    - shard-kbl:          NOTRUN -> [SKIP][80] ([fdo#109271] / [i915#2437])
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl7/igt@kms_writeback@writeback-fb-id.html

  * igt@nouveau_crc@pipe-d-ctx-flip-detection:
    - shard-skl:          NOTRUN -> [SKIP][81] ([fdo#109271]) +65 similar issues
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl9/igt@nouveau_crc@pipe-d-ctx-flip-detection.html

  * igt@prime_vgem@coherency-gtt:
    - shard-tglb:         NOTRUN -> [SKIP][82] ([fdo#111656])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb6/igt@prime_vgem@coherency-gtt.html

  * igt@sysfs_clients@busy:
    - shard-skl:          NOTRUN -> [SKIP][83] ([fdo#109271] / [i915#2994])
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@sysfs_clients@busy.html

  * igt@sysfs_clients@fair-7:
    - shard-apl:          NOTRUN -> [SKIP][84] ([fdo#109271] / [i915#2994]) +3 similar issues
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl1/igt@sysfs_clients@fair-7.html
    - shard-kbl:          NOTRUN -> [SKIP][85] ([fdo#109271] / [i915#2994]) +1 similar issue
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl6/igt@sysfs_clients@fair-7.html

  * igt@sysfs_clients@sema-10:
    - shard-iclb:         NOTRUN -> [SKIP][86] ([i915#2994])
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@sysfs_clients@sema-10.html

  
#### Possible fixes ####

  * igt@gem_eio@in-flight-contexts-1us:
    - shard-tglb:         [TIMEOUT][87] ([i915#3063]) -> [PASS][88]
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-tglb1/igt@gem_eio@in-flight-contexts-1us.html
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb2/igt@gem_eio@in-flight-contexts-1us.html

  * igt@gem_exec_fair@basic-none-share@rcs0:
    - shard-tglb:         [FAIL][89] ([i915#2842]) -> [PASS][90] +1 similar issue
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-tglb5/igt@gem_exec_fair@basic-none-share@rcs0.html
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb7/igt@gem_exec_fair@basic-none-share@rcs0.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-glk:          [FAIL][91] ([i915#2842]) -> [PASS][92]
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-glk4/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-glk1/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace@vcs1:
    - shard-kbl:          [SKIP][93] ([fdo#109271]) -> [PASS][94] +1 similar issue
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-kbl3/igt@gem_exec_fair@basic-pace@vcs1.html
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl6/igt@gem_exec_fair@basic-pace@vcs1.html

  * igt@gem_exec_fair@basic-throttle@rcs0:
    - shard-iclb:         [FAIL][95] ([i915#2849]) -> [PASS][96]
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb6/igt@gem_exec_fair@basic-throttle@rcs0.html
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb2/igt@gem_exec_fair@basic-throttle@rcs0.html

  * igt@gem_exec_whisper@basic-fds-priority:
    - shard-glk:          [DMESG-WARN][97] ([i915#118] / [i915#95]) -> [PASS][98]
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-glk6/igt@gem_exec_whisper@basic-fds-priority.html
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-glk4/igt@gem_exec_whisper@basic-fds-priority.html

  * igt@gem_mmap_gtt@cpuset-big-copy-odd:
    - shard-iclb:         [FAIL][99] ([i915#307]) -> [PASS][100]
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb6/igt@gem_mmap_gtt@cpuset-big-copy-odd.html
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb3/igt@gem_mmap_gtt@cpuset-big-copy-odd.html

  * igt@gem_workarounds@suspend-resume:
    - shard-kbl:          [DMESG-WARN][101] ([i915#180]) -> [PASS][102] +2 similar issues
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-kbl4/igt@gem_workarounds@suspend-resume.html
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-kbl2/igt@gem_workarounds@suspend-resume.html

  * igt@gem_workarounds@suspend-resume-context:
    - shard-skl:          [INCOMPLETE][103] ([i915#198]) -> [PASS][104]
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl9/igt@gem_workarounds@suspend-resume-context.html
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@gem_workarounds@suspend-resume-context.html

  * igt@i915_selftest@mock@dmabuf:
    - shard-iclb:         [DMESG-WARN][105] ([i915#3746]) -> [PASS][106] +17 similar issues
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb3/igt@i915_selftest@mock@dmabuf.html
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb2/igt@i915_selftest@mock@dmabuf.html

  * igt@i915_selftest@mock@objects:
    - shard-skl:          [DMESG-WARN][107] ([i915#3746]) -> [PASS][108] +17 similar issues
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl8/igt@i915_selftest@mock@objects.html
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl2/igt@i915_selftest@mock@objects.html
    - shard-tglb:         [DMESG-WARN][109] ([i915#3746]) -> [PASS][110] +17 similar issues
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-tglb3/igt@i915_selftest@mock@objects.html
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb3/igt@i915_selftest@mock@objects.html

  * igt@i915_selftest@mock@uncore:
    - shard-glk:          [DMESG-WARN][111] ([i915#3746]) -> [PASS][112] +17 similar issues
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-glk7/igt@i915_selftest@mock@uncore.html
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-glk1/igt@i915_selftest@mock@uncore.html

  * igt@i915_selftest@mock@vma:
    - shard-snb:          [DMESG-WARN][113] ([i915#3746]) -> [PASS][114] +17 similar issues
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-snb5/igt@i915_selftest@mock@vma.html
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-snb7/igt@i915_selftest@mock@vma.html

  * igt@i915_suspend@fence-restore-tiled2untiled:
    - shard-apl:          [DMESG-WARN][115] ([i915#180]) -> [PASS][116] +2 similar issues
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-apl2/igt@i915_suspend@fence-restore-tiled2untiled.html
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-apl1/igt@i915_suspend@fence-restore-tiled2untiled.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic:
    - shard-skl:          [FAIL][117] ([i915#2346]) -> [PASS][118]
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl4/igt@kms_cursor_legacy@flip-vs-cursor-atomic.html
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl4/igt@kms_cursor_legacy@flip-vs-cursor-atomic.html

  * igt@kms_draw_crc@draw-method-xrgb2101010-mmap-cpu-ytiled:
    - shard-skl:          [FAIL][119] -> [PASS][120] +1 similar issue
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl4/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-cpu-ytiled.html
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl4/igt@kms_draw_crc@draw-method-xrgb2101010-mmap-cpu-ytiled.html

  * igt@kms_flip@2x-flip-vs-expired-vblank@ab-hdmi-a1-hdmi-a2:
    - shard-glk:          [FAIL][121] ([i915#79]) -> [PASS][122]
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-glk4/igt@kms_flip@2x-flip-vs-expired-vblank@ab-hdmi-a1-hdmi-a2.html
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-glk3/igt@kms_flip@2x-flip-vs-expired-vblank@ab-hdmi-a1-hdmi-a2.html

  * igt@kms_flip@plain-flip-fb-recreate-interruptible@a-edp1:
    - shard-skl:          [FAIL][123] ([i915#2122]) -> [PASS][124]
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl2/igt@kms_flip@plain-flip-fb-recreate-interruptible@a-edp1.html
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl1/igt@kms_flip@plain-flip-fb-recreate-interruptible@a-edp1.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          [FAIL][125] ([i915#1188]) -> [PASS][126]
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl3/igt@kms_hdr@bpc-switch-dpms.html
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl6/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min:
    - shard-skl:          [FAIL][127] ([fdo#108145] / [i915#265]) -> [PASS][128]
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl4/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min.html
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl4/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-min.html

  * igt@kms_psr@psr2_cursor_render:
    - shard-iclb:         [SKIP][129] ([fdo#109441]) -> [PASS][130] +2 similar issues
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb6/igt@kms_psr@psr2_cursor_render.html
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb2/igt@kms_psr@psr2_cursor_render.html

  * igt@perf@polling-small-buf:
    - shard-skl:          [FAIL][131] ([i915#1722]) -> [PASS][132]
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-skl10/igt@perf@polling-small-buf.html
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-skl1/igt@perf@polling-small-buf.html

  
#### Warnings ####

  * igt@i915_pm_dc@dc3co-vpb-simulation:
    - shard-iclb:         [SKIP][133] ([i915#588]) -> [SKIP][134] ([i915#658])
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb2/igt@i915_pm_dc@dc3co-vpb-simulation.html
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb8/igt@i915_pm_dc@dc3co-vpb-simulation.html

  * igt@i915_pm_rc6_residency@rc6-fence:
    - shard-iclb:         [WARN][135] ([i915#1804] / [i915#2684]) -> [WARN][136] ([i915#2684])
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb3/igt@i915_pm_rc6_residency@rc6-fence.html
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb5/igt@i915_pm_rc6_residency@rc6-fence.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1:
    - shard-iclb:         [SKIP][137] ([i915#2920]) -> [SKIP][138] ([i915#658]) +2 similar issues
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1.html
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb5/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2:
    - shard-iclb:         [SKIP][139] ([i915#658]) -> [SKIP][140] ([i915#2920]) +1 similar issue
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-iclb4/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2.html
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2.html

  * igt@kms_psr@psr2_basic:
    - shard-tglb:         [DMESG-FAIL][141] ([i915#132]) -> [FAIL][142] ([i915#132] / [i915#3467])
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10352/shard-tglb3/igt@kms_psr@psr2_basic.html
   [142]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/shard-tglb3/igt@kms_psr@psr2_basic.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][143], [FAIL]

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20646/index.html

[-- Attachment #1.2: Type: text/html, Size: 33568 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-20  8:25     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20  8:25 UTC (permalink / raw)
  To: Jason Ekstrand, intel-gfx, dri-devel


On 19/07/2021 19:30, Jason Ekstrand wrote:
> If the driver was not fully loaded, we may still have globals lying
> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> of memory slabs.  This can happen two ways: use_kms = false and if we've
> run mock selftests.  In either case, we have an early exit from
> i915_init which happens after i915_globals_init() and we need to clean
> up those globals.  While we're here, add an explicit boolean instead of
> using a random field from i915_pci_device to detect partial loads.
> 
> The mock selftests case gets especially sticky.  The load isn't entirely
> a no-op.  We actually do quite a bit inside those selftests including
> allocating a bunch of mock objects and running tests on them.  Once all
> those tests are complete, we exit early from i915_init().  Perviously,
> i915_init() would return a non-zero error code on failure and a zero
> error code on success.  In the success case, we would get to i915_exit()
> and check i915_pci_driver.driver.owner to detect if i915_init exited early
> and do nothing.  In the failure case, we would fail i915_init() but
> there would be no opportunity to clean up globals.
> 
> The most annoying part is that you don't actually notice the failure as
> part of the self-tests since leaking a bit of memory, while bad, doesn't
> result in anything observable from userspace.  Instead, the next time we
> load the driver (usually for next IGT test), i915_globals_init() gets
> invoked again, we go to allocate a bunch of new memory slabs, those
> implicitly create debugfs entries, and debugfs warns that we're trying
> to create directories and files that already exist.  Since this all
> happens as part of the next driver load, it shows up in the dmesg-warn
> of whatever IGT test ran after the mock selftests.

Story checks out but I totally don't get why it wouldn't be noticed 
until now. Was it perhaps part of the selfetsts contract that a reboot 
is required after failure?

> While the obvious thing to do here might be to call i915_globals_exit()
> after selftests, that's not actually safe.  The dma-buf selftests call
> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> the resulting dmabuf which calls fput() on the file.  However, fput()
> isn't immediate and gets flushed right before syscall returns.  This
> means that all the fput()s from the selftests don't happen until right
> before the module load syscall used to fire off the selftests returns
> which is after i915_init().  If we call i915_globals_exit() in
> i915_init() after selftests, we end up freeing slabs out from under
> objects which won't get released until fput() is flushed at the end of
> the module load.

Nasty. Wasn't visible while globals memory leak was "in place". :I

> The solution here is to let i915_init() return success early and detect
> the early success in i915_exit() and only tear down globals and nothing
> else.  This way the module loads successfully, regardless of the success
> or failure of the tests.  Because we've not enumerated any PCI devices,
> no device nodes are created and it's entirely useless from userspace.
> The only thing the module does at that point is hold on to a bit of
> memory until we unload it and i915_exit() is called.  Importantly, this
> means that everything from our selftests has the ability to properly
> flush out between i915_init() and i915_exit() because there are a couple
> syscall boundaries in between.

When you say "couple of syscall boundaries" you mean exactly two (module 
init/unload) or there is more to it? Like why "couple" is needed and not 
just that the module load syscall has exited? That part sounds 
potentially dodgy. What mechanism is used by the delayed flush?

Have you checked how this change interacts with the test runner and CI?

> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> Cc: Daniel Vetter <daniel@ffwll.ch>
> ---
>   drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
>   1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 4e627b57d31a2..24e4e54516936 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
>   	.driver.pm = &i915_pm_ops,
>   };
>   
> +static bool i915_fully_loaded = false;

No need to initialize.

> +
>   static int __init i915_init(void)
>   {
>   	bool use_kms = true;
>   	int err;
>   
> +	i915_fully_loaded = false;

Ditto.

> +
>   	err = i915_globals_init();
>   	if (err)
>   		return err;
>   
> +	/* i915_mock_selftests() only returns zero if no mock subtests were


/*
  * Please use this multi line comment style in i915.
  */


> +	 * run.  If we get any non-zero error code, we return early here.
> +	 * We always return success because selftests may have allocated
> +	 * objects from slabs which will get cleaned up by i915_exit().  We
> +	 * could attempt to clean up immediately and fail module load but,
> +	 * thanks to interactions with other parts of the kernel (struct
> +	 * file, in particular), it's safer to let the module fully load
> +	 * and then clean up on unload.
> +	 */
>   	err = i915_mock_selftests();
>   	if (err)
> -		return err > 0 ? 0 : err;
> +		return 0;
>   
>   	/*
>   	 * Enable KMS by default, unless explicitly overriden by
> @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
>   		return 0;
>   	}
>   
> +	/* After this point, i915_init() must either fully succeed or
> +	 * properly tear everything down and fail.  We don't have separate
> +	 * flags for each set-up bit.
> +	 */
> +	i915_fully_loaded = true;
> +
>   	i915_pmu_init();
>   
>   	err = pci_register_driver(&i915_pci_driver);
> @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
>   
>   static void __exit i915_exit(void)
>   {
> -	if (!i915_pci_driver.driver.owner)
> -		return;
> -
> -	i915_perf_sysctl_unregister();
> -	pci_unregister_driver(&i915_pci_driver);
> -	i915_pmu_exit();
> +	if (i915_fully_loaded) {
> +		i915_perf_sysctl_unregister();
> +		pci_unregister_driver(&i915_pci_driver);
> +		i915_pmu_exit();
> +	}
>   	i915_globals_exit();
>   }
>   
> 

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20  8:25     ` Tvrtko Ursulin
  0 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20  8:25 UTC (permalink / raw)
  To: Jason Ekstrand, intel-gfx, dri-devel


On 19/07/2021 19:30, Jason Ekstrand wrote:
> If the driver was not fully loaded, we may still have globals lying
> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> of memory slabs.  This can happen two ways: use_kms = false and if we've
> run mock selftests.  In either case, we have an early exit from
> i915_init which happens after i915_globals_init() and we need to clean
> up those globals.  While we're here, add an explicit boolean instead of
> using a random field from i915_pci_device to detect partial loads.
> 
> The mock selftests case gets especially sticky.  The load isn't entirely
> a no-op.  We actually do quite a bit inside those selftests including
> allocating a bunch of mock objects and running tests on them.  Once all
> those tests are complete, we exit early from i915_init().  Perviously,
> i915_init() would return a non-zero error code on failure and a zero
> error code on success.  In the success case, we would get to i915_exit()
> and check i915_pci_driver.driver.owner to detect if i915_init exited early
> and do nothing.  In the failure case, we would fail i915_init() but
> there would be no opportunity to clean up globals.
> 
> The most annoying part is that you don't actually notice the failure as
> part of the self-tests since leaking a bit of memory, while bad, doesn't
> result in anything observable from userspace.  Instead, the next time we
> load the driver (usually for next IGT test), i915_globals_init() gets
> invoked again, we go to allocate a bunch of new memory slabs, those
> implicitly create debugfs entries, and debugfs warns that we're trying
> to create directories and files that already exist.  Since this all
> happens as part of the next driver load, it shows up in the dmesg-warn
> of whatever IGT test ran after the mock selftests.

Story checks out but I totally don't get why it wouldn't be noticed 
until now. Was it perhaps part of the selfetsts contract that a reboot 
is required after failure?

> While the obvious thing to do here might be to call i915_globals_exit()
> after selftests, that's not actually safe.  The dma-buf selftests call
> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> the resulting dmabuf which calls fput() on the file.  However, fput()
> isn't immediate and gets flushed right before syscall returns.  This
> means that all the fput()s from the selftests don't happen until right
> before the module load syscall used to fire off the selftests returns
> which is after i915_init().  If we call i915_globals_exit() in
> i915_init() after selftests, we end up freeing slabs out from under
> objects which won't get released until fput() is flushed at the end of
> the module load.

Nasty. Wasn't visible while globals memory leak was "in place". :I

> The solution here is to let i915_init() return success early and detect
> the early success in i915_exit() and only tear down globals and nothing
> else.  This way the module loads successfully, regardless of the success
> or failure of the tests.  Because we've not enumerated any PCI devices,
> no device nodes are created and it's entirely useless from userspace.
> The only thing the module does at that point is hold on to a bit of
> memory until we unload it and i915_exit() is called.  Importantly, this
> means that everything from our selftests has the ability to properly
> flush out between i915_init() and i915_exit() because there are a couple
> syscall boundaries in between.

When you say "couple of syscall boundaries" you mean exactly two (module 
init/unload) or there is more to it? Like why "couple" is needed and not 
just that the module load syscall has exited? That part sounds 
potentially dodgy. What mechanism is used by the delayed flush?

Have you checked how this change interacts with the test runner and CI?

> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> Cc: Daniel Vetter <daniel@ffwll.ch>
> ---
>   drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
>   1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 4e627b57d31a2..24e4e54516936 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
>   	.driver.pm = &i915_pm_ops,
>   };
>   
> +static bool i915_fully_loaded = false;

No need to initialize.

> +
>   static int __init i915_init(void)
>   {
>   	bool use_kms = true;
>   	int err;
>   
> +	i915_fully_loaded = false;

Ditto.

> +
>   	err = i915_globals_init();
>   	if (err)
>   		return err;
>   
> +	/* i915_mock_selftests() only returns zero if no mock subtests were


/*
  * Please use this multi line comment style in i915.
  */


> +	 * run.  If we get any non-zero error code, we return early here.
> +	 * We always return success because selftests may have allocated
> +	 * objects from slabs which will get cleaned up by i915_exit().  We
> +	 * could attempt to clean up immediately and fail module load but,
> +	 * thanks to interactions with other parts of the kernel (struct
> +	 * file, in particular), it's safer to let the module fully load
> +	 * and then clean up on unload.
> +	 */
>   	err = i915_mock_selftests();
>   	if (err)
> -		return err > 0 ? 0 : err;
> +		return 0;
>   
>   	/*
>   	 * Enable KMS by default, unless explicitly overriden by
> @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
>   		return 0;
>   	}
>   
> +	/* After this point, i915_init() must either fully succeed or
> +	 * properly tear everything down and fail.  We don't have separate
> +	 * flags for each set-up bit.
> +	 */
> +	i915_fully_loaded = true;
> +
>   	i915_pmu_init();
>   
>   	err = pci_register_driver(&i915_pci_driver);
> @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
>   
>   static void __exit i915_exit(void)
>   {
> -	if (!i915_pci_driver.driver.owner)
> -		return;
> -
> -	i915_perf_sysctl_unregister();
> -	pci_unregister_driver(&i915_pci_driver);
> -	i915_pmu_exit();
> +	if (i915_fully_loaded) {
> +		i915_perf_sysctl_unregister();
> +		pci_unregister_driver(&i915_pci_driver);
> +		i915_pmu_exit();
> +	}
>   	i915_globals_exit();
>   }
>   
> 

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/6] drm/i915: Call i915_globals_exit() if pci_register_device() fails
  2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-20 14:11     ` Daniel Vetter
  -1 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-20 14:11 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Mon, Jul 19, 2021 at 01:30:43PM -0500, Jason Ekstrand wrote:
> In the unlikely event that pci_register_device() fails, we were tearing
> down our PMU setup but not globals.  This leaves a bunch of memory slabs
> lying around.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> Cc: Daniel Vetter <daniel@ffwll.ch>

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>


> ---
>  drivers/gpu/drm/i915/i915_globals.c | 4 ++--
>  drivers/gpu/drm/i915/i915_pci.c     | 1 +
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
> index 77f1911c463b8..87267e1d2ad92 100644
> --- a/drivers/gpu/drm/i915/i915_globals.c
> +++ b/drivers/gpu/drm/i915/i915_globals.c
> @@ -138,7 +138,7 @@ void i915_globals_unpark(void)
>  	atomic_inc(&active);
>  }
>  
> -static void __exit __i915_globals_flush(void)
> +static void __i915_globals_flush(void)
>  {
>  	atomic_inc(&active); /* skip shrinking */
>  
> @@ -148,7 +148,7 @@ static void __exit __i915_globals_flush(void)
>  	atomic_dec(&active);
>  }
>  
> -void __exit i915_globals_exit(void)
> +void i915_globals_exit(void)
>  {
>  	GEM_BUG_ON(atomic_read(&active));
>  
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 50ed93b03e582..4e627b57d31a2 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -1230,6 +1230,7 @@ static int __init i915_init(void)
>  	err = pci_register_driver(&i915_pci_driver);
>  	if (err) {
>  		i915_pmu_exit();
> +		i915_globals_exit();
>  		return err;
>  	}
>  
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 2/6] drm/i915: Call i915_globals_exit() if pci_register_device() fails
@ 2021-07-20 14:11     ` Daniel Vetter
  0 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-20 14:11 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Mon, Jul 19, 2021 at 01:30:43PM -0500, Jason Ekstrand wrote:
> In the unlikely event that pci_register_device() fails, we were tearing
> down our PMU setup but not globals.  This leaves a bunch of memory slabs
> lying around.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> Cc: Daniel Vetter <daniel@ffwll.ch>

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>


> ---
>  drivers/gpu/drm/i915/i915_globals.c | 4 ++--
>  drivers/gpu/drm/i915/i915_pci.c     | 1 +
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_globals.c b/drivers/gpu/drm/i915/i915_globals.c
> index 77f1911c463b8..87267e1d2ad92 100644
> --- a/drivers/gpu/drm/i915/i915_globals.c
> +++ b/drivers/gpu/drm/i915/i915_globals.c
> @@ -138,7 +138,7 @@ void i915_globals_unpark(void)
>  	atomic_inc(&active);
>  }
>  
> -static void __exit __i915_globals_flush(void)
> +static void __i915_globals_flush(void)
>  {
>  	atomic_inc(&active); /* skip shrinking */
>  
> @@ -148,7 +148,7 @@ static void __exit __i915_globals_flush(void)
>  	atomic_dec(&active);
>  }
>  
> -void __exit i915_globals_exit(void)
> +void i915_globals_exit(void)
>  {
>  	GEM_BUG_ON(atomic_read(&active));
>  
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 50ed93b03e582..4e627b57d31a2 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -1230,6 +1230,7 @@ static int __init i915_init(void)
>  	err = pci_register_driver(&i915_pci_driver);
>  	if (err) {
>  		i915_pmu_exit();
> +		i915_globals_exit();
>  		return err;
>  	}
>  
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-20 14:18     ` Daniel Vetter
  -1 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> If the driver was not fully loaded, we may still have globals lying
> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> of memory slabs.  This can happen two ways: use_kms = false and if we've
> run mock selftests.  In either case, we have an early exit from
> i915_init which happens after i915_globals_init() and we need to clean
> up those globals.  While we're here, add an explicit boolean instead of
> using a random field from i915_pci_device to detect partial loads.
> 
> The mock selftests case gets especially sticky.  The load isn't entirely
> a no-op.  We actually do quite a bit inside those selftests including
> allocating a bunch of mock objects and running tests on them.  Once all
> those tests are complete, we exit early from i915_init().  Perviously,
> i915_init() would return a non-zero error code on failure and a zero
> error code on success.  In the success case, we would get to i915_exit()
> and check i915_pci_driver.driver.owner to detect if i915_init exited early
> and do nothing.  In the failure case, we would fail i915_init() but
> there would be no opportunity to clean up globals.
> 
> The most annoying part is that you don't actually notice the failure as
> part of the self-tests since leaking a bit of memory, while bad, doesn't
> result in anything observable from userspace.  Instead, the next time we
> load the driver (usually for next IGT test), i915_globals_init() gets
> invoked again, we go to allocate a bunch of new memory slabs, those
> implicitly create debugfs entries, and debugfs warns that we're trying
> to create directories and files that already exist.  Since this all
> happens as part of the next driver load, it shows up in the dmesg-warn
> of whatever IGT test ran after the mock selftests.
> 
> While the obvious thing to do here might be to call i915_globals_exit()
> after selftests, that's not actually safe.  The dma-buf selftests call
> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> the resulting dmabuf which calls fput() on the file.  However, fput()
> isn't immediate and gets flushed right before syscall returns.  This
> means that all the fput()s from the selftests don't happen until right
> before the module load syscall used to fire off the selftests returns
> which is after i915_init().  If we call i915_globals_exit() in
> i915_init() after selftests, we end up freeing slabs out from under
> objects which won't get released until fput() is flushed at the end of
> the module load.
> 
> The solution here is to let i915_init() return success early and detect
> the early success in i915_exit() and only tear down globals and nothing
> else.  This way the module loads successfully, regardless of the success
> or failure of the tests.  Because we've not enumerated any PCI devices,
> no device nodes are created and it's entirely useless from userspace.
> The only thing the module does at that point is hold on to a bit of
> memory until we unload it and i915_exit() is called.  Importantly, this
> means that everything from our selftests has the ability to properly
> flush out between i915_init() and i915_exit() because there are a couple
> syscall boundaries in between.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> Cc: Daniel Vetter <daniel@ffwll.ch>
> ---
>  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 4e627b57d31a2..24e4e54516936 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
>  	.driver.pm = &i915_pm_ops,
>  };
>  
> +static bool i915_fully_loaded = false;
> +
>  static int __init i915_init(void)
>  {
>  	bool use_kms = true;
>  	int err;
>  
> +	i915_fully_loaded = false;
> +
>  	err = i915_globals_init();
>  	if (err)
>  		return err;
>  
> +	/* i915_mock_selftests() only returns zero if no mock subtests were
> +	 * run.  If we get any non-zero error code, we return early here.
> +	 * We always return success because selftests may have allocated
> +	 * objects from slabs which will get cleaned up by i915_exit().  We
> +	 * could attempt to clean up immediately and fail module load but,
> +	 * thanks to interactions with other parts of the kernel (struct
> +	 * file, in particular), it's safer to let the module fully load
> +	 * and then clean up on unload.
> +	 */
>  	err = i915_mock_selftests();
>  	if (err)
> -		return err > 0 ? 0 : err;
> +		return 0;

At least the module options still claim that you can run selftests and
still load the driver. Which makes sense for perf/hw selftests, since
those need the driver, but would result in the same old bug resurfacing
that you're trying to fix there.

Is that description just confused and needs some fixing, or do we have a
gap here?

Patch itself looks reasonable, with the nits from Tvrtko addressed:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

>  
>  	/*
>  	 * Enable KMS by default, unless explicitly overriden by
> @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
>  		return 0;
>  	}
>  
> +	/* After this point, i915_init() must either fully succeed or
> +	 * properly tear everything down and fail.  We don't have separate
> +	 * flags for each set-up bit.
> +	 */
> +	i915_fully_loaded = true;
> +
>  	i915_pmu_init();
>  
>  	err = pci_register_driver(&i915_pci_driver);
> @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
>  
>  static void __exit i915_exit(void)
>  {
> -	if (!i915_pci_driver.driver.owner)
> -		return;
> -
> -	i915_perf_sysctl_unregister();
> -	pci_unregister_driver(&i915_pci_driver);
> -	i915_pmu_exit();
> +	if (i915_fully_loaded) {
> +		i915_perf_sysctl_unregister();
> +		pci_unregister_driver(&i915_pci_driver);
> +		i915_pmu_exit();
> +	}
>  	i915_globals_exit();
>  }
>  
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 14:18     ` Daniel Vetter
  0 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-20 14:18 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> If the driver was not fully loaded, we may still have globals lying
> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> of memory slabs.  This can happen two ways: use_kms = false and if we've
> run mock selftests.  In either case, we have an early exit from
> i915_init which happens after i915_globals_init() and we need to clean
> up those globals.  While we're here, add an explicit boolean instead of
> using a random field from i915_pci_device to detect partial loads.
> 
> The mock selftests case gets especially sticky.  The load isn't entirely
> a no-op.  We actually do quite a bit inside those selftests including
> allocating a bunch of mock objects and running tests on them.  Once all
> those tests are complete, we exit early from i915_init().  Perviously,
> i915_init() would return a non-zero error code on failure and a zero
> error code on success.  In the success case, we would get to i915_exit()
> and check i915_pci_driver.driver.owner to detect if i915_init exited early
> and do nothing.  In the failure case, we would fail i915_init() but
> there would be no opportunity to clean up globals.
> 
> The most annoying part is that you don't actually notice the failure as
> part of the self-tests since leaking a bit of memory, while bad, doesn't
> result in anything observable from userspace.  Instead, the next time we
> load the driver (usually for next IGT test), i915_globals_init() gets
> invoked again, we go to allocate a bunch of new memory slabs, those
> implicitly create debugfs entries, and debugfs warns that we're trying
> to create directories and files that already exist.  Since this all
> happens as part of the next driver load, it shows up in the dmesg-warn
> of whatever IGT test ran after the mock selftests.
> 
> While the obvious thing to do here might be to call i915_globals_exit()
> after selftests, that's not actually safe.  The dma-buf selftests call
> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> the resulting dmabuf which calls fput() on the file.  However, fput()
> isn't immediate and gets flushed right before syscall returns.  This
> means that all the fput()s from the selftests don't happen until right
> before the module load syscall used to fire off the selftests returns
> which is after i915_init().  If we call i915_globals_exit() in
> i915_init() after selftests, we end up freeing slabs out from under
> objects which won't get released until fput() is flushed at the end of
> the module load.
> 
> The solution here is to let i915_init() return success early and detect
> the early success in i915_exit() and only tear down globals and nothing
> else.  This way the module loads successfully, regardless of the success
> or failure of the tests.  Because we've not enumerated any PCI devices,
> no device nodes are created and it's entirely useless from userspace.
> The only thing the module does at that point is hold on to a bit of
> memory until we unload it and i915_exit() is called.  Importantly, this
> means that everything from our selftests has the ability to properly
> flush out between i915_init() and i915_exit() because there are a couple
> syscall boundaries in between.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> Cc: Daniel Vetter <daniel@ffwll.ch>
> ---
>  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> index 4e627b57d31a2..24e4e54516936 100644
> --- a/drivers/gpu/drm/i915/i915_pci.c
> +++ b/drivers/gpu/drm/i915/i915_pci.c
> @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
>  	.driver.pm = &i915_pm_ops,
>  };
>  
> +static bool i915_fully_loaded = false;
> +
>  static int __init i915_init(void)
>  {
>  	bool use_kms = true;
>  	int err;
>  
> +	i915_fully_loaded = false;
> +
>  	err = i915_globals_init();
>  	if (err)
>  		return err;
>  
> +	/* i915_mock_selftests() only returns zero if no mock subtests were
> +	 * run.  If we get any non-zero error code, we return early here.
> +	 * We always return success because selftests may have allocated
> +	 * objects from slabs which will get cleaned up by i915_exit().  We
> +	 * could attempt to clean up immediately and fail module load but,
> +	 * thanks to interactions with other parts of the kernel (struct
> +	 * file, in particular), it's safer to let the module fully load
> +	 * and then clean up on unload.
> +	 */
>  	err = i915_mock_selftests();
>  	if (err)
> -		return err > 0 ? 0 : err;
> +		return 0;

At least the module options still claim that you can run selftests and
still load the driver. Which makes sense for perf/hw selftests, since
those need the driver, but would result in the same old bug resurfacing
that you're trying to fix there.

Is that description just confused and needs some fixing, or do we have a
gap here?

Patch itself looks reasonable, with the nits from Tvrtko addressed:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

>  
>  	/*
>  	 * Enable KMS by default, unless explicitly overriden by
> @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
>  		return 0;
>  	}
>  
> +	/* After this point, i915_init() must either fully succeed or
> +	 * properly tear everything down and fail.  We don't have separate
> +	 * flags for each set-up bit.
> +	 */
> +	i915_fully_loaded = true;
> +
>  	i915_pmu_init();
>  
>  	err = pci_register_driver(&i915_pci_driver);
> @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
>  
>  static void __exit i915_exit(void)
>  {
> -	if (!i915_pci_driver.driver.owner)
> -		return;
> -
> -	i915_perf_sysctl_unregister();
> -	pci_unregister_driver(&i915_pci_driver);
> -	i915_pmu_exit();
> +	if (i915_fully_loaded) {
> +		i915_perf_sysctl_unregister();
> +		pci_unregister_driver(&i915_pci_driver);
> +		i915_pmu_exit();
> +	}
>  	i915_globals_exit();
>  }
>  
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/6] drm/ttm: Initialize debugfs from ttm_global_init()
  2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-20 14:22     ` Daniel Vetter
  -1 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-20 14:22 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Mon, Jul 19, 2021 at 01:30:46PM -0500, Jason Ekstrand wrote:
> We create a bunch of debugfs entries as a side-effect of
> ttm_global_init() and then never clean them up.  This isn't usually a
> problem because we free the whole debugfs directory on module unload.
> However, if the global reference count ever goes to zero and then
> ttm_global_init() is called again, we'll re-create those debugfs entries
> and debugfs will complain in dmesg that we're creating entries that
> already exist.  This patch fixes this problem by changing the lifetime
> of the whole TTM debugfs directory to match that of the TTM global
> state.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> ---
>  drivers/gpu/drm/ttm/ttm_device.c | 12 ++++++++++++
>  drivers/gpu/drm/ttm/ttm_module.c |  4 ----
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 519deea8e39b7..74e3b460132b3 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -44,6 +44,8 @@ static unsigned ttm_glob_use_count;
>  struct ttm_global ttm_glob;
>  EXPORT_SYMBOL(ttm_glob);
>  
> +struct dentry *ttm_debugfs_root;
> +
>  static void ttm_global_release(void)
>  {
>  	struct ttm_global *glob = &ttm_glob;
> @@ -53,6 +55,7 @@ static void ttm_global_release(void)
>  		goto out;
>  
>  	ttm_pool_mgr_fini();
> +	debugfs_remove(ttm_debugfs_root);
>  
>  	__free_page(glob->dummy_read_page);
>  	memset(glob, 0, sizeof(*glob));
> @@ -73,6 +76,13 @@ static int ttm_global_init(void)
>  
>  	si_meminfo(&si);
>  
> +	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
> +	if (IS_ERR(ttm_debugfs_root)) {
> +		ret = PTR_ERR(ttm_debugfs_root);
> +		ttm_debugfs_root = NULL;
> +		goto out;
> +	}
> +
>  	/* Limit the number of pages in the pool to about 50% of the total
>  	 * system memory.
>  	 */
> @@ -100,6 +110,8 @@ static int ttm_global_init(void)
>  	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
>  				&glob->bo_count);
>  out:
> +	if (ret && ttm_debugfs_root)
> +		debugfs_remove(ttm_debugfs_root);
>  	if (ret)
>  		--ttm_glob_use_count;
>  	mutex_unlock(&ttm_global_mutex);
> diff --git a/drivers/gpu/drm/ttm/ttm_module.c b/drivers/gpu/drm/ttm/ttm_module.c
> index 997c458f68a9a..88554f2db11fe 100644
> --- a/drivers/gpu/drm/ttm/ttm_module.c
> +++ b/drivers/gpu/drm/ttm/ttm_module.c
> @@ -72,17 +72,13 @@ pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp)
>  	return tmp;
>  }
>  
> -struct dentry *ttm_debugfs_root;
> -
>  static int __init ttm_init(void)
>  {
> -	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
>  	return 0;
>  }
>  
>  static void __exit ttm_exit(void)
>  {
> -	debugfs_remove(ttm_debugfs_root);
>  }

I think you can delete these functions and the lines below too, they
should be optional. With that:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

-Daniel

>  
>  module_init(ttm_init);
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 5/6] drm/ttm: Initialize debugfs from ttm_global_init()
@ 2021-07-20 14:22     ` Daniel Vetter
  0 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-20 14:22 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: intel-gfx, dri-devel

On Mon, Jul 19, 2021 at 01:30:46PM -0500, Jason Ekstrand wrote:
> We create a bunch of debugfs entries as a side-effect of
> ttm_global_init() and then never clean them up.  This isn't usually a
> problem because we free the whole debugfs directory on module unload.
> However, if the global reference count ever goes to zero and then
> ttm_global_init() is called again, we'll re-create those debugfs entries
> and debugfs will complain in dmesg that we're creating entries that
> already exist.  This patch fixes this problem by changing the lifetime
> of the whole TTM debugfs directory to match that of the TTM global
> state.
> 
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> ---
>  drivers/gpu/drm/ttm/ttm_device.c | 12 ++++++++++++
>  drivers/gpu/drm/ttm/ttm_module.c |  4 ----
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 519deea8e39b7..74e3b460132b3 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -44,6 +44,8 @@ static unsigned ttm_glob_use_count;
>  struct ttm_global ttm_glob;
>  EXPORT_SYMBOL(ttm_glob);
>  
> +struct dentry *ttm_debugfs_root;
> +
>  static void ttm_global_release(void)
>  {
>  	struct ttm_global *glob = &ttm_glob;
> @@ -53,6 +55,7 @@ static void ttm_global_release(void)
>  		goto out;
>  
>  	ttm_pool_mgr_fini();
> +	debugfs_remove(ttm_debugfs_root);
>  
>  	__free_page(glob->dummy_read_page);
>  	memset(glob, 0, sizeof(*glob));
> @@ -73,6 +76,13 @@ static int ttm_global_init(void)
>  
>  	si_meminfo(&si);
>  
> +	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
> +	if (IS_ERR(ttm_debugfs_root)) {
> +		ret = PTR_ERR(ttm_debugfs_root);
> +		ttm_debugfs_root = NULL;
> +		goto out;
> +	}
> +
>  	/* Limit the number of pages in the pool to about 50% of the total
>  	 * system memory.
>  	 */
> @@ -100,6 +110,8 @@ static int ttm_global_init(void)
>  	debugfs_create_atomic_t("buffer_objects", 0444, ttm_debugfs_root,
>  				&glob->bo_count);
>  out:
> +	if (ret && ttm_debugfs_root)
> +		debugfs_remove(ttm_debugfs_root);
>  	if (ret)
>  		--ttm_glob_use_count;
>  	mutex_unlock(&ttm_global_mutex);
> diff --git a/drivers/gpu/drm/ttm/ttm_module.c b/drivers/gpu/drm/ttm/ttm_module.c
> index 997c458f68a9a..88554f2db11fe 100644
> --- a/drivers/gpu/drm/ttm/ttm_module.c
> +++ b/drivers/gpu/drm/ttm/ttm_module.c
> @@ -72,17 +72,13 @@ pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp)
>  	return tmp;
>  }
>  
> -struct dentry *ttm_debugfs_root;
> -
>  static int __init ttm_init(void)
>  {
> -	ttm_debugfs_root = debugfs_create_dir("ttm", NULL);
>  	return 0;
>  }
>  
>  static void __exit ttm_exit(void)
>  {
> -	debugfs_remove(ttm_debugfs_root);
>  }

I think you can delete these functions and the lines below too, they
should be optional. With that:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

-Daniel

>  
>  module_init(ttm_init);
> -- 
> 2.31.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20  8:25     ` Tvrtko Ursulin
@ 2021-07-20 14:53       ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-20 14:53 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel GFX, Maling list - DRI developers

On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 19/07/2021 19:30, Jason Ekstrand wrote:
> > If the driver was not fully loaded, we may still have globals lying
> > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > run mock selftests.  In either case, we have an early exit from
> > i915_init which happens after i915_globals_init() and we need to clean
> > up those globals.  While we're here, add an explicit boolean instead of
> > using a random field from i915_pci_device to detect partial loads.
> >
> > The mock selftests case gets especially sticky.  The load isn't entirely
> > a no-op.  We actually do quite a bit inside those selftests including
> > allocating a bunch of mock objects and running tests on them.  Once all
> > those tests are complete, we exit early from i915_init().  Perviously,
> > i915_init() would return a non-zero error code on failure and a zero
> > error code on success.  In the success case, we would get to i915_exit()
> > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > and do nothing.  In the failure case, we would fail i915_init() but
> > there would be no opportunity to clean up globals.
> >
> > The most annoying part is that you don't actually notice the failure as
> > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > result in anything observable from userspace.  Instead, the next time we
> > load the driver (usually for next IGT test), i915_globals_init() gets
> > invoked again, we go to allocate a bunch of new memory slabs, those
> > implicitly create debugfs entries, and debugfs warns that we're trying
> > to create directories and files that already exist.  Since this all
> > happens as part of the next driver load, it shows up in the dmesg-warn
> > of whatever IGT test ran after the mock selftests.
>
> Story checks out but I totally don't get why it wouldn't be noticed
> until now. Was it perhaps part of the selfetsts contract that a reboot
> is required after failure?

No.  They do unload the driver, though.  They just don't re-load it.

> > While the obvious thing to do here might be to call i915_globals_exit()
> > after selftests, that's not actually safe.  The dma-buf selftests call
> > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > the resulting dmabuf which calls fput() on the file.  However, fput()
> > isn't immediate and gets flushed right before syscall returns.  This
> > means that all the fput()s from the selftests don't happen until right
> > before the module load syscall used to fire off the selftests returns
> > which is after i915_init().  If we call i915_globals_exit() in
> > i915_init() after selftests, we end up freeing slabs out from under
> > objects which won't get released until fput() is flushed at the end of
> > the module load.
>
> Nasty. Wasn't visible while globals memory leak was "in place". :I
>
> > The solution here is to let i915_init() return success early and detect
> > the early success in i915_exit() and only tear down globals and nothing
> > else.  This way the module loads successfully, regardless of the success
> > or failure of the tests.  Because we've not enumerated any PCI devices,
> > no device nodes are created and it's entirely useless from userspace.
> > The only thing the module does at that point is hold on to a bit of
> > memory until we unload it and i915_exit() is called.  Importantly, this
> > means that everything from our selftests has the ability to properly
> > flush out between i915_init() and i915_exit() because there are a couple
> > syscall boundaries in between.
>
> When you say "couple of syscall boundaries" you mean exactly two (module
> init/unload) or there is more to it? Like why "couple" is needed and not
> just that the module load syscall has exited? That part sounds
> potentially dodgy. What mechanism is used by the delayed flush?
>
> Have you checked how this change interacts with the test runner and CI?

By the end of the series, a bunch of tests are fixed.  In particular,
https://gitlab.freedesktop.org/drm/intel/-/issues/3746

> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >   drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> >   1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > index 4e627b57d31a2..24e4e54516936 100644
> > --- a/drivers/gpu/drm/i915/i915_pci.c
> > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> >       .driver.pm = &i915_pm_ops,
> >   };
> >
> > +static bool i915_fully_loaded = false;
>
> No need to initialize.

Already dropped.

> > +
> >   static int __init i915_init(void)
> >   {
> >       bool use_kms = true;
> >       int err;
> >
> > +     i915_fully_loaded = false;
>
> Ditto.

So, this is something I'm unclear on.  I know that static memory gets
auto-initialized to zero but what happens if you unload and reload a
module?  Is it re-initialized to zero?  If it is then we can drop
this.

> > +
> >       err = i915_globals_init();
> >       if (err)
> >               return err;
> >
> > +     /* i915_mock_selftests() only returns zero if no mock subtests were
>
>
> /*
>   * Please use this multi line comment style in i915.
>   */

Done.

> > +      * run.  If we get any non-zero error code, we return early here.
> > +      * We always return success because selftests may have allocated
> > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > +      * could attempt to clean up immediately and fail module load but,
> > +      * thanks to interactions with other parts of the kernel (struct
> > +      * file, in particular), it's safer to let the module fully load
> > +      * and then clean up on unload.
> > +      */
> >       err = i915_mock_selftests();
> >       if (err)
> > -             return err > 0 ? 0 : err;
> > +             return 0;
> >
> >       /*
> >        * Enable KMS by default, unless explicitly overriden by
> > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> >               return 0;
> >       }
> >
> > +     /* After this point, i915_init() must either fully succeed or
> > +      * properly tear everything down and fail.  We don't have separate
> > +      * flags for each set-up bit.
> > +      */
> > +     i915_fully_loaded = true;
> > +
> >       i915_pmu_init();
> >
> >       err = pci_register_driver(&i915_pci_driver);
> > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> >
> >   static void __exit i915_exit(void)
> >   {
> > -     if (!i915_pci_driver.driver.owner)
> > -             return;
> > -
> > -     i915_perf_sysctl_unregister();
> > -     pci_unregister_driver(&i915_pci_driver);
> > -     i915_pmu_exit();
> > +     if (i915_fully_loaded) {
> > +             i915_perf_sysctl_unregister();
> > +             pci_unregister_driver(&i915_pci_driver);
> > +             i915_pmu_exit();
> > +     }
> >       i915_globals_exit();
> >   }
> >
> >
>
> Regards,
>
> Tvrtko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 14:53       ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-20 14:53 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel GFX, Maling list - DRI developers

On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 19/07/2021 19:30, Jason Ekstrand wrote:
> > If the driver was not fully loaded, we may still have globals lying
> > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > run mock selftests.  In either case, we have an early exit from
> > i915_init which happens after i915_globals_init() and we need to clean
> > up those globals.  While we're here, add an explicit boolean instead of
> > using a random field from i915_pci_device to detect partial loads.
> >
> > The mock selftests case gets especially sticky.  The load isn't entirely
> > a no-op.  We actually do quite a bit inside those selftests including
> > allocating a bunch of mock objects and running tests on them.  Once all
> > those tests are complete, we exit early from i915_init().  Perviously,
> > i915_init() would return a non-zero error code on failure and a zero
> > error code on success.  In the success case, we would get to i915_exit()
> > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > and do nothing.  In the failure case, we would fail i915_init() but
> > there would be no opportunity to clean up globals.
> >
> > The most annoying part is that you don't actually notice the failure as
> > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > result in anything observable from userspace.  Instead, the next time we
> > load the driver (usually for next IGT test), i915_globals_init() gets
> > invoked again, we go to allocate a bunch of new memory slabs, those
> > implicitly create debugfs entries, and debugfs warns that we're trying
> > to create directories and files that already exist.  Since this all
> > happens as part of the next driver load, it shows up in the dmesg-warn
> > of whatever IGT test ran after the mock selftests.
>
> Story checks out but I totally don't get why it wouldn't be noticed
> until now. Was it perhaps part of the selfetsts contract that a reboot
> is required after failure?

No.  They do unload the driver, though.  They just don't re-load it.

> > While the obvious thing to do here might be to call i915_globals_exit()
> > after selftests, that's not actually safe.  The dma-buf selftests call
> > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > the resulting dmabuf which calls fput() on the file.  However, fput()
> > isn't immediate and gets flushed right before syscall returns.  This
> > means that all the fput()s from the selftests don't happen until right
> > before the module load syscall used to fire off the selftests returns
> > which is after i915_init().  If we call i915_globals_exit() in
> > i915_init() after selftests, we end up freeing slabs out from under
> > objects which won't get released until fput() is flushed at the end of
> > the module load.
>
> Nasty. Wasn't visible while globals memory leak was "in place". :I
>
> > The solution here is to let i915_init() return success early and detect
> > the early success in i915_exit() and only tear down globals and nothing
> > else.  This way the module loads successfully, regardless of the success
> > or failure of the tests.  Because we've not enumerated any PCI devices,
> > no device nodes are created and it's entirely useless from userspace.
> > The only thing the module does at that point is hold on to a bit of
> > memory until we unload it and i915_exit() is called.  Importantly, this
> > means that everything from our selftests has the ability to properly
> > flush out between i915_init() and i915_exit() because there are a couple
> > syscall boundaries in between.
>
> When you say "couple of syscall boundaries" you mean exactly two (module
> init/unload) or there is more to it? Like why "couple" is needed and not
> just that the module load syscall has exited? That part sounds
> potentially dodgy. What mechanism is used by the delayed flush?
>
> Have you checked how this change interacts with the test runner and CI?

By the end of the series, a bunch of tests are fixed.  In particular,
https://gitlab.freedesktop.org/drm/intel/-/issues/3746

> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >   drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> >   1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > index 4e627b57d31a2..24e4e54516936 100644
> > --- a/drivers/gpu/drm/i915/i915_pci.c
> > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> >       .driver.pm = &i915_pm_ops,
> >   };
> >
> > +static bool i915_fully_loaded = false;
>
> No need to initialize.

Already dropped.

> > +
> >   static int __init i915_init(void)
> >   {
> >       bool use_kms = true;
> >       int err;
> >
> > +     i915_fully_loaded = false;
>
> Ditto.

So, this is something I'm unclear on.  I know that static memory gets
auto-initialized to zero but what happens if you unload and reload a
module?  Is it re-initialized to zero?  If it is then we can drop
this.

> > +
> >       err = i915_globals_init();
> >       if (err)
> >               return err;
> >
> > +     /* i915_mock_selftests() only returns zero if no mock subtests were
>
>
> /*
>   * Please use this multi line comment style in i915.
>   */

Done.

> > +      * run.  If we get any non-zero error code, we return early here.
> > +      * We always return success because selftests may have allocated
> > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > +      * could attempt to clean up immediately and fail module load but,
> > +      * thanks to interactions with other parts of the kernel (struct
> > +      * file, in particular), it's safer to let the module fully load
> > +      * and then clean up on unload.
> > +      */
> >       err = i915_mock_selftests();
> >       if (err)
> > -             return err > 0 ? 0 : err;
> > +             return 0;
> >
> >       /*
> >        * Enable KMS by default, unless explicitly overriden by
> > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> >               return 0;
> >       }
> >
> > +     /* After this point, i915_init() must either fully succeed or
> > +      * properly tear everything down and fail.  We don't have separate
> > +      * flags for each set-up bit.
> > +      */
> > +     i915_fully_loaded = true;
> > +
> >       i915_pmu_init();
> >
> >       err = pci_register_driver(&i915_pci_driver);
> > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> >
> >   static void __exit i915_exit(void)
> >   {
> > -     if (!i915_pci_driver.driver.owner)
> > -             return;
> > -
> > -     i915_perf_sysctl_unregister();
> > -     pci_unregister_driver(&i915_pci_driver);
> > -     i915_pmu_exit();
> > +     if (i915_fully_loaded) {
> > +             i915_perf_sysctl_unregister();
> > +             pci_unregister_driver(&i915_pci_driver);
> > +             i915_pmu_exit();
> > +     }
> >       i915_globals_exit();
> >   }
> >
> >
>
> Regards,
>
> Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20 14:18     ` [Intel-gfx] " Daniel Vetter
@ 2021-07-20 14:55       ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-20 14:55 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel GFX, Maling list - DRI developers

On Tue, Jul 20, 2021 at 9:18 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> > If the driver was not fully loaded, we may still have globals lying
> > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > run mock selftests.  In either case, we have an early exit from
> > i915_init which happens after i915_globals_init() and we need to clean
> > up those globals.  While we're here, add an explicit boolean instead of
> > using a random field from i915_pci_device to detect partial loads.
> >
> > The mock selftests case gets especially sticky.  The load isn't entirely
> > a no-op.  We actually do quite a bit inside those selftests including
> > allocating a bunch of mock objects and running tests on them.  Once all
> > those tests are complete, we exit early from i915_init().  Perviously,
> > i915_init() would return a non-zero error code on failure and a zero
> > error code on success.  In the success case, we would get to i915_exit()
> > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > and do nothing.  In the failure case, we would fail i915_init() but
> > there would be no opportunity to clean up globals.
> >
> > The most annoying part is that you don't actually notice the failure as
> > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > result in anything observable from userspace.  Instead, the next time we
> > load the driver (usually for next IGT test), i915_globals_init() gets
> > invoked again, we go to allocate a bunch of new memory slabs, those
> > implicitly create debugfs entries, and debugfs warns that we're trying
> > to create directories and files that already exist.  Since this all
> > happens as part of the next driver load, it shows up in the dmesg-warn
> > of whatever IGT test ran after the mock selftests.
> >
> > While the obvious thing to do here might be to call i915_globals_exit()
> > after selftests, that's not actually safe.  The dma-buf selftests call
> > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > the resulting dmabuf which calls fput() on the file.  However, fput()
> > isn't immediate and gets flushed right before syscall returns.  This
> > means that all the fput()s from the selftests don't happen until right
> > before the module load syscall used to fire off the selftests returns
> > which is after i915_init().  If we call i915_globals_exit() in
> > i915_init() after selftests, we end up freeing slabs out from under
> > objects which won't get released until fput() is flushed at the end of
> > the module load.
> >
> > The solution here is to let i915_init() return success early and detect
> > the early success in i915_exit() and only tear down globals and nothing
> > else.  This way the module loads successfully, regardless of the success
> > or failure of the tests.  Because we've not enumerated any PCI devices,
> > no device nodes are created and it's entirely useless from userspace.
> > The only thing the module does at that point is hold on to a bit of
> > memory until we unload it and i915_exit() is called.  Importantly, this
> > means that everything from our selftests has the ability to properly
> > flush out between i915_init() and i915_exit() because there are a couple
> > syscall boundaries in between.
> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> >  1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > index 4e627b57d31a2..24e4e54516936 100644
> > --- a/drivers/gpu/drm/i915/i915_pci.c
> > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> >       .driver.pm = &i915_pm_ops,
> >  };
> >
> > +static bool i915_fully_loaded = false;
> > +
> >  static int __init i915_init(void)
> >  {
> >       bool use_kms = true;
> >       int err;
> >
> > +     i915_fully_loaded = false;
> > +
> >       err = i915_globals_init();
> >       if (err)
> >               return err;
> >
> > +     /* i915_mock_selftests() only returns zero if no mock subtests were
> > +      * run.  If we get any non-zero error code, we return early here.
> > +      * We always return success because selftests may have allocated
> > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > +      * could attempt to clean up immediately and fail module load but,
> > +      * thanks to interactions with other parts of the kernel (struct
> > +      * file, in particular), it's safer to let the module fully load
> > +      * and then clean up on unload.
> > +      */
> >       err = i915_mock_selftests();
> >       if (err)
> > -             return err > 0 ? 0 : err;
> > +             return 0;
>
> At least the module options still claim that you can run selftests and
> still load the driver. Which makes sense for perf/hw selftests, since
> those need the driver, but would result in the same old bug resurfacing
> that you're trying to fix there.
>
> Is that description just confused and needs some fixing, or do we have a
> gap here?

I don't think there's real need for a fully loaded driver after mock
selftests.  They exist entirely to run against a mock driver, not the
real one.

> Patch itself looks reasonable, with the nits from Tvrtko addressed:

Done

> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Thanks

--Jason

> >
> >       /*
> >        * Enable KMS by default, unless explicitly overriden by
> > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> >               return 0;
> >       }
> >
> > +     /* After this point, i915_init() must either fully succeed or
> > +      * properly tear everything down and fail.  We don't have separate
> > +      * flags for each set-up bit.
> > +      */
> > +     i915_fully_loaded = true;
> > +
> >       i915_pmu_init();
> >
> >       err = pci_register_driver(&i915_pci_driver);
> > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> >
> >  static void __exit i915_exit(void)
> >  {
> > -     if (!i915_pci_driver.driver.owner)
> > -             return;
> > -
> > -     i915_perf_sysctl_unregister();
> > -     pci_unregister_driver(&i915_pci_driver);
> > -     i915_pmu_exit();
> > +     if (i915_fully_loaded) {
> > +             i915_perf_sysctl_unregister();
> > +             pci_unregister_driver(&i915_pci_driver);
> > +             i915_pmu_exit();
> > +     }
> >       i915_globals_exit();
> >  }
> >
> > --
> > 2.31.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 14:55       ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-20 14:55 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel GFX, Maling list - DRI developers

On Tue, Jul 20, 2021 at 9:18 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> > If the driver was not fully loaded, we may still have globals lying
> > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > run mock selftests.  In either case, we have an early exit from
> > i915_init which happens after i915_globals_init() and we need to clean
> > up those globals.  While we're here, add an explicit boolean instead of
> > using a random field from i915_pci_device to detect partial loads.
> >
> > The mock selftests case gets especially sticky.  The load isn't entirely
> > a no-op.  We actually do quite a bit inside those selftests including
> > allocating a bunch of mock objects and running tests on them.  Once all
> > those tests are complete, we exit early from i915_init().  Perviously,
> > i915_init() would return a non-zero error code on failure and a zero
> > error code on success.  In the success case, we would get to i915_exit()
> > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > and do nothing.  In the failure case, we would fail i915_init() but
> > there would be no opportunity to clean up globals.
> >
> > The most annoying part is that you don't actually notice the failure as
> > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > result in anything observable from userspace.  Instead, the next time we
> > load the driver (usually for next IGT test), i915_globals_init() gets
> > invoked again, we go to allocate a bunch of new memory slabs, those
> > implicitly create debugfs entries, and debugfs warns that we're trying
> > to create directories and files that already exist.  Since this all
> > happens as part of the next driver load, it shows up in the dmesg-warn
> > of whatever IGT test ran after the mock selftests.
> >
> > While the obvious thing to do here might be to call i915_globals_exit()
> > after selftests, that's not actually safe.  The dma-buf selftests call
> > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > the resulting dmabuf which calls fput() on the file.  However, fput()
> > isn't immediate and gets flushed right before syscall returns.  This
> > means that all the fput()s from the selftests don't happen until right
> > before the module load syscall used to fire off the selftests returns
> > which is after i915_init().  If we call i915_globals_exit() in
> > i915_init() after selftests, we end up freeing slabs out from under
> > objects which won't get released until fput() is flushed at the end of
> > the module load.
> >
> > The solution here is to let i915_init() return success early and detect
> > the early success in i915_exit() and only tear down globals and nothing
> > else.  This way the module loads successfully, regardless of the success
> > or failure of the tests.  Because we've not enumerated any PCI devices,
> > no device nodes are created and it's entirely useless from userspace.
> > The only thing the module does at that point is hold on to a bit of
> > memory until we unload it and i915_exit() is called.  Importantly, this
> > means that everything from our selftests has the ability to properly
> > flush out between i915_init() and i915_exit() because there are a couple
> > syscall boundaries in between.
> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> >  1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > index 4e627b57d31a2..24e4e54516936 100644
> > --- a/drivers/gpu/drm/i915/i915_pci.c
> > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> >       .driver.pm = &i915_pm_ops,
> >  };
> >
> > +static bool i915_fully_loaded = false;
> > +
> >  static int __init i915_init(void)
> >  {
> >       bool use_kms = true;
> >       int err;
> >
> > +     i915_fully_loaded = false;
> > +
> >       err = i915_globals_init();
> >       if (err)
> >               return err;
> >
> > +     /* i915_mock_selftests() only returns zero if no mock subtests were
> > +      * run.  If we get any non-zero error code, we return early here.
> > +      * We always return success because selftests may have allocated
> > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > +      * could attempt to clean up immediately and fail module load but,
> > +      * thanks to interactions with other parts of the kernel (struct
> > +      * file, in particular), it's safer to let the module fully load
> > +      * and then clean up on unload.
> > +      */
> >       err = i915_mock_selftests();
> >       if (err)
> > -             return err > 0 ? 0 : err;
> > +             return 0;
>
> At least the module options still claim that you can run selftests and
> still load the driver. Which makes sense for perf/hw selftests, since
> those need the driver, but would result in the same old bug resurfacing
> that you're trying to fix there.
>
> Is that description just confused and needs some fixing, or do we have a
> gap here?

I don't think there's real need for a fully loaded driver after mock
selftests.  They exist entirely to run against a mock driver, not the
real one.

> Patch itself looks reasonable, with the nits from Tvrtko addressed:

Done

> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Thanks

--Jason

> >
> >       /*
> >        * Enable KMS by default, unless explicitly overriden by
> > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> >               return 0;
> >       }
> >
> > +     /* After this point, i915_init() must either fully succeed or
> > +      * properly tear everything down and fail.  We don't have separate
> > +      * flags for each set-up bit.
> > +      */
> > +     i915_fully_loaded = true;
> > +
> >       i915_pmu_init();
> >
> >       err = pci_register_driver(&i915_pci_driver);
> > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> >
> >  static void __exit i915_exit(void)
> >  {
> > -     if (!i915_pci_driver.driver.owner)
> > -             return;
> > -
> > -     i915_perf_sysctl_unregister();
> > -     pci_unregister_driver(&i915_pci_driver);
> > -     i915_pmu_exit();
> > +     if (i915_fully_loaded) {
> > +             i915_perf_sysctl_unregister();
> > +             pci_unregister_driver(&i915_pci_driver);
> > +             i915_pmu_exit();
> > +     }
> >       i915_globals_exit();
> >  }
> >
> > --
> > 2.31.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20  8:25     ` Tvrtko Ursulin
@ 2021-07-20 15:05       ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-20 15:05 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel GFX, Maling list - DRI developers

Sorry... didn't reply to everything the first time

On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 19/07/2021 19:30, Jason Ekstrand wrote:
> > If the driver was not fully loaded, we may still have globals lying
> > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > run mock selftests.  In either case, we have an early exit from
> > i915_init which happens after i915_globals_init() and we need to clean
> > up those globals.  While we're here, add an explicit boolean instead of
> > using a random field from i915_pci_device to detect partial loads.
> >
> > The mock selftests case gets especially sticky.  The load isn't entirely
> > a no-op.  We actually do quite a bit inside those selftests including
> > allocating a bunch of mock objects and running tests on them.  Once all
> > those tests are complete, we exit early from i915_init().  Perviously,
> > i915_init() would return a non-zero error code on failure and a zero
> > error code on success.  In the success case, we would get to i915_exit()
> > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > and do nothing.  In the failure case, we would fail i915_init() but
> > there would be no opportunity to clean up globals.
> >
> > The most annoying part is that you don't actually notice the failure as
> > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > result in anything observable from userspace.  Instead, the next time we
> > load the driver (usually for next IGT test), i915_globals_init() gets
> > invoked again, we go to allocate a bunch of new memory slabs, those
> > implicitly create debugfs entries, and debugfs warns that we're trying
> > to create directories and files that already exist.  Since this all
> > happens as part of the next driver load, it shows up in the dmesg-warn
> > of whatever IGT test ran after the mock selftests.
>
> Story checks out but I totally don't get why it wouldn't be noticed
> until now. Was it perhaps part of the selfetsts contract that a reboot
> is required after failure?

If there is such a contract, CI doesn't follow it.  We unload the
driver after selftests but that's it.

> > While the obvious thing to do here might be to call i915_globals_exit()
> > after selftests, that's not actually safe.  The dma-buf selftests call
> > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > the resulting dmabuf which calls fput() on the file.  However, fput()
> > isn't immediate and gets flushed right before syscall returns.  This
> > means that all the fput()s from the selftests don't happen until right
> > before the module load syscall used to fire off the selftests returns
> > which is after i915_init().  If we call i915_globals_exit() in
> > i915_init() after selftests, we end up freeing slabs out from under
> > objects which won't get released until fput() is flushed at the end of
> > the module load.
>
> Nasty. Wasn't visible while globals memory leak was "in place". :I
>
> > The solution here is to let i915_init() return success early and detect
> > the early success in i915_exit() and only tear down globals and nothing
> > else.  This way the module loads successfully, regardless of the success
> > or failure of the tests.  Because we've not enumerated any PCI devices,
> > no device nodes are created and it's entirely useless from userspace.
> > The only thing the module does at that point is hold on to a bit of
> > memory until we unload it and i915_exit() is called.  Importantly, this
> > means that everything from our selftests has the ability to properly
> > flush out between i915_init() and i915_exit() because there are a couple
> > syscall boundaries in between.
>
> When you say "couple of syscall boundaries" you mean exactly two (module
> init/unload) or there is more to it? Like why "couple" is needed and not
> just that the module load syscall has exited? That part sounds
> potentially dodgy. What mechanism is used by the delayed flush?

It only needs the one syscall.  I've changed the text to say "at least
one syscall boundary".  I think that's more clear without providing an
exact count which may not be tractable.

> Have you checked how this change interacts with the test runner and CI?

As far as I know, there's no interesting interaction here.  That said,
I did just find that the live selftests fail the modprobe on selftest
failure which means they're tearing down globals before a full syscall
boundary which may be sketchy.  Fortunately, now that we have
i915_globals_exit() on the tear-down path if PCI probe fails, if
someone ever does do something sketchy there, we'll catch it in dmesg
immediately.  Maybe we should switch those to always return 0 as well
while we're here?

> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >   drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> >   1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > index 4e627b57d31a2..24e4e54516936 100644
> > --- a/drivers/gpu/drm/i915/i915_pci.c
> > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> >       .driver.pm = &i915_pm_ops,
> >   };
> >
> > +static bool i915_fully_loaded = false;
>
> No need to initialize.
>
> > +
> >   static int __init i915_init(void)
> >   {
> >       bool use_kms = true;
> >       int err;
> >
> > +     i915_fully_loaded = false;
>
> Ditto.
>
> > +
> >       err = i915_globals_init();
> >       if (err)
> >               return err;
> >
> > +     /* i915_mock_selftests() only returns zero if no mock subtests were
>
>
> /*
>   * Please use this multi line comment style in i915.
>   */
>
>
> > +      * run.  If we get any non-zero error code, we return early here.
> > +      * We always return success because selftests may have allocated
> > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > +      * could attempt to clean up immediately and fail module load but,
> > +      * thanks to interactions with other parts of the kernel (struct
> > +      * file, in particular), it's safer to let the module fully load
> > +      * and then clean up on unload.
> > +      */
> >       err = i915_mock_selftests();
> >       if (err)
> > -             return err > 0 ? 0 : err;
> > +             return 0;
> >
> >       /*
> >        * Enable KMS by default, unless explicitly overriden by
> > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> >               return 0;
> >       }
> >
> > +     /* After this point, i915_init() must either fully succeed or
> > +      * properly tear everything down and fail.  We don't have separate
> > +      * flags for each set-up bit.
> > +      */
> > +     i915_fully_loaded = true;
> > +
> >       i915_pmu_init();
> >
> >       err = pci_register_driver(&i915_pci_driver);
> > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> >
> >   static void __exit i915_exit(void)
> >   {
> > -     if (!i915_pci_driver.driver.owner)
> > -             return;
> > -
> > -     i915_perf_sysctl_unregister();
> > -     pci_unregister_driver(&i915_pci_driver);
> > -     i915_pmu_exit();
> > +     if (i915_fully_loaded) {
> > +             i915_perf_sysctl_unregister();
> > +             pci_unregister_driver(&i915_pci_driver);
> > +             i915_pmu_exit();
> > +     }
> >       i915_globals_exit();
> >   }
> >
> >
>
> Regards,
>
> Tvrtko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 15:05       ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-20 15:05 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel GFX, Maling list - DRI developers

Sorry... didn't reply to everything the first time

On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 19/07/2021 19:30, Jason Ekstrand wrote:
> > If the driver was not fully loaded, we may still have globals lying
> > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > run mock selftests.  In either case, we have an early exit from
> > i915_init which happens after i915_globals_init() and we need to clean
> > up those globals.  While we're here, add an explicit boolean instead of
> > using a random field from i915_pci_device to detect partial loads.
> >
> > The mock selftests case gets especially sticky.  The load isn't entirely
> > a no-op.  We actually do quite a bit inside those selftests including
> > allocating a bunch of mock objects and running tests on them.  Once all
> > those tests are complete, we exit early from i915_init().  Perviously,
> > i915_init() would return a non-zero error code on failure and a zero
> > error code on success.  In the success case, we would get to i915_exit()
> > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > and do nothing.  In the failure case, we would fail i915_init() but
> > there would be no opportunity to clean up globals.
> >
> > The most annoying part is that you don't actually notice the failure as
> > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > result in anything observable from userspace.  Instead, the next time we
> > load the driver (usually for next IGT test), i915_globals_init() gets
> > invoked again, we go to allocate a bunch of new memory slabs, those
> > implicitly create debugfs entries, and debugfs warns that we're trying
> > to create directories and files that already exist.  Since this all
> > happens as part of the next driver load, it shows up in the dmesg-warn
> > of whatever IGT test ran after the mock selftests.
>
> Story checks out but I totally don't get why it wouldn't be noticed
> until now. Was it perhaps part of the selfetsts contract that a reboot
> is required after failure?

If there is such a contract, CI doesn't follow it.  We unload the
driver after selftests but that's it.

> > While the obvious thing to do here might be to call i915_globals_exit()
> > after selftests, that's not actually safe.  The dma-buf selftests call
> > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > the resulting dmabuf which calls fput() on the file.  However, fput()
> > isn't immediate and gets flushed right before syscall returns.  This
> > means that all the fput()s from the selftests don't happen until right
> > before the module load syscall used to fire off the selftests returns
> > which is after i915_init().  If we call i915_globals_exit() in
> > i915_init() after selftests, we end up freeing slabs out from under
> > objects which won't get released until fput() is flushed at the end of
> > the module load.
>
> Nasty. Wasn't visible while globals memory leak was "in place". :I
>
> > The solution here is to let i915_init() return success early and detect
> > the early success in i915_exit() and only tear down globals and nothing
> > else.  This way the module loads successfully, regardless of the success
> > or failure of the tests.  Because we've not enumerated any PCI devices,
> > no device nodes are created and it's entirely useless from userspace.
> > The only thing the module does at that point is hold on to a bit of
> > memory until we unload it and i915_exit() is called.  Importantly, this
> > means that everything from our selftests has the ability to properly
> > flush out between i915_init() and i915_exit() because there are a couple
> > syscall boundaries in between.
>
> When you say "couple of syscall boundaries" you mean exactly two (module
> init/unload) or there is more to it? Like why "couple" is needed and not
> just that the module load syscall has exited? That part sounds
> potentially dodgy. What mechanism is used by the delayed flush?

It only needs the one syscall.  I've changed the text to say "at least
one syscall boundary".  I think that's more clear without providing an
exact count which may not be tractable.

> Have you checked how this change interacts with the test runner and CI?

As far as I know, there's no interesting interaction here.  That said,
I did just find that the live selftests fail the modprobe on selftest
failure which means they're tearing down globals before a full syscall
boundary which may be sketchy.  Fortunately, now that we have
i915_globals_exit() on the tear-down path if PCI probe fails, if
someone ever does do something sketchy there, we'll catch it in dmesg
immediately.  Maybe we should switch those to always return 0 as well
while we're here?

> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >   drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> >   1 file changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > index 4e627b57d31a2..24e4e54516936 100644
> > --- a/drivers/gpu/drm/i915/i915_pci.c
> > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> >       .driver.pm = &i915_pm_ops,
> >   };
> >
> > +static bool i915_fully_loaded = false;
>
> No need to initialize.
>
> > +
> >   static int __init i915_init(void)
> >   {
> >       bool use_kms = true;
> >       int err;
> >
> > +     i915_fully_loaded = false;
>
> Ditto.
>
> > +
> >       err = i915_globals_init();
> >       if (err)
> >               return err;
> >
> > +     /* i915_mock_selftests() only returns zero if no mock subtests were
>
>
> /*
>   * Please use this multi line comment style in i915.
>   */
>
>
> > +      * run.  If we get any non-zero error code, we return early here.
> > +      * We always return success because selftests may have allocated
> > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > +      * could attempt to clean up immediately and fail module load but,
> > +      * thanks to interactions with other parts of the kernel (struct
> > +      * file, in particular), it's safer to let the module fully load
> > +      * and then clean up on unload.
> > +      */
> >       err = i915_mock_selftests();
> >       if (err)
> > -             return err > 0 ? 0 : err;
> > +             return 0;
> >
> >       /*
> >        * Enable KMS by default, unless explicitly overriden by
> > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> >               return 0;
> >       }
> >
> > +     /* After this point, i915_init() must either fully succeed or
> > +      * properly tear everything down and fail.  We don't have separate
> > +      * flags for each set-up bit.
> > +      */
> > +     i915_fully_loaded = true;
> > +
> >       i915_pmu_init();
> >
> >       err = pci_register_driver(&i915_pci_driver);
> > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> >
> >   static void __exit i915_exit(void)
> >   {
> > -     if (!i915_pci_driver.driver.owner)
> > -             return;
> > -
> > -     i915_perf_sysctl_unregister();
> > -     pci_unregister_driver(&i915_pci_driver);
> > -     i915_pmu_exit();
> > +     if (i915_fully_loaded) {
> > +             i915_perf_sysctl_unregister();
> > +             pci_unregister_driver(&i915_pci_driver);
> > +             i915_pmu_exit();
> > +     }
> >       i915_globals_exit();
> >   }
> >
> >
>
> Regards,
>
> Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20 14:53       ` Jason Ekstrand
@ 2021-07-20 15:22         ` Tvrtko Ursulin
  -1 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20 15:22 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers


On 20/07/2021 15:53, Jason Ekstrand wrote:
> On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 19/07/2021 19:30, Jason Ekstrand wrote:
>>> If the driver was not fully loaded, we may still have globals lying
>>> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
>>> of memory slabs.  This can happen two ways: use_kms = false and if we've
>>> run mock selftests.  In either case, we have an early exit from
>>> i915_init which happens after i915_globals_init() and we need to clean
>>> up those globals.  While we're here, add an explicit boolean instead of
>>> using a random field from i915_pci_device to detect partial loads.
>>>
>>> The mock selftests case gets especially sticky.  The load isn't entirely
>>> a no-op.  We actually do quite a bit inside those selftests including
>>> allocating a bunch of mock objects and running tests on them.  Once all
>>> those tests are complete, we exit early from i915_init().  Perviously,
>>> i915_init() would return a non-zero error code on failure and a zero
>>> error code on success.  In the success case, we would get to i915_exit()
>>> and check i915_pci_driver.driver.owner to detect if i915_init exited early
>>> and do nothing.  In the failure case, we would fail i915_init() but
>>> there would be no opportunity to clean up globals.
>>>
>>> The most annoying part is that you don't actually notice the failure as
>>> part of the self-tests since leaking a bit of memory, while bad, doesn't
>>> result in anything observable from userspace.  Instead, the next time we
>>> load the driver (usually for next IGT test), i915_globals_init() gets
>>> invoked again, we go to allocate a bunch of new memory slabs, those
>>> implicitly create debugfs entries, and debugfs warns that we're trying
>>> to create directories and files that already exist.  Since this all
>>> happens as part of the next driver load, it shows up in the dmesg-warn
>>> of whatever IGT test ran after the mock selftests.
>>
>> Story checks out but I totally don't get why it wouldn't be noticed
>> until now. Was it perhaps part of the selfetsts contract that a reboot
>> is required after failure?
> 
> No.  They do unload the driver, though.  They just don't re-load it.

I guess that does mean behaviour is reboot after first selftests 
failure, which would explain why it wasn't caught until now. I was 
running selftests and I know why I did not see it but that shall not be 
mentioned here. :)

>>> While the obvious thing to do here might be to call i915_globals_exit()
>>> after selftests, that's not actually safe.  The dma-buf selftests call
>>> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
>>> the resulting dmabuf which calls fput() on the file.  However, fput()
>>> isn't immediate and gets flushed right before syscall returns.  This
>>> means that all the fput()s from the selftests don't happen until right
>>> before the module load syscall used to fire off the selftests returns
>>> which is after i915_init().  If we call i915_globals_exit() in
>>> i915_init() after selftests, we end up freeing slabs out from under
>>> objects which won't get released until fput() is flushed at the end of
>>> the module load.
>>
>> Nasty. Wasn't visible while globals memory leak was "in place". :I
>>
>>> The solution here is to let i915_init() return success early and detect
>>> the early success in i915_exit() and only tear down globals and nothing
>>> else.  This way the module loads successfully, regardless of the success
>>> or failure of the tests.  Because we've not enumerated any PCI devices,
>>> no device nodes are created and it's entirely useless from userspace.
>>> The only thing the module does at that point is hold on to a bit of
>>> memory until we unload it and i915_exit() is called.  Importantly, this
>>> means that everything from our selftests has the ability to properly
>>> flush out between i915_init() and i915_exit() because there are a couple
>>> syscall boundaries in between.
>>
>> When you say "couple of syscall boundaries" you mean exactly two (module
>> init/unload) or there is more to it? Like why "couple" is needed and not
>> just that the module load syscall has exited? That part sounds
>> potentially dodgy. What mechanism is used by the delayed flush?
>>
>> Have you checked how this change interacts with the test runner and CI?
> 
> By the end of the series, a bunch of tests are fixed.  In particular,
> https://gitlab.freedesktop.org/drm/intel/-/issues/3746

Wait but that means CI does reload the driver. So again I totally don't 
understand why this is only popping up now.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 15:22         ` Tvrtko Ursulin
  0 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20 15:22 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers


On 20/07/2021 15:53, Jason Ekstrand wrote:
> On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 19/07/2021 19:30, Jason Ekstrand wrote:
>>> If the driver was not fully loaded, we may still have globals lying
>>> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
>>> of memory slabs.  This can happen two ways: use_kms = false and if we've
>>> run mock selftests.  In either case, we have an early exit from
>>> i915_init which happens after i915_globals_init() and we need to clean
>>> up those globals.  While we're here, add an explicit boolean instead of
>>> using a random field from i915_pci_device to detect partial loads.
>>>
>>> The mock selftests case gets especially sticky.  The load isn't entirely
>>> a no-op.  We actually do quite a bit inside those selftests including
>>> allocating a bunch of mock objects and running tests on them.  Once all
>>> those tests are complete, we exit early from i915_init().  Perviously,
>>> i915_init() would return a non-zero error code on failure and a zero
>>> error code on success.  In the success case, we would get to i915_exit()
>>> and check i915_pci_driver.driver.owner to detect if i915_init exited early
>>> and do nothing.  In the failure case, we would fail i915_init() but
>>> there would be no opportunity to clean up globals.
>>>
>>> The most annoying part is that you don't actually notice the failure as
>>> part of the self-tests since leaking a bit of memory, while bad, doesn't
>>> result in anything observable from userspace.  Instead, the next time we
>>> load the driver (usually for next IGT test), i915_globals_init() gets
>>> invoked again, we go to allocate a bunch of new memory slabs, those
>>> implicitly create debugfs entries, and debugfs warns that we're trying
>>> to create directories and files that already exist.  Since this all
>>> happens as part of the next driver load, it shows up in the dmesg-warn
>>> of whatever IGT test ran after the mock selftests.
>>
>> Story checks out but I totally don't get why it wouldn't be noticed
>> until now. Was it perhaps part of the selfetsts contract that a reboot
>> is required after failure?
> 
> No.  They do unload the driver, though.  They just don't re-load it.

I guess that does mean behaviour is reboot after first selftests 
failure, which would explain why it wasn't caught until now. I was 
running selftests and I know why I did not see it but that shall not be 
mentioned here. :)

>>> While the obvious thing to do here might be to call i915_globals_exit()
>>> after selftests, that's not actually safe.  The dma-buf selftests call
>>> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
>>> the resulting dmabuf which calls fput() on the file.  However, fput()
>>> isn't immediate and gets flushed right before syscall returns.  This
>>> means that all the fput()s from the selftests don't happen until right
>>> before the module load syscall used to fire off the selftests returns
>>> which is after i915_init().  If we call i915_globals_exit() in
>>> i915_init() after selftests, we end up freeing slabs out from under
>>> objects which won't get released until fput() is flushed at the end of
>>> the module load.
>>
>> Nasty. Wasn't visible while globals memory leak was "in place". :I
>>
>>> The solution here is to let i915_init() return success early and detect
>>> the early success in i915_exit() and only tear down globals and nothing
>>> else.  This way the module loads successfully, regardless of the success
>>> or failure of the tests.  Because we've not enumerated any PCI devices,
>>> no device nodes are created and it's entirely useless from userspace.
>>> The only thing the module does at that point is hold on to a bit of
>>> memory until we unload it and i915_exit() is called.  Importantly, this
>>> means that everything from our selftests has the ability to properly
>>> flush out between i915_init() and i915_exit() because there are a couple
>>> syscall boundaries in between.
>>
>> When you say "couple of syscall boundaries" you mean exactly two (module
>> init/unload) or there is more to it? Like why "couple" is needed and not
>> just that the module load syscall has exited? That part sounds
>> potentially dodgy. What mechanism is used by the delayed flush?
>>
>> Have you checked how this change interacts with the test runner and CI?
> 
> By the end of the series, a bunch of tests are fixed.  In particular,
> https://gitlab.freedesktop.org/drm/intel/-/issues/3746

Wait but that means CI does reload the driver. So again I totally don't 
understand why this is only popping up now.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20 15:05       ` Jason Ekstrand
@ 2021-07-20 15:25         ` Tvrtko Ursulin
  -1 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20 15:25 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers


On 20/07/2021 16:05, Jason Ekstrand wrote:
> Sorry... didn't reply to everything the first time
> 
> On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 19/07/2021 19:30, Jason Ekstrand wrote:
>>> If the driver was not fully loaded, we may still have globals lying
>>> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
>>> of memory slabs.  This can happen two ways: use_kms = false and if we've
>>> run mock selftests.  In either case, we have an early exit from
>>> i915_init which happens after i915_globals_init() and we need to clean
>>> up those globals.  While we're here, add an explicit boolean instead of
>>> using a random field from i915_pci_device to detect partial loads.
>>>
>>> The mock selftests case gets especially sticky.  The load isn't entirely
>>> a no-op.  We actually do quite a bit inside those selftests including
>>> allocating a bunch of mock objects and running tests on them.  Once all
>>> those tests are complete, we exit early from i915_init().  Perviously,
>>> i915_init() would return a non-zero error code on failure and a zero
>>> error code on success.  In the success case, we would get to i915_exit()
>>> and check i915_pci_driver.driver.owner to detect if i915_init exited early
>>> and do nothing.  In the failure case, we would fail i915_init() but
>>> there would be no opportunity to clean up globals.
>>>
>>> The most annoying part is that you don't actually notice the failure as
>>> part of the self-tests since leaking a bit of memory, while bad, doesn't
>>> result in anything observable from userspace.  Instead, the next time we
>>> load the driver (usually for next IGT test), i915_globals_init() gets
>>> invoked again, we go to allocate a bunch of new memory slabs, those
>>> implicitly create debugfs entries, and debugfs warns that we're trying
>>> to create directories and files that already exist.  Since this all
>>> happens as part of the next driver load, it shows up in the dmesg-warn
>>> of whatever IGT test ran after the mock selftests.
>>
>> Story checks out but I totally don't get why it wouldn't be noticed
>> until now. Was it perhaps part of the selfetsts contract that a reboot
>> is required after failure?
> 
> If there is such a contract, CI doesn't follow it.  We unload the
> driver after selftests but that's it.
> 
>>> While the obvious thing to do here might be to call i915_globals_exit()
>>> after selftests, that's not actually safe.  The dma-buf selftests call
>>> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
>>> the resulting dmabuf which calls fput() on the file.  However, fput()
>>> isn't immediate and gets flushed right before syscall returns.  This
>>> means that all the fput()s from the selftests don't happen until right
>>> before the module load syscall used to fire off the selftests returns
>>> which is after i915_init().  If we call i915_globals_exit() in
>>> i915_init() after selftests, we end up freeing slabs out from under
>>> objects which won't get released until fput() is flushed at the end of
>>> the module load.
>>
>> Nasty. Wasn't visible while globals memory leak was "in place". :I
>>
>>> The solution here is to let i915_init() return success early and detect
>>> the early success in i915_exit() and only tear down globals and nothing
>>> else.  This way the module loads successfully, regardless of the success
>>> or failure of the tests.  Because we've not enumerated any PCI devices,
>>> no device nodes are created and it's entirely useless from userspace.
>>> The only thing the module does at that point is hold on to a bit of
>>> memory until we unload it and i915_exit() is called.  Importantly, this
>>> means that everything from our selftests has the ability to properly
>>> flush out between i915_init() and i915_exit() because there are a couple
>>> syscall boundaries in between.
>>
>> When you say "couple of syscall boundaries" you mean exactly two (module
>> init/unload) or there is more to it? Like why "couple" is needed and not
>> just that the module load syscall has exited? That part sounds
>> potentially dodgy. What mechanism is used by the delayed flush?
> 
> It only needs the one syscall.  I've changed the text to say "at least
> one syscall boundary".  I think that's more clear without providing an
> exact count which may not be tractable.

One additional syscall _after_ the module load one exits, or just that 
one? What is the barrier used? I don't think "syscall boundary" is an 
established synchronisation term so lets understand fully what's 
happening here.

Regards,

Tvrtko

>> Have you checked how this change interacts with the test runner and CI?
> 
> As far as I know, there's no interesting interaction here.  That said,
> I did just find that the live selftests fail the modprobe on selftest
> failure which means they're tearing down globals before a full syscall
> boundary which may be sketchy.  Fortunately, now that we have
> i915_globals_exit() on the tear-down path if PCI probe fails, if
> someone ever does do something sketchy there, we'll catch it in dmesg
> immediately.  Maybe we should switch those to always return 0 as well
> while we're here?
> 
>>>
>>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>>> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>> ---
>>>    drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
>>>    1 file changed, 25 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
>>> index 4e627b57d31a2..24e4e54516936 100644
>>> --- a/drivers/gpu/drm/i915/i915_pci.c
>>> +++ b/drivers/gpu/drm/i915/i915_pci.c
>>> @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
>>>        .driver.pm = &i915_pm_ops,
>>>    };
>>>
>>> +static bool i915_fully_loaded = false;
>>
>> No need to initialize.
>>
>>> +
>>>    static int __init i915_init(void)
>>>    {
>>>        bool use_kms = true;
>>>        int err;
>>>
>>> +     i915_fully_loaded = false;
>>
>> Ditto.
>>
>>> +
>>>        err = i915_globals_init();
>>>        if (err)
>>>                return err;
>>>
>>> +     /* i915_mock_selftests() only returns zero if no mock subtests were
>>
>>
>> /*
>>    * Please use this multi line comment style in i915.
>>    */
>>
>>
>>> +      * run.  If we get any non-zero error code, we return early here.
>>> +      * We always return success because selftests may have allocated
>>> +      * objects from slabs which will get cleaned up by i915_exit().  We
>>> +      * could attempt to clean up immediately and fail module load but,
>>> +      * thanks to interactions with other parts of the kernel (struct
>>> +      * file, in particular), it's safer to let the module fully load
>>> +      * and then clean up on unload.
>>> +      */
>>>        err = i915_mock_selftests();
>>>        if (err)
>>> -             return err > 0 ? 0 : err;
>>> +             return 0;
>>>
>>>        /*
>>>         * Enable KMS by default, unless explicitly overriden by
>>> @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
>>>                return 0;
>>>        }
>>>
>>> +     /* After this point, i915_init() must either fully succeed or
>>> +      * properly tear everything down and fail.  We don't have separate
>>> +      * flags for each set-up bit.
>>> +      */
>>> +     i915_fully_loaded = true;
>>> +
>>>        i915_pmu_init();
>>>
>>>        err = pci_register_driver(&i915_pci_driver);
>>> @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
>>>
>>>    static void __exit i915_exit(void)
>>>    {
>>> -     if (!i915_pci_driver.driver.owner)
>>> -             return;
>>> -
>>> -     i915_perf_sysctl_unregister();
>>> -     pci_unregister_driver(&i915_pci_driver);
>>> -     i915_pmu_exit();
>>> +     if (i915_fully_loaded) {
>>> +             i915_perf_sysctl_unregister();
>>> +             pci_unregister_driver(&i915_pci_driver);
>>> +             i915_pmu_exit();
>>> +     }
>>>        i915_globals_exit();
>>>    }
>>>
>>>
>>
>> Regards,
>>
>> Tvrtko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 15:25         ` Tvrtko Ursulin
  0 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20 15:25 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers


On 20/07/2021 16:05, Jason Ekstrand wrote:
> Sorry... didn't reply to everything the first time
> 
> On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>>
>> On 19/07/2021 19:30, Jason Ekstrand wrote:
>>> If the driver was not fully loaded, we may still have globals lying
>>> around.  If we don't tear those down in i915_exit(), we'll leak a bunch
>>> of memory slabs.  This can happen two ways: use_kms = false and if we've
>>> run mock selftests.  In either case, we have an early exit from
>>> i915_init which happens after i915_globals_init() and we need to clean
>>> up those globals.  While we're here, add an explicit boolean instead of
>>> using a random field from i915_pci_device to detect partial loads.
>>>
>>> The mock selftests case gets especially sticky.  The load isn't entirely
>>> a no-op.  We actually do quite a bit inside those selftests including
>>> allocating a bunch of mock objects and running tests on them.  Once all
>>> those tests are complete, we exit early from i915_init().  Perviously,
>>> i915_init() would return a non-zero error code on failure and a zero
>>> error code on success.  In the success case, we would get to i915_exit()
>>> and check i915_pci_driver.driver.owner to detect if i915_init exited early
>>> and do nothing.  In the failure case, we would fail i915_init() but
>>> there would be no opportunity to clean up globals.
>>>
>>> The most annoying part is that you don't actually notice the failure as
>>> part of the self-tests since leaking a bit of memory, while bad, doesn't
>>> result in anything observable from userspace.  Instead, the next time we
>>> load the driver (usually for next IGT test), i915_globals_init() gets
>>> invoked again, we go to allocate a bunch of new memory slabs, those
>>> implicitly create debugfs entries, and debugfs warns that we're trying
>>> to create directories and files that already exist.  Since this all
>>> happens as part of the next driver load, it shows up in the dmesg-warn
>>> of whatever IGT test ran after the mock selftests.
>>
>> Story checks out but I totally don't get why it wouldn't be noticed
>> until now. Was it perhaps part of the selfetsts contract that a reboot
>> is required after failure?
> 
> If there is such a contract, CI doesn't follow it.  We unload the
> driver after selftests but that's it.
> 
>>> While the obvious thing to do here might be to call i915_globals_exit()
>>> after selftests, that's not actually safe.  The dma-buf selftests call
>>> i915_gem_prime_export which creates a file.  We call dma_buf_put() on
>>> the resulting dmabuf which calls fput() on the file.  However, fput()
>>> isn't immediate and gets flushed right before syscall returns.  This
>>> means that all the fput()s from the selftests don't happen until right
>>> before the module load syscall used to fire off the selftests returns
>>> which is after i915_init().  If we call i915_globals_exit() in
>>> i915_init() after selftests, we end up freeing slabs out from under
>>> objects which won't get released until fput() is flushed at the end of
>>> the module load.
>>
>> Nasty. Wasn't visible while globals memory leak was "in place". :I
>>
>>> The solution here is to let i915_init() return success early and detect
>>> the early success in i915_exit() and only tear down globals and nothing
>>> else.  This way the module loads successfully, regardless of the success
>>> or failure of the tests.  Because we've not enumerated any PCI devices,
>>> no device nodes are created and it's entirely useless from userspace.
>>> The only thing the module does at that point is hold on to a bit of
>>> memory until we unload it and i915_exit() is called.  Importantly, this
>>> means that everything from our selftests has the ability to properly
>>> flush out between i915_init() and i915_exit() because there are a couple
>>> syscall boundaries in between.
>>
>> When you say "couple of syscall boundaries" you mean exactly two (module
>> init/unload) or there is more to it? Like why "couple" is needed and not
>> just that the module load syscall has exited? That part sounds
>> potentially dodgy. What mechanism is used by the delayed flush?
> 
> It only needs the one syscall.  I've changed the text to say "at least
> one syscall boundary".  I think that's more clear without providing an
> exact count which may not be tractable.

One additional syscall _after_ the module load one exits, or just that 
one? What is the barrier used? I don't think "syscall boundary" is an 
established synchronisation term so lets understand fully what's 
happening here.

Regards,

Tvrtko

>> Have you checked how this change interacts with the test runner and CI?
> 
> As far as I know, there's no interesting interaction here.  That said,
> I did just find that the live selftests fail the modprobe on selftest
> failure which means they're tearing down globals before a full syscall
> boundary which may be sketchy.  Fortunately, now that we have
> i915_globals_exit() on the tear-down path if PCI probe fails, if
> someone ever does do something sketchy there, we'll catch it in dmesg
> immediately.  Maybe we should switch those to always return 0 as well
> while we're here?
> 
>>>
>>> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
>>> Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>> ---
>>>    drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
>>>    1 file changed, 25 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
>>> index 4e627b57d31a2..24e4e54516936 100644
>>> --- a/drivers/gpu/drm/i915/i915_pci.c
>>> +++ b/drivers/gpu/drm/i915/i915_pci.c
>>> @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
>>>        .driver.pm = &i915_pm_ops,
>>>    };
>>>
>>> +static bool i915_fully_loaded = false;
>>
>> No need to initialize.
>>
>>> +
>>>    static int __init i915_init(void)
>>>    {
>>>        bool use_kms = true;
>>>        int err;
>>>
>>> +     i915_fully_loaded = false;
>>
>> Ditto.
>>
>>> +
>>>        err = i915_globals_init();
>>>        if (err)
>>>                return err;
>>>
>>> +     /* i915_mock_selftests() only returns zero if no mock subtests were
>>
>>
>> /*
>>    * Please use this multi line comment style in i915.
>>    */
>>
>>
>>> +      * run.  If we get any non-zero error code, we return early here.
>>> +      * We always return success because selftests may have allocated
>>> +      * objects from slabs which will get cleaned up by i915_exit().  We
>>> +      * could attempt to clean up immediately and fail module load but,
>>> +      * thanks to interactions with other parts of the kernel (struct
>>> +      * file, in particular), it's safer to let the module fully load
>>> +      * and then clean up on unload.
>>> +      */
>>>        err = i915_mock_selftests();
>>>        if (err)
>>> -             return err > 0 ? 0 : err;
>>> +             return 0;
>>>
>>>        /*
>>>         * Enable KMS by default, unless explicitly overriden by
>>> @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
>>>                return 0;
>>>        }
>>>
>>> +     /* After this point, i915_init() must either fully succeed or
>>> +      * properly tear everything down and fail.  We don't have separate
>>> +      * flags for each set-up bit.
>>> +      */
>>> +     i915_fully_loaded = true;
>>> +
>>>        i915_pmu_init();
>>>
>>>        err = pci_register_driver(&i915_pci_driver);
>>> @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
>>>
>>>    static void __exit i915_exit(void)
>>>    {
>>> -     if (!i915_pci_driver.driver.owner)
>>> -             return;
>>> -
>>> -     i915_perf_sysctl_unregister();
>>> -     pci_unregister_driver(&i915_pci_driver);
>>> -     i915_pmu_exit();
>>> +     if (i915_fully_loaded) {
>>> +             i915_perf_sysctl_unregister();
>>> +             pci_unregister_driver(&i915_pci_driver);
>>> +             i915_pmu_exit();
>>> +     }
>>>        i915_globals_exit();
>>>    }
>>>
>>>
>>
>> Regards,
>>
>> Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20 14:53       ` Jason Ekstrand
@ 2021-07-20 15:30         ` Tvrtko Ursulin
  -1 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20 15:30 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers


Now you confused me with two replies I forgot to reply to all... :))

On 20/07/2021 15:53, Jason Ekstrand wrote:
> On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:

[snip]

>>> +
>>>    static int __init i915_init(void)
>>>    {
>>>        bool use_kms = true;
>>>        int err;
>>>
>>> +     i915_fully_loaded = false;
>>
>> Ditto.
> 
> So, this is something I'm unclear on.  I know that static memory gets
> auto-initialized to zero but what happens if you unload and reload a
> module?  Is it re-initialized to zero?  If it is then we can drop
> this.

Well it's not in memory after it is unloaded so clearly it has to get 
initialised by the module loader every time it loads it.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-20 15:30         ` Tvrtko Ursulin
  0 siblings, 0 replies; 45+ messages in thread
From: Tvrtko Ursulin @ 2021-07-20 15:30 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers


Now you confused me with two replies I forgot to reply to all... :))

On 20/07/2021 15:53, Jason Ekstrand wrote:
> On Tue, Jul 20, 2021 at 3:25 AM Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:

[snip]

>>> +
>>>    static int __init i915_init(void)
>>>    {
>>>        bool use_kms = true;
>>>        int err;
>>>
>>> +     i915_fully_loaded = false;
>>
>> Ditto.
> 
> So, this is something I'm unclear on.  I know that static memory gets
> auto-initialized to zero but what happens if you unload and reload a
> module?  Is it re-initialized to zero?  If it is then we can drop
> this.

Well it's not in memory after it is unloaded so clearly it has to get 
initialised by the module loader every time it loads it.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-20 14:55       ` [Intel-gfx] " Jason Ekstrand
@ 2021-07-21 11:26         ` Daniel Vetter
  -1 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-21 11:26 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers

On Tue, Jul 20, 2021 at 09:55:22AM -0500, Jason Ekstrand wrote:
> On Tue, Jul 20, 2021 at 9:18 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> > > If the driver was not fully loaded, we may still have globals lying
> > > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > > run mock selftests.  In either case, we have an early exit from
> > > i915_init which happens after i915_globals_init() and we need to clean
> > > up those globals.  While we're here, add an explicit boolean instead of
> > > using a random field from i915_pci_device to detect partial loads.
> > >
> > > The mock selftests case gets especially sticky.  The load isn't entirely
> > > a no-op.  We actually do quite a bit inside those selftests including
> > > allocating a bunch of mock objects and running tests on them.  Once all
> > > those tests are complete, we exit early from i915_init().  Perviously,
> > > i915_init() would return a non-zero error code on failure and a zero
> > > error code on success.  In the success case, we would get to i915_exit()
> > > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > > and do nothing.  In the failure case, we would fail i915_init() but
> > > there would be no opportunity to clean up globals.
> > >
> > > The most annoying part is that you don't actually notice the failure as
> > > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > > result in anything observable from userspace.  Instead, the next time we
> > > load the driver (usually for next IGT test), i915_globals_init() gets
> > > invoked again, we go to allocate a bunch of new memory slabs, those
> > > implicitly create debugfs entries, and debugfs warns that we're trying
> > > to create directories and files that already exist.  Since this all
> > > happens as part of the next driver load, it shows up in the dmesg-warn
> > > of whatever IGT test ran after the mock selftests.
> > >
> > > While the obvious thing to do here might be to call i915_globals_exit()
> > > after selftests, that's not actually safe.  The dma-buf selftests call
> > > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > > the resulting dmabuf which calls fput() on the file.  However, fput()
> > > isn't immediate and gets flushed right before syscall returns.  This
> > > means that all the fput()s from the selftests don't happen until right
> > > before the module load syscall used to fire off the selftests returns
> > > which is after i915_init().  If we call i915_globals_exit() in
> > > i915_init() after selftests, we end up freeing slabs out from under
> > > objects which won't get released until fput() is flushed at the end of
> > > the module load.
> > >
> > > The solution here is to let i915_init() return success early and detect
> > > the early success in i915_exit() and only tear down globals and nothing
> > > else.  This way the module loads successfully, regardless of the success
> > > or failure of the tests.  Because we've not enumerated any PCI devices,
> > > no device nodes are created and it's entirely useless from userspace.
> > > The only thing the module does at that point is hold on to a bit of
> > > memory until we unload it and i915_exit() is called.  Importantly, this
> > > means that everything from our selftests has the ability to properly
> > > flush out between i915_init() and i915_exit() because there are a couple
> > > syscall boundaries in between.
> > >
> > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > ---
> > >  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> > >  1 file changed, 25 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > > index 4e627b57d31a2..24e4e54516936 100644
> > > --- a/drivers/gpu/drm/i915/i915_pci.c
> > > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> > >       .driver.pm = &i915_pm_ops,
> > >  };
> > >
> > > +static bool i915_fully_loaded = false;
> > > +
> > >  static int __init i915_init(void)
> > >  {
> > >       bool use_kms = true;
> > >       int err;
> > >
> > > +     i915_fully_loaded = false;
> > > +
> > >       err = i915_globals_init();
> > >       if (err)
> > >               return err;
> > >
> > > +     /* i915_mock_selftests() only returns zero if no mock subtests were
> > > +      * run.  If we get any non-zero error code, we return early here.
> > > +      * We always return success because selftests may have allocated
> > > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > > +      * could attempt to clean up immediately and fail module load but,
> > > +      * thanks to interactions with other parts of the kernel (struct
> > > +      * file, in particular), it's safer to let the module fully load
> > > +      * and then clean up on unload.
> > > +      */
> > >       err = i915_mock_selftests();
> > >       if (err)
> > > -             return err > 0 ? 0 : err;
> > > +             return 0;
> >
> > At least the module options still claim that you can run selftests and
> > still load the driver. Which makes sense for perf/hw selftests, since
> > those need the driver, but would result in the same old bug resurfacing
> > that you're trying to fix there.
> >
> > Is that description just confused and needs some fixing, or do we have a
> > gap here?
> 
> I don't think there's real need for a fully loaded driver after mock
> selftests.  They exist entirely to run against a mock driver, not the
> real one.

Can you pls update the module option help then for the next round?
-Daniel

> 
> > Patch itself looks reasonable, with the nits from Tvrtko addressed:
> 
> Done
> 
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Thanks
> 
> --Jason
> 
> > >
> > >       /*
> > >        * Enable KMS by default, unless explicitly overriden by
> > > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> > >               return 0;
> > >       }
> > >
> > > +     /* After this point, i915_init() must either fully succeed or
> > > +      * properly tear everything down and fail.  We don't have separate
> > > +      * flags for each set-up bit.
> > > +      */
> > > +     i915_fully_loaded = true;
> > > +
> > >       i915_pmu_init();
> > >
> > >       err = pci_register_driver(&i915_pci_driver);
> > > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> > >
> > >  static void __exit i915_exit(void)
> > >  {
> > > -     if (!i915_pci_driver.driver.owner)
> > > -             return;
> > > -
> > > -     i915_perf_sysctl_unregister();
> > > -     pci_unregister_driver(&i915_pci_driver);
> > > -     i915_pmu_exit();
> > > +     if (i915_fully_loaded) {
> > > +             i915_perf_sysctl_unregister();
> > > +             pci_unregister_driver(&i915_pci_driver);
> > > +             i915_pmu_exit();
> > > +     }
> > >       i915_globals_exit();
> > >  }
> > >
> > > --
> > > 2.31.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-21 11:26         ` Daniel Vetter
  0 siblings, 0 replies; 45+ messages in thread
From: Daniel Vetter @ 2021-07-21 11:26 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Intel GFX, Maling list - DRI developers

On Tue, Jul 20, 2021 at 09:55:22AM -0500, Jason Ekstrand wrote:
> On Tue, Jul 20, 2021 at 9:18 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> > > If the driver was not fully loaded, we may still have globals lying
> > > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > > run mock selftests.  In either case, we have an early exit from
> > > i915_init which happens after i915_globals_init() and we need to clean
> > > up those globals.  While we're here, add an explicit boolean instead of
> > > using a random field from i915_pci_device to detect partial loads.
> > >
> > > The mock selftests case gets especially sticky.  The load isn't entirely
> > > a no-op.  We actually do quite a bit inside those selftests including
> > > allocating a bunch of mock objects and running tests on them.  Once all
> > > those tests are complete, we exit early from i915_init().  Perviously,
> > > i915_init() would return a non-zero error code on failure and a zero
> > > error code on success.  In the success case, we would get to i915_exit()
> > > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > > and do nothing.  In the failure case, we would fail i915_init() but
> > > there would be no opportunity to clean up globals.
> > >
> > > The most annoying part is that you don't actually notice the failure as
> > > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > > result in anything observable from userspace.  Instead, the next time we
> > > load the driver (usually for next IGT test), i915_globals_init() gets
> > > invoked again, we go to allocate a bunch of new memory slabs, those
> > > implicitly create debugfs entries, and debugfs warns that we're trying
> > > to create directories and files that already exist.  Since this all
> > > happens as part of the next driver load, it shows up in the dmesg-warn
> > > of whatever IGT test ran after the mock selftests.
> > >
> > > While the obvious thing to do here might be to call i915_globals_exit()
> > > after selftests, that's not actually safe.  The dma-buf selftests call
> > > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > > the resulting dmabuf which calls fput() on the file.  However, fput()
> > > isn't immediate and gets flushed right before syscall returns.  This
> > > means that all the fput()s from the selftests don't happen until right
> > > before the module load syscall used to fire off the selftests returns
> > > which is after i915_init().  If we call i915_globals_exit() in
> > > i915_init() after selftests, we end up freeing slabs out from under
> > > objects which won't get released until fput() is flushed at the end of
> > > the module load.
> > >
> > > The solution here is to let i915_init() return success early and detect
> > > the early success in i915_exit() and only tear down globals and nothing
> > > else.  This way the module loads successfully, regardless of the success
> > > or failure of the tests.  Because we've not enumerated any PCI devices,
> > > no device nodes are created and it's entirely useless from userspace.
> > > The only thing the module does at that point is hold on to a bit of
> > > memory until we unload it and i915_exit() is called.  Importantly, this
> > > means that everything from our selftests has the ability to properly
> > > flush out between i915_init() and i915_exit() because there are a couple
> > > syscall boundaries in between.
> > >
> > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > ---
> > >  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> > >  1 file changed, 25 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > > index 4e627b57d31a2..24e4e54516936 100644
> > > --- a/drivers/gpu/drm/i915/i915_pci.c
> > > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> > >       .driver.pm = &i915_pm_ops,
> > >  };
> > >
> > > +static bool i915_fully_loaded = false;
> > > +
> > >  static int __init i915_init(void)
> > >  {
> > >       bool use_kms = true;
> > >       int err;
> > >
> > > +     i915_fully_loaded = false;
> > > +
> > >       err = i915_globals_init();
> > >       if (err)
> > >               return err;
> > >
> > > +     /* i915_mock_selftests() only returns zero if no mock subtests were
> > > +      * run.  If we get any non-zero error code, we return early here.
> > > +      * We always return success because selftests may have allocated
> > > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > > +      * could attempt to clean up immediately and fail module load but,
> > > +      * thanks to interactions with other parts of the kernel (struct
> > > +      * file, in particular), it's safer to let the module fully load
> > > +      * and then clean up on unload.
> > > +      */
> > >       err = i915_mock_selftests();
> > >       if (err)
> > > -             return err > 0 ? 0 : err;
> > > +             return 0;
> >
> > At least the module options still claim that you can run selftests and
> > still load the driver. Which makes sense for perf/hw selftests, since
> > those need the driver, but would result in the same old bug resurfacing
> > that you're trying to fix there.
> >
> > Is that description just confused and needs some fixing, or do we have a
> > gap here?
> 
> I don't think there's real need for a fully loaded driver after mock
> selftests.  They exist entirely to run against a mock driver, not the
> real one.

Can you pls update the module option help then for the next round?
-Daniel

> 
> > Patch itself looks reasonable, with the nits from Tvrtko addressed:
> 
> Done
> 
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Thanks
> 
> --Jason
> 
> > >
> > >       /*
> > >        * Enable KMS by default, unless explicitly overriden by
> > > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> > >               return 0;
> > >       }
> > >
> > > +     /* After this point, i915_init() must either fully succeed or
> > > +      * properly tear everything down and fail.  We don't have separate
> > > +      * flags for each set-up bit.
> > > +      */
> > > +     i915_fully_loaded = true;
> > > +
> > >       i915_pmu_init();
> > >
> > >       err = pci_register_driver(&i915_pci_driver);
> > > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> > >
> > >  static void __exit i915_exit(void)
> > >  {
> > > -     if (!i915_pci_driver.driver.owner)
> > > -             return;
> > > -
> > > -     i915_perf_sysctl_unregister();
> > > -     pci_unregister_driver(&i915_pci_driver);
> > > -     i915_pmu_exit();
> > > +     if (i915_fully_loaded) {
> > > +             i915_perf_sysctl_unregister();
> > > +             pci_unregister_driver(&i915_pci_driver);
> > > +             i915_pmu_exit();
> > > +     }
> > >       i915_globals_exit();
> > >  }
> > >
> > > --
> > > 2.31.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
  2021-07-21 11:26         ` [Intel-gfx] " Daniel Vetter
@ 2021-07-21 15:20           ` Jason Ekstrand
  -1 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-21 15:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel GFX, Maling list - DRI developers

On Wed, Jul 21, 2021 at 6:26 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 20, 2021 at 09:55:22AM -0500, Jason Ekstrand wrote:
> > On Tue, Jul 20, 2021 at 9:18 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> > > > If the driver was not fully loaded, we may still have globals lying
> > > > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > > > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > > > run mock selftests.  In either case, we have an early exit from
> > > > i915_init which happens after i915_globals_init() and we need to clean
> > > > up those globals.  While we're here, add an explicit boolean instead of
> > > > using a random field from i915_pci_device to detect partial loads.
> > > >
> > > > The mock selftests case gets especially sticky.  The load isn't entirely
> > > > a no-op.  We actually do quite a bit inside those selftests including
> > > > allocating a bunch of mock objects and running tests on them.  Once all
> > > > those tests are complete, we exit early from i915_init().  Perviously,
> > > > i915_init() would return a non-zero error code on failure and a zero
> > > > error code on success.  In the success case, we would get to i915_exit()
> > > > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > > > and do nothing.  In the failure case, we would fail i915_init() but
> > > > there would be no opportunity to clean up globals.
> > > >
> > > > The most annoying part is that you don't actually notice the failure as
> > > > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > > > result in anything observable from userspace.  Instead, the next time we
> > > > load the driver (usually for next IGT test), i915_globals_init() gets
> > > > invoked again, we go to allocate a bunch of new memory slabs, those
> > > > implicitly create debugfs entries, and debugfs warns that we're trying
> > > > to create directories and files that already exist.  Since this all
> > > > happens as part of the next driver load, it shows up in the dmesg-warn
> > > > of whatever IGT test ran after the mock selftests.
> > > >
> > > > While the obvious thing to do here might be to call i915_globals_exit()
> > > > after selftests, that's not actually safe.  The dma-buf selftests call
> > > > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > > > the resulting dmabuf which calls fput() on the file.  However, fput()
> > > > isn't immediate and gets flushed right before syscall returns.  This
> > > > means that all the fput()s from the selftests don't happen until right
> > > > before the module load syscall used to fire off the selftests returns
> > > > which is after i915_init().  If we call i915_globals_exit() in
> > > > i915_init() after selftests, we end up freeing slabs out from under
> > > > objects which won't get released until fput() is flushed at the end of
> > > > the module load.
> > > >
> > > > The solution here is to let i915_init() return success early and detect
> > > > the early success in i915_exit() and only tear down globals and nothing
> > > > else.  This way the module loads successfully, regardless of the success
> > > > or failure of the tests.  Because we've not enumerated any PCI devices,
> > > > no device nodes are created and it's entirely useless from userspace.
> > > > The only thing the module does at that point is hold on to a bit of
> > > > memory until we unload it and i915_exit() is called.  Importantly, this
> > > > means that everything from our selftests has the ability to properly
> > > > flush out between i915_init() and i915_exit() because there are a couple
> > > > syscall boundaries in between.
> > > >
> > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> > > >  1 file changed, 25 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > > > index 4e627b57d31a2..24e4e54516936 100644
> > > > --- a/drivers/gpu/drm/i915/i915_pci.c
> > > > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > > > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> > > >       .driver.pm = &i915_pm_ops,
> > > >  };
> > > >
> > > > +static bool i915_fully_loaded = false;
> > > > +
> > > >  static int __init i915_init(void)
> > > >  {
> > > >       bool use_kms = true;
> > > >       int err;
> > > >
> > > > +     i915_fully_loaded = false;
> > > > +
> > > >       err = i915_globals_init();
> > > >       if (err)
> > > >               return err;
> > > >
> > > > +     /* i915_mock_selftests() only returns zero if no mock subtests were
> > > > +      * run.  If we get any non-zero error code, we return early here.
> > > > +      * We always return success because selftests may have allocated
> > > > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > > > +      * could attempt to clean up immediately and fail module load but,
> > > > +      * thanks to interactions with other parts of the kernel (struct
> > > > +      * file, in particular), it's safer to let the module fully load
> > > > +      * and then clean up on unload.
> > > > +      */
> > > >       err = i915_mock_selftests();
> > > >       if (err)
> > > > -             return err > 0 ? 0 : err;
> > > > +             return 0;
> > >
> > > At least the module options still claim that you can run selftests and
> > > still load the driver. Which makes sense for perf/hw selftests, since
> > > those need the driver, but would result in the same old bug resurfacing
> > > that you're trying to fix there.
> > >
> > > Is that description just confused and needs some fixing, or do we have a
> > > gap here?
> >
> > I don't think there's real need for a fully loaded driver after mock
> > selftests.  They exist entirely to run against a mock driver, not the
> > real one.
>
> Can you pls update the module option help then for the next round?

Done.

> -Daniel
>
> >
> > > Patch itself looks reasonable, with the nits from Tvrtko addressed:
> >
> > Done
> >
> > > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > Thanks
> >
> > --Jason
> >
> > > >
> > > >       /*
> > > >        * Enable KMS by default, unless explicitly overriden by
> > > > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> > > >               return 0;
> > > >       }
> > > >
> > > > +     /* After this point, i915_init() must either fully succeed or
> > > > +      * properly tear everything down and fail.  We don't have separate
> > > > +      * flags for each set-up bit.
> > > > +      */
> > > > +     i915_fully_loaded = true;
> > > > +
> > > >       i915_pmu_init();
> > > >
> > > >       err = pci_register_driver(&i915_pci_driver);
> > > > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> > > >
> > > >  static void __exit i915_exit(void)
> > > >  {
> > > > -     if (!i915_pci_driver.driver.owner)
> > > > -             return;
> > > > -
> > > > -     i915_perf_sysctl_unregister();
> > > > -     pci_unregister_driver(&i915_pci_driver);
> > > > -     i915_pmu_exit();
> > > > +     if (i915_fully_loaded) {
> > > > +             i915_perf_sysctl_unregister();
> > > > +             pci_unregister_driver(&i915_pci_driver);
> > > > +             i915_pmu_exit();
> > > > +     }
> > > >       i915_globals_exit();
> > > >  }
> > > >
> > > > --
> > > > 2.31.1
> > > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Intel-gfx] [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit()
@ 2021-07-21 15:20           ` Jason Ekstrand
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Ekstrand @ 2021-07-21 15:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Intel GFX, Maling list - DRI developers

On Wed, Jul 21, 2021 at 6:26 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 20, 2021 at 09:55:22AM -0500, Jason Ekstrand wrote:
> > On Tue, Jul 20, 2021 at 9:18 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Mon, Jul 19, 2021 at 01:30:44PM -0500, Jason Ekstrand wrote:
> > > > If the driver was not fully loaded, we may still have globals lying
> > > > around.  If we don't tear those down in i915_exit(), we'll leak a bunch
> > > > of memory slabs.  This can happen two ways: use_kms = false and if we've
> > > > run mock selftests.  In either case, we have an early exit from
> > > > i915_init which happens after i915_globals_init() and we need to clean
> > > > up those globals.  While we're here, add an explicit boolean instead of
> > > > using a random field from i915_pci_device to detect partial loads.
> > > >
> > > > The mock selftests case gets especially sticky.  The load isn't entirely
> > > > a no-op.  We actually do quite a bit inside those selftests including
> > > > allocating a bunch of mock objects and running tests on them.  Once all
> > > > those tests are complete, we exit early from i915_init().  Perviously,
> > > > i915_init() would return a non-zero error code on failure and a zero
> > > > error code on success.  In the success case, we would get to i915_exit()
> > > > and check i915_pci_driver.driver.owner to detect if i915_init exited early
> > > > and do nothing.  In the failure case, we would fail i915_init() but
> > > > there would be no opportunity to clean up globals.
> > > >
> > > > The most annoying part is that you don't actually notice the failure as
> > > > part of the self-tests since leaking a bit of memory, while bad, doesn't
> > > > result in anything observable from userspace.  Instead, the next time we
> > > > load the driver (usually for next IGT test), i915_globals_init() gets
> > > > invoked again, we go to allocate a bunch of new memory slabs, those
> > > > implicitly create debugfs entries, and debugfs warns that we're trying
> > > > to create directories and files that already exist.  Since this all
> > > > happens as part of the next driver load, it shows up in the dmesg-warn
> > > > of whatever IGT test ran after the mock selftests.
> > > >
> > > > While the obvious thing to do here might be to call i915_globals_exit()
> > > > after selftests, that's not actually safe.  The dma-buf selftests call
> > > > i915_gem_prime_export which creates a file.  We call dma_buf_put() on
> > > > the resulting dmabuf which calls fput() on the file.  However, fput()
> > > > isn't immediate and gets flushed right before syscall returns.  This
> > > > means that all the fput()s from the selftests don't happen until right
> > > > before the module load syscall used to fire off the selftests returns
> > > > which is after i915_init().  If we call i915_globals_exit() in
> > > > i915_init() after selftests, we end up freeing slabs out from under
> > > > objects which won't get released until fput() is flushed at the end of
> > > > the module load.
> > > >
> > > > The solution here is to let i915_init() return success early and detect
> > > > the early success in i915_exit() and only tear down globals and nothing
> > > > else.  This way the module loads successfully, regardless of the success
> > > > or failure of the tests.  Because we've not enumerated any PCI devices,
> > > > no device nodes are created and it's entirely useless from userspace.
> > > > The only thing the module does at that point is hold on to a bit of
> > > > memory until we unload it and i915_exit() is called.  Importantly, this
> > > > means that everything from our selftests has the ability to properly
> > > > flush out between i915_init() and i915_exit() because there are a couple
> > > > syscall boundaries in between.
> > > >
> > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > > Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
> > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_pci.c | 32 +++++++++++++++++++++++++-------
> > > >  1 file changed, 25 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
> > > > index 4e627b57d31a2..24e4e54516936 100644
> > > > --- a/drivers/gpu/drm/i915/i915_pci.c
> > > > +++ b/drivers/gpu/drm/i915/i915_pci.c
> > > > @@ -1194,18 +1194,31 @@ static struct pci_driver i915_pci_driver = {
> > > >       .driver.pm = &i915_pm_ops,
> > > >  };
> > > >
> > > > +static bool i915_fully_loaded = false;
> > > > +
> > > >  static int __init i915_init(void)
> > > >  {
> > > >       bool use_kms = true;
> > > >       int err;
> > > >
> > > > +     i915_fully_loaded = false;
> > > > +
> > > >       err = i915_globals_init();
> > > >       if (err)
> > > >               return err;
> > > >
> > > > +     /* i915_mock_selftests() only returns zero if no mock subtests were
> > > > +      * run.  If we get any non-zero error code, we return early here.
> > > > +      * We always return success because selftests may have allocated
> > > > +      * objects from slabs which will get cleaned up by i915_exit().  We
> > > > +      * could attempt to clean up immediately and fail module load but,
> > > > +      * thanks to interactions with other parts of the kernel (struct
> > > > +      * file, in particular), it's safer to let the module fully load
> > > > +      * and then clean up on unload.
> > > > +      */
> > > >       err = i915_mock_selftests();
> > > >       if (err)
> > > > -             return err > 0 ? 0 : err;
> > > > +             return 0;
> > >
> > > At least the module options still claim that you can run selftests and
> > > still load the driver. Which makes sense for perf/hw selftests, since
> > > those need the driver, but would result in the same old bug resurfacing
> > > that you're trying to fix there.
> > >
> > > Is that description just confused and needs some fixing, or do we have a
> > > gap here?
> >
> > I don't think there's real need for a fully loaded driver after mock
> > selftests.  They exist entirely to run against a mock driver, not the
> > real one.
>
> Can you pls update the module option help then for the next round?

Done.

> -Daniel
>
> >
> > > Patch itself looks reasonable, with the nits from Tvrtko addressed:
> >
> > Done
> >
> > > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > Thanks
> >
> > --Jason
> >
> > > >
> > > >       /*
> > > >        * Enable KMS by default, unless explicitly overriden by
> > > > @@ -1225,6 +1238,12 @@ static int __init i915_init(void)
> > > >               return 0;
> > > >       }
> > > >
> > > > +     /* After this point, i915_init() must either fully succeed or
> > > > +      * properly tear everything down and fail.  We don't have separate
> > > > +      * flags for each set-up bit.
> > > > +      */
> > > > +     i915_fully_loaded = true;
> > > > +
> > > >       i915_pmu_init();
> > > >
> > > >       err = pci_register_driver(&i915_pci_driver);
> > > > @@ -1240,12 +1259,11 @@ static int __init i915_init(void)
> > > >
> > > >  static void __exit i915_exit(void)
> > > >  {
> > > > -     if (!i915_pci_driver.driver.owner)
> > > > -             return;
> > > > -
> > > > -     i915_perf_sysctl_unregister();
> > > > -     pci_unregister_driver(&i915_pci_driver);
> > > > -     i915_pmu_exit();
> > > > +     if (i915_fully_loaded) {
> > > > +             i915_perf_sysctl_unregister();
> > > > +             pci_unregister_driver(&i915_pci_driver);
> > > > +             i915_pmu_exit();
> > > > +     }
> > > >       i915_globals_exit();
> > > >  }
> > > >
> > > > --
> > > > 2.31.1
> > > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-07-21 15:21 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-19 18:30 [PATCH 0/6] Fix the debugfs splat from mock selftests Jason Ekstrand
2021-07-19 18:30 ` [Intel-gfx] " Jason Ekstrand
2021-07-19 18:30 ` [PATCH 1/6] drm/i915: Call i915_globals_exit() after i915_pmu_exit() Jason Ekstrand
2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
2021-07-19 18:30 ` [PATCH 2/6] drm/i915: Call i915_globals_exit() if pci_register_device() fails Jason Ekstrand
2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
2021-07-20 14:11   ` Daniel Vetter
2021-07-20 14:11     ` [Intel-gfx] " Daniel Vetter
2021-07-19 18:30 ` [PATCH 3/6] drm/i915: Always call i915_globals_exit() from i915_exit() Jason Ekstrand
2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
2021-07-20  8:25   ` Tvrtko Ursulin
2021-07-20  8:25     ` Tvrtko Ursulin
2021-07-20 14:53     ` Jason Ekstrand
2021-07-20 14:53       ` Jason Ekstrand
2021-07-20 15:22       ` Tvrtko Ursulin
2021-07-20 15:22         ` Tvrtko Ursulin
2021-07-20 15:30       ` Tvrtko Ursulin
2021-07-20 15:30         ` Tvrtko Ursulin
2021-07-20 15:05     ` Jason Ekstrand
2021-07-20 15:05       ` Jason Ekstrand
2021-07-20 15:25       ` Tvrtko Ursulin
2021-07-20 15:25         ` Tvrtko Ursulin
2021-07-20 14:18   ` Daniel Vetter
2021-07-20 14:18     ` [Intel-gfx] " Daniel Vetter
2021-07-20 14:55     ` Jason Ekstrand
2021-07-20 14:55       ` [Intel-gfx] " Jason Ekstrand
2021-07-21 11:26       ` Daniel Vetter
2021-07-21 11:26         ` [Intel-gfx] " Daniel Vetter
2021-07-21 15:20         ` Jason Ekstrand
2021-07-21 15:20           ` [Intel-gfx] " Jason Ekstrand
2021-07-19 18:30 ` [PATCH 4/6] drm/ttm: Force re-init if ttm_global_init() fails Jason Ekstrand
2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
2021-07-19 19:21   ` Christian König
2021-07-19 19:21     ` [Intel-gfx] " Christian König
2021-07-19 18:30 ` [PATCH 5/6] drm/ttm: Initialize debugfs from ttm_global_init() Jason Ekstrand
2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
2021-07-20 14:22   ` Daniel Vetter
2021-07-20 14:22     ` [Intel-gfx] " Daniel Vetter
2021-07-19 18:30 ` [PATCH 6/6] drm/i915: Make the kmem slab for i915_buddy_block a global Jason Ekstrand
2021-07-19 18:30   ` [Intel-gfx] " Jason Ekstrand
2021-07-19 18:38 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Fix the debugfs splat from mock selftests Patchwork
2021-07-19 18:38 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2021-07-19 18:42 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-07-19 19:05 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-07-19 23:32 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.