linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
@ 2019-08-09 12:41 Martin Wilck
  2019-08-09 12:53 ` Chris Wilson
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Wilck @ 2019-08-09 12:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: linux-kernel

This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
(5.3-rc3 with just a few patches on top), after starting a KVM virtual
machine. The X screen was frozen. Remote login via ssh was still
possible, thus I was able to retrieve basic logs.

sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
minute, the same two processes were still blocked. KVM seems to try to
acquire a lock that kcompactd is holding. kcompactd is waiting for IO
to complete on pages owned by the i915 driver.

kcompactd stack:

Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State
Aug 09 12:12:48 apollon.suse.de
kernel:   task                        PC stack   pid father
Aug 09 12:12:48 apollon.suse.de kernel:
kcompactd0      D    0    43      2 0x80004000
Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
Aug 09 12:12:48 apollon.suse.de kernel:  io_schedule+0x12/0x40
Aug 09 12:12:48 apollon.suse.de kernel:  __lock_page+0x123/0x200
Aug 09 12:12:48 apollon.suse.de kernel:  ?
gen8_ppgtt_clear_pdp+0xc0/0x140 [i915]
Aug 09 12:12:48 apollon.suse.de kernel:  ?
file_fdatawait_range+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel:  set_page_dirty_lock+0x49/0x50
Aug 09 12:12:48 apollon.suse.de
kernel:  i915_gem_userptr_put_pages+0x13f/0x1c0 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel:  __i915_gem_object_put_pages+0x5e/0xa0 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel:  userptr_mn_invalidate_range_start+0x1ff/0x220 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
Aug 09 12:12:48 apollon.suse.de kernel:  ? __mod_lruvec_state+0x3f/0xf0
Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
Aug 09 12:12:48 apollon.suse.de kernel:  ?
fast_isolate_freepages+0x6b0/0x6b0
Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
Aug 09 12:12:48 apollon.suse.de kernel:  ?
entry_SYSCALL_64_after_hwframe+0xb8/0xbe
Aug 09 12:12:48 apollon.suse.de kernel:  kcompactd_do_work+0x120/0x290


KVM stack:

Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM       D    0
25189      1 0x00000320
Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
Aug 09 12:12:48 apollon.suse.de
kernel:  schedule_preempt_disabled+0xa/0x10
Aug 09 12:12:48 apollon.suse.de
kernel:  __mutex_lock.isra.0+0x172/0x4d0
Aug 09 12:12:48 apollon.suse.de
kernel:  userptr_mn_invalidate_range_start+0x1bf/0x220 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
Aug 09 12:12:48 apollon.suse.de kernel:  ?
fast_isolate_freepages+0x6b0/0x6b0
Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone_order+0xc6/0xf0
Aug 09 12:12:48 apollon.suse.de
kernel:  try_to_compact_pages+0xcc/0x2a0
Aug 09 12:12:48 apollon.suse.de
kernel:  __alloc_pages_direct_compact+0x7c/0x150
Aug 09 12:12:48 apollon.suse.de
kernel:  __alloc_pages_slowpath+0x1ee/0xd00
Aug 09 12:12:48 apollon.suse.de kernel:  ? vmx_vcpu_load+0x100/0x120
[kvm_intel]

Full logs can be found under https://pastebin.com/KJ6tccj4
I haven't yet tried if this is reproducible.

Regards
Martin


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
  2019-08-09 12:41 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages Martin Wilck
@ 2019-08-09 12:53 ` Chris Wilson
  2019-09-10 14:20   ` Leho Kraav
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2019-08-09 12:53 UTC (permalink / raw)
  To: intel-gfx, Martin Wilck; +Cc: linux-kernel

Quoting Martin Wilck (2019-08-09 13:41:42)
> This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> machine. The X screen was frozen. Remote login via ssh was still
> possible, thus I was able to retrieve basic logs.
> 
> sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> minute, the same two processes were still blocked. KVM seems to try to
> acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> to complete on pages owned by the i915 driver.

My bad, it's known. We haven't decided on whether to revert the
unfortunate recursive locking (and so hit another warn elsewhere) or to
ignore the dirty pages (and so risk losing data across swap).

cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
-Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
  2019-08-09 12:53 ` Chris Wilson
@ 2019-09-10 14:20   ` Leho Kraav
  2019-09-12 11:23     ` Martin Wilck
  0 siblings, 1 reply; 9+ messages in thread
From: Leho Kraav @ 2019-09-10 14:20 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Martin Wilck, linux-kernel

On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote:
> Quoting Martin Wilck (2019-08-09 13:41:42)
> > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> > (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> > machine. The X screen was frozen. Remote login via ssh was still
> > possible, thus I was able to retrieve basic logs.
> > 
> > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> > minute, the same two processes were still blocked. KVM seems to try to
> > acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> > to complete on pages owned by the i915 driver.
> 
> My bad, it's known. We haven't decided on whether to revert the
> unfortunate recursive locking (and so hit another warn elsewhere) or to
> ignore the dirty pages (and so risk losing data across swap).
> 
> cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
> -Chris

Hi Chris. Is this exactly what I'm hitting at
https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps?

It reliably breaks the graphics userland, as the machine consistently
freezes at any random moment.

Any workaround options, even if with a performance penalty? Revert
cb6d7c7dc7ff but side effects?

5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick
with the newest if possible.

-- 
Leho Kraav, senior technology & digital marketing architect

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
  2019-09-10 14:20   ` Leho Kraav
@ 2019-09-12 11:23     ` Martin Wilck
  2019-09-12 11:58       ` leho
  2019-09-12 11:59       ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Martin Wilck @ 2019-09-12 11:23 UTC (permalink / raw)
  To: chris; +Cc: torvalds, Michal Koutny, intel-gfx, leho, tiwai, linux-kernel

Hi Chris,

On Tue, 2019-09-10 at 17:20 +0300, Leho Kraav wrote:
> On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote:
> > Quoting Martin Wilck (2019-08-09 13:41:42)
> > > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-
> > > default
> > > (5.3-rc3 with just a few patches on top), after starting a KVM
> > > virtual
> > > machine. The X screen was frozen. Remote login via ssh was still
> > > possible, thus I was able to retrieve basic logs.
> > > 
> > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After
> > > a
> > > minute, the same two processes were still blocked. KVM seems to
> > > try to
> > > acquire a lock that kcompactd is holding. kcompactd is waiting
> > > for IO
> > > to complete on pages owned by the i915 driver.
> > 
> > My bad, it's known. We haven't decided on whether to revert the
> > unfortunate recursive locking (and so hit another warn elsewhere)
> > or to
> > ignore the dirty pages (and so risk losing data across swap).
> > 
> > cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around
> > set_page_dirty()")
> > -Chris
> 
> Hi Chris. Is this exactly what I'm hitting at
> https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps?
> 
> It reliably breaks the graphics userland, as the machine consistently
> freezes at any random moment.
> 
> Any workaround options, even if with a performance penalty? Revert
> cb6d7c7dc7ff but side effects?
> 
> 5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick
> with the newest if possible.

There's a considerable risk that many users will start seeing this
regression when 5.3 is released. I am not aware of a workaround.

Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
Acquire the page lock around set_page_dirty()")? And if we do, what
would be the consequences? Would other patches need to be reverted,
too?

Thanks,
Martin


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
  2019-09-12 11:23     ` Martin Wilck
@ 2019-09-12 11:58       ` leho
  2019-09-12 11:59       ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: leho @ 2019-09-12 11:58 UTC (permalink / raw)
  To: Martin Wilck
  Cc: chris, torvalds, Michal Koutny, intel-gfx, tiwai, linux-kernel

On Thu, Sep 12, 2019 at 11:23:09AM +0000, Martin Wilck wrote:
> 
> There's a considerable risk that many users will start seeing this
> regression when 5.3 is released. I am not aware of a workaround.
> 
> Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> Acquire the page lock around set_page_dirty()")? And if we do, what
> would be the consequences? Would other patches need to be reverted,
> too?

I've been running with revert patch for a couple of days and have not
encountered any kernel warnings thus far, nor any other ill effects that
could be attributed to this locking mechanism.

But I'm far from familiar with these subsystems.

Graphics does not hang anymore.

I've also received developer feedback in private that this really should
be fixed before 5.3 release.

-- 
Leho Kraav, senior technology & digital marketing architect

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
  2019-09-12 11:23     ` Martin Wilck
  2019-09-12 11:58       ` leho
@ 2019-09-12 11:59       ` Linus Torvalds
  2019-09-12 12:47         ` Chris Wilson
  2019-09-12 12:56         ` [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()" Chris Wilson
  1 sibling, 2 replies; 9+ messages in thread
From: Linus Torvalds @ 2019-09-12 11:59 UTC (permalink / raw)
  To: Martin Wilck; +Cc: chris, Michal Koutny, intel-gfx, leho, tiwai, linux-kernel

On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck <Martin.Wilck@suse.com> wrote:
>
> Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> Acquire the page lock around set_page_dirty()")? And if we do, what
> would be the consequences? Would other patches need to be reverted,
> too?

Looking at that commit, and the backtrace of the lockup, I think that
reverting it is the correct thing to do.

You can't take the page lock in invalidate_range(), since it's called
from try_to_unmap(), which is called with the page lock already held.

So commit aa56a292ce62 is just fundamentally completely wrong and
should be reverted.

               Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
  2019-09-12 11:59       ` Linus Torvalds
@ 2019-09-12 12:47         ` Chris Wilson
  2019-09-12 12:56         ` [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()" Chris Wilson
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Wilson @ 2019-09-12 12:47 UTC (permalink / raw)
  To: Linus Torvalds, Martin Wilck
  Cc: Michal Koutny, intel-gfx, leho, tiwai, linux-kernel

Quoting Linus Torvalds (2019-09-12 12:59:25)
> On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck <Martin.Wilck@suse.com> wrote:
> >
> > Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> > Acquire the page lock around set_page_dirty()")? And if we do, what
> > would be the consequences? Would other patches need to be reverted,
> > too?
> 
> Looking at that commit, and the backtrace of the lockup, I think that
> reverting it is the correct thing to do.
> 
> You can't take the page lock in invalidate_range(), since it's called
> from try_to_unmap(), which is called with the page lock already held.
> 
> So commit aa56a292ce62 is just fundamentally completely wrong and
> should be reverted.

There's still the dilemma that we get called without the page lock, but
at this moment in time in order to hit 5.3, it needs a revert sent
directly to Linus.
-Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"
  2019-09-12 11:59       ` Linus Torvalds
  2019-09-12 12:47         ` Chris Wilson
@ 2019-09-12 12:56         ` Chris Wilson
  2019-09-12 14:39           ` [Intel-gfx] " Tvrtko Ursulin
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Wilson @ 2019-09-12 12:56 UTC (permalink / raw)
  To: intel-gfx, torvalds
  Cc: Martin.Wilck, MKoutny, leho, tiwai, linux-kernel, Chris Wilson,
	Lionel Landwerlin, Tvrtko Ursulin, Jani Nikula, Joonas Lahtinen,
	stable

The userptr put_pages can be called from inside try_to_unmap, and so
enters with the page lock held on one of the object's backing pages. We
cannot take the page lock ourselves for fear of recursion.

Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reported-by: Martin Wilck <Martin.Wilck@suse.com>
Reported-by: Leo Kraav <leho@kraav.com>
Fixes: aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
References: https://bugzilla.kernel.org/show_bug.cgi?id=203317
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 74da35611d7c..11b231c187c5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -672,15 +672,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
 
 	for_each_sgt_page(page, sgt_iter, pages) {
 		if (obj->mm.dirty)
-			/*
-			 * As this may not be anonymous memory (e.g. shmem)
-			 * but exist on a real mapping, we have to lock
-			 * the page in order to dirty it -- holding
-			 * the page reference is not sufficient to
-			 * prevent the inode from being truncated.
-			 * Play safe and take the lock.
-			 */
-			set_page_dirty_lock(page);
+			set_page_dirty(page);
 
 		mark_page_accessed(page);
 		put_page(page);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Intel-gfx] [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"
  2019-09-12 12:56         ` [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()" Chris Wilson
@ 2019-09-12 14:39           ` Tvrtko Ursulin
  0 siblings, 0 replies; 9+ messages in thread
From: Tvrtko Ursulin @ 2019-09-12 14:39 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, torvalds
  Cc: tiwai, linux-kernel, leho, Jani Nikula, MKoutny, stable, Martin.Wilck


On 12/09/2019 13:56, Chris Wilson wrote:
> The userptr put_pages can be called from inside try_to_unmap, and so
> enters with the page lock held on one of the object's backing pages. We
> cannot take the page lock ourselves for fear of recursion.
> 
> Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Reported-by: Martin Wilck <Martin.Wilck@suse.com>
> Reported-by: Leo Kraav <leho@kraav.com>
> Fixes: aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
> References: https://bugzilla.kernel.org/show_bug.cgi?id=203317
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: stable@vger.kernel.org
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 10 +---------
>   1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> index 74da35611d7c..11b231c187c5 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> @@ -672,15 +672,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
>   
>   	for_each_sgt_page(page, sgt_iter, pages) {
>   		if (obj->mm.dirty)
> -			/*
> -			 * As this may not be anonymous memory (e.g. shmem)
> -			 * but exist on a real mapping, we have to lock
> -			 * the page in order to dirty it -- holding
> -			 * the page reference is not sufficient to
> -			 * prevent the inode from being truncated.
> -			 * Play safe and take the lock.
> -			 */
> -			set_page_dirty_lock(page);
> +			set_page_dirty(page);
>   
>   		mark_page_accessed(page);
>   		put_page(page);
> 

Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-09-12 14:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-09 12:41 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages Martin Wilck
2019-08-09 12:53 ` Chris Wilson
2019-09-10 14:20   ` Leho Kraav
2019-09-12 11:23     ` Martin Wilck
2019-09-12 11:58       ` leho
2019-09-12 11:59       ` Linus Torvalds
2019-09-12 12:47         ` Chris Wilson
2019-09-12 12:56         ` [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()" Chris Wilson
2019-09-12 14:39           ` [Intel-gfx] " Tvrtko Ursulin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).