All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wilson <chris@chris-wilson.co.uk>
To: Daniel Vetter <daniel.vetter@ffwll.ch>,
	"Intel Graphics Development" <intel-gfx@lists.freedesktop.org>
Cc: "LKML" <linux-kernel@vger.kernel.org>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	"Mika Kuoppala" <mika.kuoppala@intel.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Marta Lofstedt" <marta.lofstedt@intel.com>,
	"Daniel Vetter" <daniel.vetter@intel.com>
Subject: Re: [PATCH 2/2] drm/i915: Use rcu instead of stop_machine in set_wedged
Date: Fri, 06 Oct 2017 10:17:58 +0100	[thread overview]
Message-ID: <150728147814.14660.6817347768155085898@mail.alporthouse.com> (raw)
In-Reply-To: <20171006090637.25545-2-daniel.vetter@ffwll.ch>

Quoting Daniel Vetter (2017-10-06 10:06:37)
> stop_machine is not really a locking primitive we should use, except
> when the hw folks tell us the hw is broken and that's the only way to
> work around it.
> 
> This patch tries to address the locking abuse of stop_machine() from
> 
> commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Nov 22 14:41:21 2016 +0000
> 
>     drm/i915: Stop the machine as we install the wedged submit_request handler
> 
> Chris said parts of the reasons for going with stop_machine() was that
> it's no overhead for the fast-path. But these callbacks use irqsave
> spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.

I still want a discussion of the reason why keeping the normal path clean
and why an alternative is sought, here. That design leads into vv

> To stay as close as possible to the stop_machine semantics we first
> update all the submit function pointers to the nop handler, then call
> synchronize_rcu() to make sure no new requests can be submitted. This
> should give us exactly the huge barrier we want.
> 
> I pondered whether we should annotate engine->submit_request as __rcu
> and use rcu_assign_pointer and rcu_dereference on it. But the reason
> behind those is to make sure the compiler/cpu barriers are there for
> when you have an actual data structure you point at, to make sure all
> the writes are seen correctly on the read side. But we just have a
> function pointer, and .text isn't changed, so no need for these
> barriers and hence no need for annotations.
> 
> This should fix the followwing lockdep splat:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
> ------------------------------------------------------
> kworker/3:4/562 is trying to acquire lock:
>  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40
> 
> but task is already holding lock:
>  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #6 (&dev->struct_mutex){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __mutex_lock+0x86/0x9b0
>        mutex_lock_interruptible_nested+0x1b/0x20
>        i915_mutex_lock_interruptible+0x51/0x130 [i915]
>        i915_gem_fault+0x209/0x650 [i915]
>        __do_fault+0x1e/0x80
>        __handle_mm_fault+0xa08/0xed0
>        handle_mm_fault+0x156/0x300
>        __do_page_fault+0x2c5/0x570
>        do_page_fault+0x28/0x250
>        page_fault+0x22/0x30
> 
> -> #5 (&mm->mmap_sem){++++}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __might_fault+0x68/0x90
>        _copy_to_user+0x23/0x70
>        filldir+0xa5/0x120
>        dcache_readdir+0xf9/0x170
>        iterate_dir+0x69/0x1a0
>        SyS_getdents+0xa5/0x140
>        entry_SYSCALL_64_fastpath+0x1c/0xb1
> 
> -> #4 (&sb->s_type->i_mutex_key#5){++++}:
>        down_write+0x3b/0x70
>        handle_create+0xcb/0x1e0
>        devtmpfsd+0x139/0x180
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
> 
> -> #3 ((complete)&req.done){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        wait_for_common+0x58/0x210
>        wait_for_completion+0x1d/0x20
>        devtmpfs_create_node+0x13d/0x160
>        device_add+0x5eb/0x620
>        device_create_groups_vargs+0xe0/0xf0
>        device_create+0x3a/0x40
>        msr_device_create+0x2b/0x40
>        cpuhp_invoke_callback+0xc9/0xbf0
>        cpuhp_thread_fun+0x17b/0x240
>        smpboot_thread_fn+0x18a/0x280
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
> 
> -> #2 (cpuhp_state-up){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        cpuhp_issue_call+0x133/0x1c0
>        __cpuhp_setup_state_cpuslocked+0x139/0x2a0
>        __cpuhp_setup_state+0x46/0x60
>        page_writeback_init+0x43/0x67
>        pagecache_init+0x3d/0x42
>        start_kernel+0x3a8/0x3fc
>        x86_64_start_reservations+0x2a/0x2c
>        x86_64_start_kernel+0x6d/0x70
>        verify_cpu+0x0/0xfb
> 
> -> #1 (cpuhp_state_mutex){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __mutex_lock+0x86/0x9b0
>        mutex_lock_nested+0x1b/0x20
>        __cpuhp_setup_state_cpuslocked+0x53/0x2a0
>        __cpuhp_setup_state+0x46/0x60
>        page_alloc_init+0x28/0x30
>        start_kernel+0x145/0x3fc
>        x86_64_start_reservations+0x2a/0x2c
>        x86_64_start_kernel+0x6d/0x70
>        verify_cpu+0x0/0xfb
> 
> -> #0 (cpu_hotplug_lock.rw_sem){++++}:
>        check_prev_add+0x430/0x840
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        cpus_read_lock+0x3d/0xb0
>        stop_machine+0x1c/0x40
>        i915_gem_set_wedged+0x1a/0x20 [i915]
>        i915_reset+0xb9/0x230 [i915]
>        i915_reset_device+0x1f6/0x260 [i915]
>        i915_handle_error+0x2d8/0x430 [i915]
>        hangcheck_declare_hang+0xd3/0xf0 [i915]
>        i915_hangcheck_elapsed+0x262/0x2d0 [i915]
>        process_one_work+0x233/0x660
>        worker_thread+0x4e/0x3b0
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&dev->struct_mutex);
>                                lock(&mm->mmap_sem);
>                                lock(&dev->struct_mutex);
>   lock(cpu_hotplug_lock.rw_sem);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kworker/3:4/562:
>  #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
>  #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
>  #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
> 
> stack backtrace:
> CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
> Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
> Workqueue: events_long i915_hangcheck_elapsed [i915]
> Call Trace:
>  dump_stack+0x68/0x9f
>  print_circular_bug+0x235/0x3c0
>  ? lockdep_init_map_crosslock+0x20/0x20
>  check_prev_add+0x430/0x840
>  ? irq_work_queue+0x86/0xe0
>  ? wake_up_klogd+0x53/0x70
>  __lock_acquire+0x1420/0x15e0
>  ? __lock_acquire+0x1420/0x15e0
>  ? lockdep_init_map_crosslock+0x20/0x20
>  lock_acquire+0xb0/0x200
>  ? stop_machine+0x1c/0x40
>  ? i915_gem_object_truncate+0x50/0x50 [i915]
>  cpus_read_lock+0x3d/0xb0
>  ? stop_machine+0x1c/0x40
>  stop_machine+0x1c/0x40
>  i915_gem_set_wedged+0x1a/0x20 [i915]
>  i915_reset+0xb9/0x230 [i915]
>  i915_reset_device+0x1f6/0x260 [i915]
>  ? gen8_gt_irq_ack+0x170/0x170 [i915]
>  ? work_on_cpu_safe+0x60/0x60
>  i915_handle_error+0x2d8/0x430 [i915]
>  ? vsnprintf+0xd1/0x4b0
>  ? scnprintf+0x3a/0x70
>  hangcheck_declare_hang+0xd3/0xf0 [i915]
>  ? intel_runtime_pm_put+0x56/0xa0 [i915]
>  i915_hangcheck_elapsed+0x262/0x2d0 [i915]
>  process_one_work+0x233/0x660
>  worker_thread+0x4e/0x3b0
>  kthread+0x152/0x190
>  ? process_one_work+0x660/0x660
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x27/0x40
> Setting dangerous option reset - tainting kernel
> i915 0000:00:02.0: Resetting chip after gpu hang
> Setting dangerous option reset - tainting kernel
> i915 0000:00:02.0: Resetting chip after gpu hang
> 
> v2: Have 1 global synchronize_rcu() barrier across all engines, and
> improve commit message.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Marta Lofstedt <marta.lofstedt@intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c                   | 31 +++++++++--------------
>  drivers/gpu/drm/i915/i915_gem_request.c           |  2 ++
>  drivers/gpu/drm/i915/selftests/i915_gem_request.c |  2 ++
>  3 files changed, 16 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index ab8c6946fea4..e79a6ca60265 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3020,16 +3020,8 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>         intel_engine_init_global_seqno(request->engine, request->global_seqno);
>  }
>  
> -static void engine_set_wedged(struct intel_engine_cs *engine)
> +static void engine_complete_requests(struct intel_engine_cs *engine)
>  {
> -       /* We need to be sure that no thread is running the old callback as
> -        * we install the nop handler (otherwise we would submit a request
> -        * to hardware that will never complete). In order to prevent this
> -        * race, we wait until the machine is idle before making the swap
> -        * (using stop_machine()).
> -        */
> -       engine->submit_request = nop_submit_request;
> -
>         /* Mark all executing requests as skipped */
>         engine->cancel_requests(engine);
>  
> @@ -3041,24 +3033,25 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>                                        intel_engine_last_submit(engine));
>  }
>  
> -static int __i915_gem_set_wedged_BKL(void *data)
> +void i915_gem_set_wedged(struct drm_i915_private *i915)
>  {
> -       struct drm_i915_private *i915 = data;
>         struct intel_engine_cs *engine;
>         enum intel_engine_id id;
>  
>         for_each_engine(engine, i915, id)
> -               engine_set_wedged(engine);
> +               engine->submit_request = nop_submit_request;
>  
> -       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> -       wake_up_all(&i915->gpu_error.reset_queue);
> +       /* Make sure no one is running the old callback before we proceed with
> +        * cancelling requests and resetting the completion tracking. Otherwise
> +        * we might submit a request to the hardware which never completes.
> +        */
> +       synchronize_rcu();
>  
> -       return 0;
> -}
> +       for_each_engine(engine, i915, id)
> +               engine_complete_requests(engine);
>  
> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> -{
> -       stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
> +       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> +       wake_up_all(&i915->gpu_error.reset_queue);
>  }
>  
>  bool i915_gem_unset_wedged(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index b100b38f1dd2..ef78a85cb845 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -556,7 +556,9 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>         switch (state) {
>         case FENCE_COMPLETE:
>                 trace_i915_gem_request_submit(request);
> +               rcu_read_lock();
>                 request->engine->submit_request(request);
> +               rcu_read_unlock();
>                 break;
>  
>         case FENCE_FREE:
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_request.c b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> index 78b9f811707f..a999161e8db1 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> @@ -215,7 +215,9 @@ static int igt_request_rewind(void *arg)
>         }
>         i915_gem_request_get(vip);
>         i915_add_request(vip);
> +       rcu_read_lock();
>         request->engine->submit_request(request);
> +       rcu_read_unlock();
>  
>         mutex_unlock(&i915->drm.struct_mutex);
>  
> -- 
> 2.14.1
> 

WARNING: multiple messages have this Message-ID (diff)
From: Chris Wilson <chris@chris-wilson.co.uk>
To: Intel Graphics Development <intel-gfx@lists.freedesktop.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	LKML <linux-kernel@vger.kernel.org>,
	Daniel Vetter <daniel.vetter@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mika Kuoppala <mika.kuoppala@intel.com>
Subject: Re: [PATCH 2/2] drm/i915: Use rcu instead of stop_machine in set_wedged
Date: Fri, 06 Oct 2017 10:17:58 +0100	[thread overview]
Message-ID: <150728147814.14660.6817347768155085898@mail.alporthouse.com> (raw)
In-Reply-To: <20171006090637.25545-2-daniel.vetter@ffwll.ch>

Quoting Daniel Vetter (2017-10-06 10:06:37)
> stop_machine is not really a locking primitive we should use, except
> when the hw folks tell us the hw is broken and that's the only way to
> work around it.
> 
> This patch tries to address the locking abuse of stop_machine() from
> 
> commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Nov 22 14:41:21 2016 +0000
> 
>     drm/i915: Stop the machine as we install the wedged submit_request handler
> 
> Chris said parts of the reasons for going with stop_machine() was that
> it's no overhead for the fast-path. But these callbacks use irqsave
> spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.

I still want a discussion of the reason why keeping the normal path clean
and why an alternative is sought, here. That design leads into vv

> To stay as close as possible to the stop_machine semantics we first
> update all the submit function pointers to the nop handler, then call
> synchronize_rcu() to make sure no new requests can be submitted. This
> should give us exactly the huge barrier we want.
> 
> I pondered whether we should annotate engine->submit_request as __rcu
> and use rcu_assign_pointer and rcu_dereference on it. But the reason
> behind those is to make sure the compiler/cpu barriers are there for
> when you have an actual data structure you point at, to make sure all
> the writes are seen correctly on the read side. But we just have a
> function pointer, and .text isn't changed, so no need for these
> barriers and hence no need for annotations.
> 
> This should fix the followwing lockdep splat:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
> ------------------------------------------------------
> kworker/3:4/562 is trying to acquire lock:
>  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40
> 
> but task is already holding lock:
>  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #6 (&dev->struct_mutex){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __mutex_lock+0x86/0x9b0
>        mutex_lock_interruptible_nested+0x1b/0x20
>        i915_mutex_lock_interruptible+0x51/0x130 [i915]
>        i915_gem_fault+0x209/0x650 [i915]
>        __do_fault+0x1e/0x80
>        __handle_mm_fault+0xa08/0xed0
>        handle_mm_fault+0x156/0x300
>        __do_page_fault+0x2c5/0x570
>        do_page_fault+0x28/0x250
>        page_fault+0x22/0x30
> 
> -> #5 (&mm->mmap_sem){++++}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __might_fault+0x68/0x90
>        _copy_to_user+0x23/0x70
>        filldir+0xa5/0x120
>        dcache_readdir+0xf9/0x170
>        iterate_dir+0x69/0x1a0
>        SyS_getdents+0xa5/0x140
>        entry_SYSCALL_64_fastpath+0x1c/0xb1
> 
> -> #4 (&sb->s_type->i_mutex_key#5){++++}:
>        down_write+0x3b/0x70
>        handle_create+0xcb/0x1e0
>        devtmpfsd+0x139/0x180
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
> 
> -> #3 ((complete)&req.done){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        wait_for_common+0x58/0x210
>        wait_for_completion+0x1d/0x20
>        devtmpfs_create_node+0x13d/0x160
>        device_add+0x5eb/0x620
>        device_create_groups_vargs+0xe0/0xf0
>        device_create+0x3a/0x40
>        msr_device_create+0x2b/0x40
>        cpuhp_invoke_callback+0xc9/0xbf0
>        cpuhp_thread_fun+0x17b/0x240
>        smpboot_thread_fn+0x18a/0x280
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
> 
> -> #2 (cpuhp_state-up){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        cpuhp_issue_call+0x133/0x1c0
>        __cpuhp_setup_state_cpuslocked+0x139/0x2a0
>        __cpuhp_setup_state+0x46/0x60
>        page_writeback_init+0x43/0x67
>        pagecache_init+0x3d/0x42
>        start_kernel+0x3a8/0x3fc
>        x86_64_start_reservations+0x2a/0x2c
>        x86_64_start_kernel+0x6d/0x70
>        verify_cpu+0x0/0xfb
> 
> -> #1 (cpuhp_state_mutex){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __mutex_lock+0x86/0x9b0
>        mutex_lock_nested+0x1b/0x20
>        __cpuhp_setup_state_cpuslocked+0x53/0x2a0
>        __cpuhp_setup_state+0x46/0x60
>        page_alloc_init+0x28/0x30
>        start_kernel+0x145/0x3fc
>        x86_64_start_reservations+0x2a/0x2c
>        x86_64_start_kernel+0x6d/0x70
>        verify_cpu+0x0/0xfb
> 
> -> #0 (cpu_hotplug_lock.rw_sem){++++}:
>        check_prev_add+0x430/0x840
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        cpus_read_lock+0x3d/0xb0
>        stop_machine+0x1c/0x40
>        i915_gem_set_wedged+0x1a/0x20 [i915]
>        i915_reset+0xb9/0x230 [i915]
>        i915_reset_device+0x1f6/0x260 [i915]
>        i915_handle_error+0x2d8/0x430 [i915]
>        hangcheck_declare_hang+0xd3/0xf0 [i915]
>        i915_hangcheck_elapsed+0x262/0x2d0 [i915]
>        process_one_work+0x233/0x660
>        worker_thread+0x4e/0x3b0
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&dev->struct_mutex);
>                                lock(&mm->mmap_sem);
>                                lock(&dev->struct_mutex);
>   lock(cpu_hotplug_lock.rw_sem);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kworker/3:4/562:
>  #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
>  #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
>  #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
> 
> stack backtrace:
> CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
> Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
> Workqueue: events_long i915_hangcheck_elapsed [i915]
> Call Trace:
>  dump_stack+0x68/0x9f
>  print_circular_bug+0x235/0x3c0
>  ? lockdep_init_map_crosslock+0x20/0x20
>  check_prev_add+0x430/0x840
>  ? irq_work_queue+0x86/0xe0
>  ? wake_up_klogd+0x53/0x70
>  __lock_acquire+0x1420/0x15e0
>  ? __lock_acquire+0x1420/0x15e0
>  ? lockdep_init_map_crosslock+0x20/0x20
>  lock_acquire+0xb0/0x200
>  ? stop_machine+0x1c/0x40
>  ? i915_gem_object_truncate+0x50/0x50 [i915]
>  cpus_read_lock+0x3d/0xb0
>  ? stop_machine+0x1c/0x40
>  stop_machine+0x1c/0x40
>  i915_gem_set_wedged+0x1a/0x20 [i915]
>  i915_reset+0xb9/0x230 [i915]
>  i915_reset_device+0x1f6/0x260 [i915]
>  ? gen8_gt_irq_ack+0x170/0x170 [i915]
>  ? work_on_cpu_safe+0x60/0x60
>  i915_handle_error+0x2d8/0x430 [i915]
>  ? vsnprintf+0xd1/0x4b0
>  ? scnprintf+0x3a/0x70
>  hangcheck_declare_hang+0xd3/0xf0 [i915]
>  ? intel_runtime_pm_put+0x56/0xa0 [i915]
>  i915_hangcheck_elapsed+0x262/0x2d0 [i915]
>  process_one_work+0x233/0x660
>  worker_thread+0x4e/0x3b0
>  kthread+0x152/0x190
>  ? process_one_work+0x660/0x660
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x27/0x40
> Setting dangerous option reset - tainting kernel
> i915 0000:00:02.0: Resetting chip after gpu hang
> Setting dangerous option reset - tainting kernel
> i915 0000:00:02.0: Resetting chip after gpu hang
> 
> v2: Have 1 global synchronize_rcu() barrier across all engines, and
> improve commit message.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Marta Lofstedt <marta.lofstedt@intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c                   | 31 +++++++++--------------
>  drivers/gpu/drm/i915/i915_gem_request.c           |  2 ++
>  drivers/gpu/drm/i915/selftests/i915_gem_request.c |  2 ++
>  3 files changed, 16 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index ab8c6946fea4..e79a6ca60265 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3020,16 +3020,8 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>         intel_engine_init_global_seqno(request->engine, request->global_seqno);
>  }
>  
> -static void engine_set_wedged(struct intel_engine_cs *engine)
> +static void engine_complete_requests(struct intel_engine_cs *engine)
>  {
> -       /* We need to be sure that no thread is running the old callback as
> -        * we install the nop handler (otherwise we would submit a request
> -        * to hardware that will never complete). In order to prevent this
> -        * race, we wait until the machine is idle before making the swap
> -        * (using stop_machine()).
> -        */
> -       engine->submit_request = nop_submit_request;
> -
>         /* Mark all executing requests as skipped */
>         engine->cancel_requests(engine);
>  
> @@ -3041,24 +3033,25 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
>                                        intel_engine_last_submit(engine));
>  }
>  
> -static int __i915_gem_set_wedged_BKL(void *data)
> +void i915_gem_set_wedged(struct drm_i915_private *i915)
>  {
> -       struct drm_i915_private *i915 = data;
>         struct intel_engine_cs *engine;
>         enum intel_engine_id id;
>  
>         for_each_engine(engine, i915, id)
> -               engine_set_wedged(engine);
> +               engine->submit_request = nop_submit_request;
>  
> -       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> -       wake_up_all(&i915->gpu_error.reset_queue);
> +       /* Make sure no one is running the old callback before we proceed with
> +        * cancelling requests and resetting the completion tracking. Otherwise
> +        * we might submit a request to the hardware which never completes.
> +        */
> +       synchronize_rcu();
>  
> -       return 0;
> -}
> +       for_each_engine(engine, i915, id)
> +               engine_complete_requests(engine);
>  
> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> -{
> -       stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
> +       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> +       wake_up_all(&i915->gpu_error.reset_queue);
>  }
>  
>  bool i915_gem_unset_wedged(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index b100b38f1dd2..ef78a85cb845 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -556,7 +556,9 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>         switch (state) {
>         case FENCE_COMPLETE:
>                 trace_i915_gem_request_submit(request);
> +               rcu_read_lock();
>                 request->engine->submit_request(request);
> +               rcu_read_unlock();
>                 break;
>  
>         case FENCE_FREE:
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_request.c b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> index 78b9f811707f..a999161e8db1 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> @@ -215,7 +215,9 @@ static int igt_request_rewind(void *arg)
>         }
>         i915_gem_request_get(vip);
>         i915_add_request(vip);
> +       rcu_read_lock();
>         request->engine->submit_request(request);
> +       rcu_read_unlock();
>  
>         mutex_unlock(&i915->drm.struct_mutex);
>  
> -- 
> 2.14.1
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2017-10-06  9:18 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-06  9:06 [PATCH 1/2] drm/i915: Preallocate our mmu notifier workequeu to unbreak cpu hotplug deadlock Daniel Vetter
2017-10-06  9:06 ` Daniel Vetter
2017-10-06  9:06 ` [PATCH 2/2] drm/i915: Use rcu instead of stop_machine in set_wedged Daniel Vetter
2017-10-06  9:06   ` Daniel Vetter
2017-10-06  9:17   ` Chris Wilson [this message]
2017-10-06  9:17     ` Chris Wilson
2017-10-06 10:12     ` Thomas Gleixner
2017-10-06 10:12       ` Thomas Gleixner
2017-10-06 11:12       ` Peter Zijlstra
2017-10-06 11:12         ` Peter Zijlstra
2017-10-06 14:12       ` Daniel Vetter
2017-10-06 11:03   ` Chris Wilson
2017-10-06 11:03     ` Chris Wilson
2017-10-06 14:20     ` Daniel Vetter
2017-10-06 17:29       ` Chris Wilson
2017-10-09  9:12         ` Daniel Vetter
2017-10-09  9:12           ` Daniel Vetter
2017-10-06 17:37       ` Chris Wilson
2017-10-09  9:26         ` Daniel Vetter
2017-10-09  9:26           ` Daniel Vetter
2017-10-06 11:10   ` Peter Zijlstra
2017-10-06  9:23 ` [PATCH 1/2] drm/i915: Preallocate our mmu notifier workequeu to unbreak cpu hotplug deadlock Chris Wilson
2017-10-06  9:23   ` Chris Wilson
2017-10-06  9:48 ` Chris Wilson
2017-10-06  9:48   ` Chris Wilson
2017-10-06 11:34 ` [Intel-gfx] " Tvrtko Ursulin
2017-10-06 11:34   ` Tvrtko Ursulin
2017-10-06 14:23   ` [Intel-gfx] " Daniel Vetter
2017-10-06 14:23     ` Daniel Vetter
2017-10-06 14:44     ` [Intel-gfx] " Tvrtko Ursulin
2017-10-06 12:50 ` ✗ Fi.CI.BAT: warning for series starting with [1/2] " Patchwork
2017-10-06 15:52 ` [PATCH] " Daniel Vetter
2017-10-06 16:07   ` Tvrtko Ursulin
2017-10-06 16:29 ` ✓ Fi.CI.BAT: success for series starting with drm/i915: Preallocate our mmu notifier workequeu to unbreak cpu hotplug deadlock (rev2) Patchwork
2017-10-06 22:20 ` ✗ Fi.CI.IGT: warning " Patchwork
2017-10-09 16:44 [PATCH 1/2] drm/i915: Preallocate our mmu notifier workequeu to unbreak cpu hotplug deadlock Daniel Vetter
2017-10-09 16:44 ` [PATCH 2/2] drm/i915: Use rcu instead of stop_machine in set_wedged Daniel Vetter
2017-10-10  9:21   ` Chris Wilson
2017-10-10 12:30     ` Daniel Vetter
2017-10-10 13:29     ` Chris Wilson
2017-10-10  9:50   ` Mika Kuoppala
2017-10-10 12:37     ` Daniel Vetter
2017-10-10 13:39   ` Chris Wilson
2017-10-10 14:09     ` Daniel Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=150728147814.14660.6817347768155085898@mail.alporthouse.com \
    --to=chris@chris-wilson.co.uk \
    --cc=daniel.vetter@ffwll.ch \
    --cc=daniel.vetter@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marta.lofstedt@intel.com \
    --cc=mika.kuoppala@intel.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.