All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@kernel.org>
To: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
Cc: "Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	amd-gfx@lists.freedesktop.org, rcu@vger.kernel.org,
	urezki@gmail.com, uladzislau.rezki@sony.com
Subject: Re: rcu_sched detected expedited stalls in amdgpu after suspend
Date: Mon, 27 Jun 2022 13:41:39 -0700	[thread overview]
Message-ID: <20220627204139.GL1790663@paulmck-ThinkPad-P17-Gen-1> (raw)
In-Reply-To: <1656357116.rhe0mufk6a.none@localhost>

On Mon, Jun 27, 2022 at 03:22:24PM -0400, Alex Xu (Hello71) wrote:
> Hi,
> 
> Since Linux 5.19-ish, I consistently get these types of errors when 
> resuming from S3:
> 
> [15652.909157] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9981 root: 0x800/.
> [15652.909162] rcu: blocking rcu_node structures (internal RCU debug):
> [15652.909163] Task dump for CPU 11:
> [15652.909164] task:kworker/u24:65  state:R  running task     stack:    0 pid:210218 ppid:     2 flags:0x00004008
> [15652.909167] Workqueue: events_unbound async_run_entry_fn
> [15652.909172] Call Trace:
> [15652.909173]  <TASK>
> [15652.909174]  ? atom_get_src_int+0x38e/0x680
> [15652.909179]  ? atom_op_test+0x67/0x190
> [15652.909181]  ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909184]  ? atom_op_calltable+0xb1/0x110
> [15652.909186]  ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909189]  ? atom_op_calltable+0xb1/0x110
> [15652.909191]  ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909193]  ? __switch_to+0x137/0x440
> [15652.909195]  ? amdgpu_atom_asic_init+0xe0/0x100
> [15652.909198]  ? pci_bus_read_config_dword+0x36/0x50
> [15652.909201]  ? amdgpu_device_resume+0x10b/0x3e0
> [15652.909203]  ? amdgpu_pmops_resume+0x32/0x60
> [15652.909204]  ? pci_pm_suspend+0x2b0/0x2b0
> [15652.909206]  ? dpm_run_callback+0x35/0x1f0
> [15652.909209]  ? device_resume+0x1ca/0x220
> [15652.909211]  ? async_resume+0x19/0xe0
> [15652.909213]  ? async_run_entry_fn+0x33/0x120
> [15652.909215]  ? process_one_work+0x1d6/0x350
> [15652.909218]  ? worker_thread+0x24d/0x480
> [15652.909220]  ? kthread+0x137/0x150
> [15652.909221]  ? worker_clr_flags+0x40/0x40
> [15652.909224]  ? kthread_blkcg+0x30/0x30
> [15652.909226]  ? ret_from_fork+0x22/0x30
> [15652.909227]  </TASK>
> [15653.015808] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9985 root: 0x800/.
> [15653.015812] rcu: blocking rcu_node structures (internal RCU debug):
> [15653.015813] Task dump for CPU 11:
> [15653.015813] task:kworker/u24:65  state:R  running task     stack:    0 pid:210218 ppid:     2 flags:0x00004008
> [15653.015816] Workqueue: events_unbound async_run_entry_fn
> [15653.015820] Call Trace:
> [15653.015820]  <TASK>
> [15653.015821]  ? amdgpu_cgs_read_register+0x10/0x10
> [15653.015825]  ? smu7_copy_bytes_to_smc+0xd4/0x200
> [15653.015828]  ? polaris10_program_memory_timing_parameters+0x195/0x1b0
> [15653.015831]  ? sysvec_apic_timer_interrupt+0xa/0x80
> [15653.015834]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [15653.015836]  ? amdgpu_cgs_destroy_device+0x10/0x10
> [15653.015839]  ? sysvec_apic_timer_interrupt+0xa/0x80
> [15653.015841]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [15653.015843]  ? amdgpu_cgs_destroy_device+0x10/0x10
> [15653.015846]  ? amdgpu_device_rreg+0x8f/0xd0
> [15653.015847]  ? phm_wait_for_register_unequal+0x99/0xd0
> [15653.015850]  ? smu7_send_msg_to_smc+0x95/0x130
> [15653.015853]  ? smum_send_msg_to_smc+0x5d/0xa0
> [15653.015854]  ? amdgpu_cgs_read_ind_register+0xa0/0xa0
> [15653.015857]  ? smu7_enable_dpm_tasks+0x241f/0x28c0
> [15653.015859]  ? hwmgr_resume+0x31/0x70
> [15653.015861]  ? amdgpu_device_resume+0x1fa/0x3e0
> [15653.015863]  ? amdgpu_pmops_resume+0x32/0x60
> [15653.015864]  ? pci_pm_suspend+0x2b0/0x2b0
> [15653.015866]  ? dpm_run_callback+0x35/0x1f0
> [15653.015868]  ? device_resume+0x1ca/0x220
> [15653.015870]  ? async_resume+0x19/0xe0
> [15653.015872]  ? async_run_entry_fn+0x33/0x120
> [15653.015874]  ? process_one_work+0x1d6/0x350
> [15653.015877]  ? worker_thread+0x24d/0x480
> [15653.015878]  ? kthread+0x137/0x150
> [15653.015880]  ? worker_clr_flags+0x40/0x40
> [15653.015882]  ? kthread_blkcg+0x30/0x30
> [15653.015884]  ? ret_from_fork+0x22/0x30
> [15653.015886]  </TASK>
> 
> I have not noticed any resulting problems. I am reporting this in the 
> hope that it is easy to fix the issue and remove the error messages 
> which may obscure some future problem.

The usual way this happens is for a task to be spinning.  In this case,
that might be due to excessive lock contention on async_lock within the
async_run_entry_fn() function.  Or perhaps the problem is in one of the
functions preceded by "?" above.

One way to debug this is to place trace_printk()s or similar in the
functions called out above to track it down.

I bet that the reason this showed up in v5.19 is this guy:

28b3ae426598 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT")

So do you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to some small
number of milliseconds?  If so, you can override this by adjusting the
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option or by booting with a
longer timeout via the rcupdate.rcu_exp_cpu_stall_timeout= kernel boot
parameter.

But if you are running on an Android platform, Uladzislau will be
interested in addressing the underlying issue, so I have added him on CC.

							Thanx, Paul

WARNING: multiple messages have this Message-ID (diff)
From: "Paul E. McKenney" <paulmck@kernel.org>
To: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>,
	amd-gfx@lists.freedesktop.org, rcu@vger.kernel.org,
	urezki@gmail.com, "Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	uladzislau.rezki@sony.com
Subject: Re: rcu_sched detected expedited stalls in amdgpu after suspend
Date: Mon, 27 Jun 2022 13:41:39 -0700	[thread overview]
Message-ID: <20220627204139.GL1790663@paulmck-ThinkPad-P17-Gen-1> (raw)
In-Reply-To: <1656357116.rhe0mufk6a.none@localhost>

On Mon, Jun 27, 2022 at 03:22:24PM -0400, Alex Xu (Hello71) wrote:
> Hi,
> 
> Since Linux 5.19-ish, I consistently get these types of errors when 
> resuming from S3:
> 
> [15652.909157] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9981 root: 0x800/.
> [15652.909162] rcu: blocking rcu_node structures (internal RCU debug):
> [15652.909163] Task dump for CPU 11:
> [15652.909164] task:kworker/u24:65  state:R  running task     stack:    0 pid:210218 ppid:     2 flags:0x00004008
> [15652.909167] Workqueue: events_unbound async_run_entry_fn
> [15652.909172] Call Trace:
> [15652.909173]  <TASK>
> [15652.909174]  ? atom_get_src_int+0x38e/0x680
> [15652.909179]  ? atom_op_test+0x67/0x190
> [15652.909181]  ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909184]  ? atom_op_calltable+0xb1/0x110
> [15652.909186]  ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909189]  ? atom_op_calltable+0xb1/0x110
> [15652.909191]  ? amdgpu_atom_execute_table_locked+0x19a/0x300
> [15652.909193]  ? __switch_to+0x137/0x440
> [15652.909195]  ? amdgpu_atom_asic_init+0xe0/0x100
> [15652.909198]  ? pci_bus_read_config_dword+0x36/0x50
> [15652.909201]  ? amdgpu_device_resume+0x10b/0x3e0
> [15652.909203]  ? amdgpu_pmops_resume+0x32/0x60
> [15652.909204]  ? pci_pm_suspend+0x2b0/0x2b0
> [15652.909206]  ? dpm_run_callback+0x35/0x1f0
> [15652.909209]  ? device_resume+0x1ca/0x220
> [15652.909211]  ? async_resume+0x19/0xe0
> [15652.909213]  ? async_run_entry_fn+0x33/0x120
> [15652.909215]  ? process_one_work+0x1d6/0x350
> [15652.909218]  ? worker_thread+0x24d/0x480
> [15652.909220]  ? kthread+0x137/0x150
> [15652.909221]  ? worker_clr_flags+0x40/0x40
> [15652.909224]  ? kthread_blkcg+0x30/0x30
> [15652.909226]  ? ret_from_fork+0x22/0x30
> [15652.909227]  </TASK>
> [15653.015808] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 11-... } 7 jiffies s: 9985 root: 0x800/.
> [15653.015812] rcu: blocking rcu_node structures (internal RCU debug):
> [15653.015813] Task dump for CPU 11:
> [15653.015813] task:kworker/u24:65  state:R  running task     stack:    0 pid:210218 ppid:     2 flags:0x00004008
> [15653.015816] Workqueue: events_unbound async_run_entry_fn
> [15653.015820] Call Trace:
> [15653.015820]  <TASK>
> [15653.015821]  ? amdgpu_cgs_read_register+0x10/0x10
> [15653.015825]  ? smu7_copy_bytes_to_smc+0xd4/0x200
> [15653.015828]  ? polaris10_program_memory_timing_parameters+0x195/0x1b0
> [15653.015831]  ? sysvec_apic_timer_interrupt+0xa/0x80
> [15653.015834]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [15653.015836]  ? amdgpu_cgs_destroy_device+0x10/0x10
> [15653.015839]  ? sysvec_apic_timer_interrupt+0xa/0x80
> [15653.015841]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [15653.015843]  ? amdgpu_cgs_destroy_device+0x10/0x10
> [15653.015846]  ? amdgpu_device_rreg+0x8f/0xd0
> [15653.015847]  ? phm_wait_for_register_unequal+0x99/0xd0
> [15653.015850]  ? smu7_send_msg_to_smc+0x95/0x130
> [15653.015853]  ? smum_send_msg_to_smc+0x5d/0xa0
> [15653.015854]  ? amdgpu_cgs_read_ind_register+0xa0/0xa0
> [15653.015857]  ? smu7_enable_dpm_tasks+0x241f/0x28c0
> [15653.015859]  ? hwmgr_resume+0x31/0x70
> [15653.015861]  ? amdgpu_device_resume+0x1fa/0x3e0
> [15653.015863]  ? amdgpu_pmops_resume+0x32/0x60
> [15653.015864]  ? pci_pm_suspend+0x2b0/0x2b0
> [15653.015866]  ? dpm_run_callback+0x35/0x1f0
> [15653.015868]  ? device_resume+0x1ca/0x220
> [15653.015870]  ? async_resume+0x19/0xe0
> [15653.015872]  ? async_run_entry_fn+0x33/0x120
> [15653.015874]  ? process_one_work+0x1d6/0x350
> [15653.015877]  ? worker_thread+0x24d/0x480
> [15653.015878]  ? kthread+0x137/0x150
> [15653.015880]  ? worker_clr_flags+0x40/0x40
> [15653.015882]  ? kthread_blkcg+0x30/0x30
> [15653.015884]  ? ret_from_fork+0x22/0x30
> [15653.015886]  </TASK>
> 
> I have not noticed any resulting problems. I am reporting this in the 
> hope that it is easy to fix the issue and remove the error messages 
> which may obscure some future problem.

The usual way this happens is for a task to be spinning.  In this case,
that might be due to excessive lock contention on async_lock within the
async_run_entry_fn() function.  Or perhaps the problem is in one of the
functions preceded by "?" above.

One way to debug this is to place trace_printk()s or similar in the
functions called out above to track it down.

I bet that the reason this showed up in v5.19 is this guy:

28b3ae426598 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT")

So do you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to some small
number of milliseconds?  If so, you can override this by adjusting the
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option or by booting with a
longer timeout via the rcupdate.rcu_exp_cpu_stall_timeout= kernel boot
parameter.

But if you are running on an Android platform, Uladzislau will be
interested in addressing the underlying issue, so I have added him on CC.

							Thanx, Paul

  reply	other threads:[~2022-06-27 20:41 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1656357116.rhe0mufk6a.none.ref@localhost>
2022-06-27 19:22 ` rcu_sched detected expedited stalls in amdgpu after suspend Alex Xu (Hello71)
2022-06-27 20:41   ` Paul E. McKenney [this message]
2022-06-27 20:41     ` Paul E. McKenney
     [not found]     ` <1656379893.q9yb069erk.none@localhost>
     [not found]       ` <20220628041252.GV1790663@paulmck-ThinkPad-P17-Gen-1>
2022-06-28 15:02         ` CONFIG_ANDROID (was: rcu_sched detected expedited stalls in amdgpu after suspend) Alex Xu (Hello71)
2022-06-28 15:02           ` Alex Xu (Hello71)
2022-06-28 15:13           ` Jason A. Donenfeld
2022-06-28 15:13             ` Jason A. Donenfeld
2022-06-28 18:54           ` Paul E. McKenney
2022-06-28 18:54             ` Paul E. McKenney
2022-06-28 19:28             ` Alex Xu (Hello71)
2022-06-28 19:28               ` Alex Xu (Hello71)
2022-06-28 20:11               ` Uladzislau Rezki
2022-06-28 20:11                 ` Uladzislau Rezki
2022-07-04 11:30                 ` Christian König
2022-07-04 11:30                   ` Christian König
2022-07-06 17:48                   ` Uladzislau Rezki
2022-07-06 17:48                     ` Uladzislau Rezki
2022-07-06 17:58                     ` Paul E. McKenney
2022-07-06 17:58                       ` Paul E. McKenney
2022-07-06 18:09                       ` Uladzislau Rezki
2022-07-06 18:09                         ` Uladzislau Rezki
2022-07-06 20:42                         ` Paul E. McKenney
2022-07-06 20:42                           ` Paul E. McKenney
2022-07-07  7:30                           ` Christian König
2022-07-07  7:30                             ` Christian König
2022-07-07 13:29                             ` Paul E. McKenney
2022-07-07 13:29                               ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220627204139.GL1790663@paulmck-ThinkPad-P17-Gen-1 \
    --to=paulmck@kernel.org \
    --cc=Xinhui.Pan@amd.com \
    --cc=alex_y_xu@yahoo.ca \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=rcu@vger.kernel.org \
    --cc=uladzislau.rezki@sony.com \
    --cc=urezki@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.