* Re: power-off delay/hang due to commit 6d25be57 (mainline) [not found] ` <87wo4dligz.fsf@gmx.net> @ 2020-06-12 11:01 ` Sebastian Andrzej Siewior 2020-06-14 12:12 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-12 11:01 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi + ACPI in case the ACPI folks see something obvious. On 2020-06-11 17:39:40 [+0200], Stephen Berman wrote: > > I've done that now. I've sent you screenshots offlist. Here's a brief > summary: The initial shutdown log output is essentially the same as the > transcription I posted upthread, except that it is interspersed with > messages from your patch like "Flush hid_reset+0x0/0x80". Then comes: > > ACPI: Preparing to enter system sleep state S5 > Flushing kacpid > Flushing kacpid_notify based on the screenshots and the backtrace you are stuck in is: - __do_sys_reboot() - acpi_power_off_prepare() - acpi_os_wait_events_complete() - flush_workqueue(kacpi_notify_wq); > Flushing kacpid_notify over, 0 > Showing busy workqueues and worker pools: > workqueue events: flags=0x0 > pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 > pending: drm_fb_helper_diry_work > workqueue kacpid_notify: flags=0x0 > pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 refcnt=4185 > in-flight: 1578:acpi_os_execute_deferred > delayed: acpi_os_execute_deferred, acpi_os_execute_deferred, ... based on this, the kacpid_notify workqueue has quite some worker piled up. > The "acpi_os_execute_deferred" messages were repeated many times in the > above line, then every 20-30 seconds again for several minutes. Then > suddenly a call trace appeared which was similar but not identical to > the one I posted upthread, and each line of the trace was followed by > the line ", acpi_os_execute_deferred". This went by quite quickly even > with the printk_delay I added, and I was unable to photograph the start > of it and couldn't get all of the subsequent output, but the screenshots > show some of it. After several minutes of this output I again did a > hard reboot. All good. I though that you will have one worker that is blocking but you have a lot of them. It appears that one is active and stuck and more are waiting. Could you look at acpi in /proc/interrupts 10 secs apart to see if it increments? grep -E 'acpi|smbus' /proc/interrupts You could also do "echo t > /proc/sysrq-trigger" which gives you a lot of task state information, but at the end you will also see "howing busy workqueues and worker pools:" regarding the workqueue state. I am curious to see if you already have worker stuck in kacpid_notify with acpi_os_execute_deferred. Now that we know that know that acpi_os_execute_deferred() is stuck, lets shed some light in what it is trying to do. The patch at the end will dump this information into the console buffer (The `dmesg' command will print the whole kernel buffer). I have no idea if this starts printing while the system is running or during shutdown. I would expect to see the Start line in acpi_os_execute_deferred() but not the End one. diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 41168c027a5a4..0e983c558bcb5 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -840,7 +840,9 @@ static void acpi_os_execute_deferred(struct work_struct *work) { struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); + pr_err("%s(%d) Start %px %pF(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); dpc->function(dpc->context); + pr_err("%s(%d) End %px %pF(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); kfree(dpc); } @@ -1096,6 +1098,8 @@ acpi_status acpi_os_execute(acpi_execute_type type, */ if (type == OSL_NOTIFY_HANDLER) { queue = kacpi_notify_wq; + pr_err("%s(%d) Adding %pS %px\n", __func__, __LINE__, function, + context); INIT_WORK(&dpc->work, acpi_os_execute_deferred); } else if (type == OSL_GPE_HANDLER) { queue = kacpid_wq; > Steve Berman Sebastian ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-12 11:01 ` power-off delay/hang due to commit 6d25be57 (mainline) Sebastian Andrzej Siewior @ 2020-06-14 12:12 ` Stephen Berman 2020-06-14 17:10 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-14 12:12 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Fri, 12 Jun 2020 13:01:22 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > + ACPI in case the ACPI folks see something obvious. [...] >> The "acpi_os_execute_deferred" messages were repeated many times in the >> above line, then every 20-30 seconds again for several minutes. Then >> suddenly a call trace appeared which was similar but not identical to >> the one I posted upthread, and each line of the trace was followed by >> the line ", acpi_os_execute_deferred". This went by quite quickly even >> with the printk_delay I added, and I was unable to photograph the start >> of it and couldn't get all of the subsequent output, but the screenshots >> show some of it. After several minutes of this output I again did a >> hard reboot. > > All good. I though that you will have one worker that is blocking but > you have a lot of them. It appears that one is active and stuck and more > are waiting. > > Could you look at acpi in /proc/interrupts 10 secs apart to see if it > increments? > > grep -E 'acpi|smbus' /proc/interrupts I tried this several times, 10 seconds apart and longer, but saw no change, it was always this: steve [ ~ ]$ grep -E 'acpi|smbus' /proc/interrupts 9: 0 5 0 0 0 0 0 0 0 0 0 0 IO-APIC 9-fasteoi acpi 16: 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC 16-fasteoi i801_smbus > You could also do "echo t > /proc/sysrq-trigger" which gives you a lot > of task state information, but at the end you will also see "howing busy > workqueues and worker pools:" regarding the workqueue state. I am > curious to see if you already have worker stuck in kacpid_notify with > acpi_os_execute_deferred. What am I supposed to do after "echo t > /proc/sysrq-trigger"? Both before and after doing that I get an error trying to open the file: root [ ~ ]# cat /proc/sysrq-trigger cat: /proc/sysrq-trigger: Input/output error > Now that we know that know that acpi_os_execute_deferred() is stuck, > lets shed some light in what it is trying to do. The patch at the end > will dump this information into the console buffer (The `dmesg' command > will print the whole kernel buffer). I have no idea if this starts > printing while the system is running or during shutdown. I would expect > to see the Start line in acpi_os_execute_deferred() but not the End one. > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > index 41168c027a5a4..0e983c558bcb5 100644 > --- a/drivers/acpi/osl.c > +++ b/drivers/acpi/osl.c > @@ -840,7 +840,9 @@ static void acpi_os_execute_deferred(struct work_struct *work) > { > struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); > > + pr_err("%s(%d) Start %px %pF(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); > dpc->function(dpc->context); > + pr_err("%s(%d) End %px %pF(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); > kfree(dpc); > } > > @@ -1096,6 +1098,8 @@ acpi_status acpi_os_execute(acpi_execute_type type, > */ > if (type == OSL_NOTIFY_HANDLER) { > queue = kacpi_notify_wq; > + pr_err("%s(%d) Adding %pS %px\n", __func__, __LINE__, function, > + context); > INIT_WORK(&dpc->work, acpi_os_execute_deferred); > } else if (type == OSL_GPE_HANDLER) { > queue = kacpid_wq; I applied this patch to 5.6.4 and recompiled, and on the next boot with that kernel, the kernel buffer (and kernel and system logs) began to get flooded with these messages: Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.943987] acpi_os_execute_deferred(843) Start ffff8fb82c7b6500 000000003edf1e05(ffff8fb82c492990) Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944102] acpi_os_execute(1101) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8fb82c492990 Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944104] acpi_os_execute_deferred(845) End ffff8fb82c7b6500 000000003edf1e05(ffff8fb82c492990) Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944105] acpi_os_execute_deferred(843) Start ffff8fb82b844800 000000002ba560ea(ffff8fb82c492990) Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944124] acpi_os_execute_deferred(845) End ffff8fb82b844800 000000002ba560ea(ffff8fb82c492990) Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944288] acpi_os_execute_deferred(843) Start ffff8fb82c7b6540 000000003edf1e05(ffff8fb82c492990) Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944387] acpi_os_execute(1101) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8fb82c492990 and so on without stopping. I could start X and avoid seeing the messages, but was afraid the logs would fill up the root partition if I let it keep going, so I rebooted with another kernel. Was this message flood because I booted with "ignore_loglevel initcall_debug"? In the logs there are also lots of messages like this: Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237628] initcall i915_init+0x0/0x66 returned 0 after 288657 usecs Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237644] calling cn_proc_init+0x0/0x35 @ 1 Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237646] initcall cn_proc_init+0x0/0x35 returned 0 after 0 usecs Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237648] calling _nvm_misc_init+0x0/0xc @ 1 Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237687] initcall _nvm_misc_init+0x0/0xc returned 0 after 36 usecs Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237690] calling topology_sysfs_init+0x0/0x30 @ 1 Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237720] initcall topology_sysfs_init+0x0/0x30 returned 0 after 28 usecs and so on. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-14 12:12 ` Stephen Berman @ 2020-06-14 17:10 ` Sebastian Andrzej Siewior 2020-06-15 7:58 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-14 17:10 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi [-- Attachment #1: Type: text/plain, Size: 6288 bytes --] On 2020-06-14 14:12:18 [+0200], Stephen Berman wrote: > On Fri, 12 Jun 2020 13:01:22 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > steve [ ~ ]$ grep -E 'acpi|smbus' /proc/interrupts > 9: 0 5 0 0 0 0 0 0 0 0 0 0 IO-APIC 9-fasteoi acpi > 16: 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC 16-fasteoi i801_smbus okay, so it is not increasing at runtime. > > You could also do "echo t > /proc/sysrq-trigger" which gives you a lot > > of task state information, but at the end you will also see "howing busy > > workqueues and worker pools:" regarding the workqueue state. I am > > curious to see if you already have worker stuck in kacpid_notify with > > acpi_os_execute_deferred. > > What am I supposed to do after "echo t > /proc/sysrq-trigger"? Both > before and after doing that I get an error trying to open the file: > > root [ ~ ]# cat /proc/sysrq-trigger > cat: /proc/sysrq-trigger: Input/output error echo "t > /proc/sysrq-trigger" not cat. > > Now that we know that know that acpi_os_execute_deferred() is stuck, > > lets shed some light in what it is trying to do. The patch at the end > > will dump this information into the console buffer (The `dmesg' command > > will print the whole kernel buffer). I have no idea if this starts > > printing while the system is running or during shutdown. I would expect > > to see the Start line in acpi_os_execute_deferred() but not the End one. > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > > index 41168c027a5a4..0e983c558bcb5 100644 > > --- a/drivers/acpi/osl.c > > +++ b/drivers/acpi/osl.c > > @@ -840,7 +840,9 @@ static void acpi_os_execute_deferred(struct work_struct *work) > > { > > struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); > > > > + pr_err("%s(%d) Start %px %pF(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); > > dpc->function(dpc->context); > > + pr_err("%s(%d) End %px %pF(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); > > kfree(dpc); > > } argh, this should have been %pS not %pF. Sorry for that. > > @@ -1096,6 +1098,8 @@ acpi_status acpi_os_execute(acpi_execute_type type, > > */ > > if (type == OSL_NOTIFY_HANDLER) { > > queue = kacpi_notify_wq; > > + pr_err("%s(%d) Adding %pS %px\n", __func__, __LINE__, function, > > + context); > > INIT_WORK(&dpc->work, acpi_os_execute_deferred); > > } else if (type == OSL_GPE_HANDLER) { > > queue = kacpid_wq; > > I applied this patch to 5.6.4 and recompiled, and on the next boot with > that kernel, the kernel buffer (and kernel and system logs) began to get > flooded with these messages: > > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.943987] acpi_os_execute_deferred(843) Start ffff8fb82c7b6500 000000003edf1e05(ffff8fb82c492990) > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944102] acpi_os_execute(1101) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8fb82c492990 > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944104] acpi_os_execute_deferred(845) End ffff8fb82c7b6500 000000003edf1e05(ffff8fb82c492990) > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944105] acpi_os_execute_deferred(843) Start ffff8fb82b844800 000000002ba560ea(ffff8fb82c492990) > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944124] acpi_os_execute_deferred(845) End ffff8fb82b844800 000000002ba560ea(ffff8fb82c492990) > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944288] acpi_os_execute_deferred(843) Start ffff8fb82c7b6540 000000003edf1e05(ffff8fb82c492990) > Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944387] acpi_os_execute(1101) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8fb82c492990 > > and so on without stopping. I could start X and avoid seeing the > messages, but was afraid the logs would fill up the root partition if I > let it keep going, so I rebooted with another kernel. > > Was this message flood because I booted with "ignore_loglevel > initcall_debug"? In the logs there are also lots of messages like this: Is there a acpi_os_execute_* flood? The *few* at what appears to system startup and might be normal. If there appear *many* more and are constantly printing (check with dmesg) then we might be to something. > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237628] initcall i915_init+0x0/0x66 returned 0 after 288657 usecs > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237644] calling cn_proc_init+0x0/0x35 @ 1 > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237646] initcall cn_proc_init+0x0/0x35 returned 0 after 0 usecs > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237648] calling _nvm_misc_init+0x0/0xc @ 1 > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237687] initcall _nvm_misc_init+0x0/0xc returned 0 after 36 usecs > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237690] calling topology_sysfs_init+0x0/0x30 @ 1 > Jun 14 10:37:13 strobe-jhalfs kernel: [ 6.237720] initcall topology_sysfs_init+0x0/0x30 returned 0 after 28 usecs > > and so on. You have initcalls here which is due to "initcall_debug". The i915* message means that the i915 module was loaded. That "initcall_debug" prints you starts/stops of modules that are loaded (built-in or loaded modules) and shutdown callbacks which are invoked at system shutdown. The "ignore_loglevel" shows prints all messages to the console ignoring the current loglevel. Otherwise it would skip messages with a "minor" loglevel. With this we were able to see the shutdown callbacks (of the ahci module for instance). I attached the updated acpi patch. It limits the prints to the kacpi_notify_wq queue which appears to stuck at shutdown. The interesting part is to see if there is a acpi_os_execute() adding a specific event multiple times which does not complete. Maybe at runtime, maybe at shutdown time. If that is the case then ignoring this specific event might fix the shutdown problem. With all this information so far, I don't see a relation with this problem and the commit… > Steve Berman Sebastian [-- Attachment #2: acpi_dbg.patch --] [-- Type: text/x-diff, Size: 1216 bytes --] diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 762c5d50b8fe..a08789fb330e 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -844,6 +844,16 @@ static void acpi_os_execute_deferred(struct work_struct *work) kfree(dpc); } +static void acpi_os_execute_deferred_notify(struct work_struct *work) +{ + struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); + + pr_err("%s(%d) Start %px %pS(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); + dpc->function(dpc->context); + pr_err("%s(%d) End %px %pS(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); + kfree(dpc); +} + #ifdef CONFIG_ACPI_DEBUGGER static struct acpi_debugger acpi_debugger; static bool acpi_debugger_initialized; @@ -1096,7 +1106,9 @@ acpi_status acpi_os_execute(acpi_execute_type type, */ if (type == OSL_NOTIFY_HANDLER) { queue = kacpi_notify_wq; - INIT_WORK(&dpc->work, acpi_os_execute_deferred); + pr_err("%s(%d) Adding %pS %px <%px>\n", __func__, __LINE__, function, + context, dpc); + INIT_WORK(&dpc->work, acpi_os_execute_deferred_notify); } else if (type == OSL_GPE_HANDLER) { queue = kacpid_wq; INIT_WORK(&dpc->work, acpi_os_execute_deferred); [-- Attachment #3: acpi_dbg.patch --] [-- Type: text/x-diff, Size: 1216 bytes --] diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 762c5d50b8fe..a08789fb330e 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -844,6 +844,16 @@ static void acpi_os_execute_deferred(struct work_struct *work) kfree(dpc); } +static void acpi_os_execute_deferred_notify(struct work_struct *work) +{ + struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); + + pr_err("%s(%d) Start %px %pS(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); + dpc->function(dpc->context); + pr_err("%s(%d) End %px %pS(%px)\n", __func__, __LINE__, dpc, dpc->function, dpc->context); + kfree(dpc); +} + #ifdef CONFIG_ACPI_DEBUGGER static struct acpi_debugger acpi_debugger; static bool acpi_debugger_initialized; @@ -1096,7 +1106,9 @@ acpi_status acpi_os_execute(acpi_execute_type type, */ if (type == OSL_NOTIFY_HANDLER) { queue = kacpi_notify_wq; - INIT_WORK(&dpc->work, acpi_os_execute_deferred); + pr_err("%s(%d) Adding %pS %px <%px>\n", __func__, __LINE__, function, + context, dpc); + INIT_WORK(&dpc->work, acpi_os_execute_deferred_notify); } else if (type == OSL_GPE_HANDLER) { queue = kacpid_wq; INIT_WORK(&dpc->work, acpi_os_execute_deferred); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-14 17:10 ` Sebastian Andrzej Siewior @ 2020-06-15 7:58 ` Stephen Berman 2020-06-15 14:51 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-15 7:58 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Sun, 14 Jun 2020 19:10:05 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-14 14:12:18 [+0200], Stephen Berman wrote: [...] >> What am I supposed to do after "echo t > /proc/sysrq-trigger"? Both >> before and after doing that I get an error trying to open the file: >> >> root [ ~ ]# cat /proc/sysrq-trigger >> cat: /proc/sysrq-trigger: Input/output error > > echo "t > /proc/sysrq-trigger" > > not cat. Ok, sorry, I had misunderstood, but now I've looked at the documentation. I had in fact already done `echo t > /proc/sysrq-trigger' in an xterm (as root) and there was no output. Later, after booting kernel 5.1.0 because of the message flooding with 5.6.4, I did `echo t > /proc/sysrq-trigger' in a virtual tty (also as root) and the only output was: `[ <timestamp>] sysrq: Show State'. Is this expected? (In /proc/sys/kernel/sysrq there is `1'.) I couldn't try it in a 5.6.4 virtual tty because of the message flooding (see below). [...] >> I applied this patch to 5.6.4 and recompiled, and on the next boot with >> that kernel, the kernel buffer (and kernel and system logs) began to get >> flooded with these messages: >> >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.943987] >> acpi_os_execute_deferred(843) Start ffff8fb82c7b6500 >> 000000003edf1e05(ffff8fb82c492990) >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944102] acpi_os_execute(1101) >> Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8fb82c492990 >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944104] >> acpi_os_execute_deferred(845) End ffff8fb82c7b6500 >> 000000003edf1e05(ffff8fb82c492990) >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944105] >> acpi_os_execute_deferred(843) Start ffff8fb82b844800 >> 000000002ba560ea(ffff8fb82c492990) >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944124] >> acpi_os_execute_deferred(845) End ffff8fb82b844800 >> 000000002ba560ea(ffff8fb82c492990) >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944288] >> acpi_os_execute_deferred(843) Start ffff8fb82c7b6540 >> 000000003edf1e05(ffff8fb82c492990) >> Jun 14 10:37:13 strobe-jhalfs kernel: [ 5.944387] acpi_os_execute(1101) >> Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8fb82c492990 >> >> and so on without stopping. I could start X and avoid seeing the >> messages, but was afraid the logs would fill up the root partition if I >> let it keep going, so I rebooted with another kernel. >> >> Was this message flood because I booted with "ignore_loglevel >> initcall_debug"? In the logs there are also lots of messages like this: > > Is there a acpi_os_execute_* flood? The *few* at what appears to system > startup and might be normal. If there appear *many* more and are > constantly printing (check with dmesg) then we might be to something. The kernel log shows 305 of these messages in the 4 minutes and 17 seconds between the start of klogd and when I rebooted. [...] > I attached the updated acpi patch. It limits the prints to the > kacpi_notify_wq queue which appears to stuck at shutdown. I applied this patch and rebuilt the kernel. After booting 5.6.4 this morning there was again message flooding and this time I was unable to log in because of it, so had to do a hard reboot and booted with 5.1.0. The kernel log shows 207 acpi_os_execute_deferred_notify messages in 3 minutes and 6 seconds; here are the first: Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.943848] acpi_os_execute(1109) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8d7aaba18990 <ffff8d7aab157fc0> Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.943851] acpi_os_execute_deferred_notify(851) Start ffff8d7aab157fc0 acpi_ev_asynch_enable_gpe+0x0/0x2f(ffff8d7aaba18990) Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.943865] acpi_os_execute_deferred_notify(853) End ffff8d7aab157fc0 acpi_ev_asynch_enable_gpe+0x0/0x2f(ffff8d7aaba18990) Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.944104] acpi_os_execute(1109) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8d7aaba18990 <ffff8d7aab157fc0> Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.944106] acpi_os_execute_deferred_notify(851) Start ffff8d7aab157fc0 acpi_ev_asynch_enable_gpe+0x0/0x2f(ffff8d7aaba18990) Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.944115] acpi_os_execute_deferred_notify(853) End ffff8d7aab157fc0 acpi_ev_asynch_enable_gpe+0x0/0x2f(ffff8d7aaba18990) Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.944406] acpi_os_execute(1109) Adding acpi_ev_asynch_enable_gpe+0x0/0x2f ffff8d7aaba18990 <ffff8d7aab157fc0> Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.944408] acpi_os_execute_deferred_notify(851) Start ffff8d7aab157fc0 acpi_ev_asynch_enable_gpe+0x0/0x2f(ffff8d7aaba18990) Jun 15 08:55:22 strobe-jhalfs kernel: [ 5.944417] acpi_os_execute_deferred_notify(853) End ffff8d7aab157fc0 acpi_ev_asynch_enable_gpe+0x0/0x2f(ffff8d7aaba18990) > The interesting part is to see if there is a acpi_os_execute() adding a > specific event multiple times which does not complete. Maybe at runtime, > maybe at shutdown time. If that is the case then ignoring this specific > event might fix the shutdown problem. With all this information so far, > I don't see a relation with this problem and the commit… In those 3 minutes and 8 seconds there were 5 "Adding acpi_ev_asynch_enable_gpe" messages at the beginning, then 185 "Adding acpi_ev_notify_dispatch" messages, which kept coming until I rebooted. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-15 7:58 ` Stephen Berman @ 2020-06-15 14:51 ` Sebastian Andrzej Siewior 2020-06-15 15:41 ` Stephen Berman 2020-06-16 7:14 ` Stephen Berman 0 siblings, 2 replies; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-15 14:51 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi [-- Attachment #1: Type: text/plain, Size: 2395 bytes --] On 2020-06-15 09:58:00 [+0200], Stephen Berman wrote: > Ok, sorry, I had misunderstood, but now I've looked at the > documentation. I had in fact already done `echo t > > /proc/sysrq-trigger' in an xterm (as root) and there was no output. > Later, after booting kernel 5.1.0 because of the message flooding with > 5.6.4, I did `echo t > /proc/sysrq-trigger' in a virtual tty (also as > root) and the only output was: `[ <timestamp>] sysrq: Show State'. Is > this expected? (In /proc/sys/kernel/sysrq there is `1'.) I couldn't > try it in a 5.6.4 virtual tty because of the message flooding (see > below). If you do this "t" then there should be a lot of output on your console. If you do this from an xterm then you can see the output after typing "dmesg". The output should appear also in your system log. > The kernel log shows 305 of these messages in the 4 minutes and 17 > seconds between the start of klogd and when I rebooted. Okay, this is a lot. > > The interesting part is to see if there is a acpi_os_execute() adding a > > specific event multiple times which does not complete. Maybe at runtime, > > maybe at shutdown time. If that is the case then ignoring this specific > > event might fix the shutdown problem. With all this information so far, > > I don't see a relation with this problem and the commit… > > In those 3 minutes and 8 seconds there were 5 "Adding > acpi_ev_asynch_enable_gpe" messages at the beginning, then 185 "Adding > acpi_ev_notify_dispatch" messages, which kept coming until I rebooted. 5 acpi_ev_asynch_enable_gpe() in three minutes since boot isn't much. 185 of "Adding acpi_ev_notify_dispatch" is a lot. But to be clear, only acpi_os_execute(1109) Adding acpi_ev_notify_dispatch" messages? No "acpi_os_execute_deferred_notify() Start/End acpi_ev_notify_dispatch"? The syslog should have captured the log on disk. I attached a modified acpi_dbg.patch. Please enable: - CONFIG_ACPI_DEBUG=y Looking at your 5.1 you have tracing enabled (hope it still is). The attached patch will dump the date into the tracing buffer, so you console should remain "clean". Once it records 300 of those "requests for acpi_ev_notify_dispatch" it will stop recording. After 4-5 minutes please do cat /sys/kernel/debug/tracing/trace > file.txt compress it and send it. > Steve Berman Sebastian [-- Attachment #2: acpi_dbg.patch --] [-- Type: text/x-diff, Size: 3543 bytes --] diff --git a/drivers/acpi/acpica/evgpe.c b/drivers/acpi/acpica/evgpe.c index 3e39907fedd9..816e23d103c6 100644 --- a/drivers/acpi/acpica/evgpe.c +++ b/drivers/acpi/acpica/evgpe.c @@ -477,6 +477,7 @@ static void ACPI_SYSTEM_XFACE acpi_ev_asynch_execute_gpe_method(void *context) * notifies on multiple device objects. */ notify = gpe_event_info->dispatch.notify_list; + trace_printk("notify %px\n", gpe_event_info); while (ACPI_SUCCESS(status) && notify) { status = acpi_ev_queue_notify_request(notify->device_node, @@ -484,6 +485,7 @@ static void ACPI_SYSTEM_XFACE acpi_ev_asynch_execute_gpe_method(void *context) notify = notify->next; } + trace_printk("Event notify %px done %x\n", gpe_event_info, status); break; @@ -491,6 +493,9 @@ static void ACPI_SYSTEM_XFACE acpi_ev_asynch_execute_gpe_method(void *context) /* Allocate the evaluation information block */ + trace_printk("Evaluate %4.4s\n", + acpi_ut_get_node_name(gpe_event_info->dispatch.method_node)); + info = ACPI_ALLOCATE_ZEROED(sizeof(struct acpi_evaluate_info)); if (!info) { status = AE_NO_MEMORY; diff --git a/drivers/acpi/acpica/evmisc.c b/drivers/acpi/acpica/evmisc.c index ce1eda6beb84..735057419a8e 100644 --- a/drivers/acpi/acpica/evmisc.c +++ b/drivers/acpi/acpica/evmisc.c @@ -72,6 +72,7 @@ acpi_ev_queue_notify_request(struct acpi_namespace_node *node, u32 notify_value) union acpi_generic_state *info; u8 handler_list_id = 0; acpi_status status = AE_OK; + static int num_req = 0; ACPI_FUNCTION_NAME(ev_queue_notify_request); @@ -129,19 +130,19 @@ acpi_ev_queue_notify_request(struct acpi_namespace_node *node, u32 notify_value) info->notify.handler_list_head = handler_list_head; info->notify.global = &acpi_gbl_global_notify[handler_list_id]; - ACPI_DEBUG_PRINT((ACPI_DB_INFO, - "Dispatching Notify on [%4.4s] (%s) Value 0x%2.2X (%s) Node %p\n", + trace_printk("Dispatching Notify on [%4.4s] (%s) Value 0x%2.2X (%s) Node %px\n", acpi_ut_get_node_name(node), acpi_ut_get_type_name(node->type), notify_value, acpi_ut_get_notify_name(notify_value, ACPI_TYPE_ANY), - node)); + node); status = acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_ev_notify_dispatch, info); if (ACPI_FAILURE(status)) { acpi_ut_delete_generic_state(info); } - + if (num_req++ == 300) + tracing_off(); return (status); } diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 762c5d50b8fe..67d5ac9aa433 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -844,6 +844,16 @@ static void acpi_os_execute_deferred(struct work_struct *work) kfree(dpc); } +static void acpi_os_execute_deferred_notify(struct work_struct *work) +{ + struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); + + trace_printk("Start %px %pS(%px)\n", dpc, dpc->function, dpc->context); + dpc->function(dpc->context); + trace_printk("End %px %pS(%px)\n", dpc, dpc->function, dpc->context); + kfree(dpc); +} + #ifdef CONFIG_ACPI_DEBUGGER static struct acpi_debugger acpi_debugger; static bool acpi_debugger_initialized; @@ -1096,7 +1106,8 @@ acpi_status acpi_os_execute(acpi_execute_type type, */ if (type == OSL_NOTIFY_HANDLER) { queue = kacpi_notify_wq; - INIT_WORK(&dpc->work, acpi_os_execute_deferred); + trace_printk("Adding %pS %px <%px>\n", function, context, dpc); + INIT_WORK(&dpc->work, acpi_os_execute_deferred_notify); } else if (type == OSL_GPE_HANDLER) { queue = kacpid_wq; INIT_WORK(&dpc->work, acpi_os_execute_deferred); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-15 14:51 ` Sebastian Andrzej Siewior @ 2020-06-15 15:41 ` Stephen Berman 2020-06-15 15:58 ` Sebastian Andrzej Siewior 2020-06-16 7:14 ` Stephen Berman 1 sibling, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-15 15:41 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Mon, 15 Jun 2020 16:51:30 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-15 09:58:00 [+0200], Stephen Berman wrote: >> Ok, sorry, I had misunderstood, but now I've looked at the >> documentation. I had in fact already done `echo t > >> /proc/sysrq-trigger' in an xterm (as root) and there was no output. >> Later, after booting kernel 5.1.0 because of the message flooding with >> 5.6.4, I did `echo t > /proc/sysrq-trigger' in a virtual tty (also as >> root) and the only output was: `[ <timestamp>] sysrq: Show State'. Is >> this expected? (In /proc/sys/kernel/sysrq there is `1'.) I couldn't >> try it in a 5.6.4 virtual tty because of the message flooding (see >> below). > > If you do this "t" then there should be a lot of output on your console. > If you do this from an xterm then you can see the output after typing > "dmesg". The output should appear also in your system log. Ah, ok, I do see it in the log, it looks basically the same as the call trace I posted upthread. I wonder why there was no such output in the console, could there be some setting that suppresses it even though /proc/sys/kernel/sysrq has `1'? >> The kernel log shows 305 of these messages in the 4 minutes and 17 >> seconds between the start of klogd and when I rebooted. > > Okay, this is a lot. > >> > The interesting part is to see if there is a acpi_os_execute() adding a >> > specific event multiple times which does not complete. Maybe at runtime, >> > maybe at shutdown time. If that is the case then ignoring this specific >> > event might fix the shutdown problem. With all this information so far, >> > I don't see a relation with this problem and the commit… >> >> In those 3 minutes and 8 seconds there were 5 "Adding >> acpi_ev_asynch_enable_gpe" messages at the beginning, then 185 "Adding >> acpi_ev_notify_dispatch" messages, which kept coming until I rebooted. > > 5 acpi_ev_asynch_enable_gpe() in three minutes since boot isn't much. > 185 of "Adding acpi_ev_notify_dispatch" is a lot. > > But to be clear, only > acpi_os_execute(1109) Adding acpi_ev_notify_dispatch" messages? > No > "acpi_os_execute_deferred_notify() Start/End acpi_ev_notify_dispatch"? > > The syslog should have captured the log on disk. There were in fact 99 "Start .* acpi_ev_notify_dispatch" messages and 98 "End .* acpi_ev_notify_dispatch" messages. Here's the last of them before I rebooted: Jun 15 08:58:25 strobe-jhalfs kernel: [ 193.315014] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a66e0 <ffff8d7aabb38480> Jun 15 08:58:25 strobe-jhalfs kernel: [ 193.319929] acpi_os_execute_deferred_notify(853) End ffff8d7aa8758d00 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7eb0) Jun 15 08:58:25 strobe-jhalfs kernel: [ 193.321242] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758a80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e70a0) Jun 15 08:58:26 strobe-jhalfs kernel: [ 194.339017] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a6730 <ffff8d7aabb384c0> Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.363005] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a6780 <ffff8d7aabb38500> Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.367978] acpi_os_execute_deferred_notify(853) End ffff8d7aa8758a80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e70a0) Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.369234] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758240 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7a00) Jun 15 08:58:28 strobe-jhalfs kernel: [ 196.387017] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a68c0 <ffff8d7aabb38540> Jun 15 08:58:28 strobe-jhalfs kernel: [ 196.388508] acpi_os_execute_deferred_notify(853) End ffff8d7aa8758240 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7a00) Jun 15 08:58:28 strobe-jhalfs kernel: [ 196.390128] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758b80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7460) Jun 15 08:58:29 strobe-jhalfs kernel: [ 197.208714] wlan1: deauthenticating from 7c:ff:4d:08:df:22 by local choice (Reason: 3=DEAUTH_LEAVING) Jun 15 08:58:29 strobe-jhalfs kernel: [ 197.232214] ip (4614) used greatest stack depth: 11272 bytes left Jun 15 08:58:29 strobe-jhalfs kernel: Kernel logging (proc) stopped. Jun 15 08:58:29 strobe-jhalfs kernel: Kernel log daemon terminating. > I attached a modified acpi_dbg.patch. Please enable: > - CONFIG_ACPI_DEBUG=y > > Looking at your 5.1 you have tracing enabled (hope it still is). > > The attached patch will dump the date into the tracing buffer, so you > console should remain "clean". Once it records 300 of those "requests > for acpi_ev_notify_dispatch" it will stop recording. > After 4-5 minutes please do > cat /sys/kernel/debug/tracing/trace > file.txt > > compress it and send it. Will do as soon as I can, hopefully later today or tomorrow. Thanks. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-15 15:41 ` Stephen Berman @ 2020-06-15 15:58 ` Sebastian Andrzej Siewior 2020-06-15 16:19 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-15 15:58 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On 2020-06-15 17:41:06 [+0200], Stephen Berman wrote: > > If you do this "t" then there should be a lot of output on your console. > > If you do this from an xterm then you can see the output after typing > > "dmesg". The output should appear also in your system log. > > Ah, ok, I do see it in the log, it looks basically the same as the call > trace I posted upthread. I wonder why there was no such output in the > console, could there be some setting that suppresses it even though > /proc/sys/kernel/sysrq has `1'? You have ignore_loglevel soo everything should appear on your console. The is true for your tty / ctrl+alt+f1 but your xterm shell. > There were in fact 99 "Start .* acpi_ev_notify_dispatch" messages and 98 > "End .* acpi_ev_notify_dispatch" messages. Here's the last of them > before I rebooted: > > Jun 15 08:58:25 strobe-jhalfs kernel: [ 193.315014] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a66e0 <ffff8d7aabb38480> > Jun 15 08:58:25 strobe-jhalfs kernel: [ 193.319929] acpi_os_execute_deferred_notify(853) End ffff8d7aa8758d00 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7eb0) > Jun 15 08:58:25 strobe-jhalfs kernel: [ 193.321242] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758a80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e70a0) > Jun 15 08:58:26 strobe-jhalfs kernel: [ 194.339017] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a6730 <ffff8d7aabb384c0> > Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.363005] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a6780 <ffff8d7aabb38500> > Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.367978] acpi_os_execute_deferred_notify(853) End ffff8d7aa8758a80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e70a0) > Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.369234] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758240 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7a00) > Jun 15 08:58:28 strobe-jhalfs kernel: [ 196.387017] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a68c0 <ffff8d7aabb38540> > Jun 15 08:58:28 strobe-jhalfs kernel: [ 196.388508] acpi_os_execute_deferred_notify(853) End ffff8d7aa8758240 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7a00) > Jun 15 08:58:28 strobe-jhalfs kernel: [ 196.390128] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758b80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e7460) > Jun 15 08:58:29 strobe-jhalfs kernel: [ 197.208714] wlan1: deauthenticating from 7c:ff:4d:08:df:22 by local choice (Reason: 3=DEAUTH_LEAVING) > Jun 15 08:58:29 strobe-jhalfs kernel: [ 197.232214] ip (4614) used greatest stack depth: 11272 bytes left > Jun 15 08:58:29 strobe-jhalfs kernel: Kernel logging (proc) stopped. > Jun 15 08:58:29 strobe-jhalfs kernel: Kernel log daemon terminating. Okay. So there is "Adding" of four events/work items in total, each almost every second. Processing of one work-item took two seconds, the other one took a second. These events are "old" so I don't see the "adding" line for the Start/End. You could look in the log for ffff8d7aa8758a80 to seen when it was enqueued but it feels like ACPI enqueues more events then it is capable of processing. > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-15 15:58 ` Sebastian Andrzej Siewior @ 2020-06-15 16:19 ` Stephen Berman 2020-06-15 16:32 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-15 16:19 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Mon, 15 Jun 2020 17:58:46 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-15 17:41:06 [+0200], Stephen Berman wrote: >> > If you do this "t" then there should be a lot of output on your console. >> > If you do this from an xterm then you can see the output after typing >> > "dmesg". The output should appear also in your system log. >> >> Ah, ok, I do see it in the log, it looks basically the same as the call >> trace I posted upthread. I wonder why there was no such output in the >> console, could there be some setting that suppresses it even though >> /proc/sys/kernel/sysrq has `1'? > > You have ignore_loglevel soo everything should appear on your console. > The is true for your tty / ctrl+alt+f1 but your xterm shell. Yet that appears not to be the case here, or I'm still doing something wrong 🤷 >> There were in fact 99 "Start .* acpi_ev_notify_dispatch" messages and 98 >> "End .* acpi_ev_notify_dispatch" messages. Here's the last of them >> before I rebooted: [...] >> acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758a80 >> acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e70a0) >> Jun 15 08:58:26 strobe-jhalfs kernel: [ 194.339017] acpi_os_execute(1109) >> Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a6730 <ffff8d7aabb384c0> >> Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.363005] acpi_os_execute(1109) >> Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa99a6780 <ffff8d7aabb38500> >> Jun 15 08:58:27 strobe-jhalfs kernel: [ 195.367978] >> acpi_os_execute_deferred_notify(853) End ffff8d7aa8758a80 [...] > Okay. So there is "Adding" of four events/work items in total, each > almost every second. > Processing of one work-item took two seconds, the other one took a > second. These events are "old" so I don't see the "adding" line for the > Start/End. > You could look in the log for ffff8d7aa8758a80 to seen when it was > enqueued but it feels like ACPI enqueues more events then it is capable > of processing. It's there, 196 lines above the Start ffff8d7aa8758a80 line: Jun 15 08:56:58 strobe-jhalfs kernel: [ 106.275356] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa84e70a0 <ffff8d7aa8758a80> Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-15 16:19 ` Stephen Berman @ 2020-06-15 16:32 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-15 16:32 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On 2020-06-15 18:19:18 [+0200], Stephen Berman wrote: > It's there, 196 lines above the Start ffff8d7aa8758a80 line: > > Jun 15 08:56:58 strobe-jhalfs kernel: [ 106.275356] acpi_os_execute(1109) Adding acpi_ev_notify_dispatch+0x0/0x55 ffff8d7aa84e70a0 <ffff8d7aa8758a80> So compared with | [ 193.321242] acpi_os_execute_deferred_notify(851) Start ffff8d7aa8758a80 acpi_ev_notify_dispatch+0x0/0x55(ffff8d7aa84e70a0) it took ~87 seconds for the item to start since it was enqueued. Assuming that workqueue still makes progress on shutdown it may just a long time to flush all worker on shutdown. I still have no idea how the commit in question is responsible for that behaviour… > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-15 14:51 ` Sebastian Andrzej Siewior 2020-06-15 15:41 ` Stephen Berman @ 2020-06-16 7:14 ` Stephen Berman 2020-06-16 7:38 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-16 7:14 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Mon, 15 Jun 2020 16:51:30 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > I attached a modified acpi_dbg.patch. Please enable: > - CONFIG_ACPI_DEBUG=y > > Looking at your 5.1 you have tracing enabled (hope it still is). > > The attached patch will dump the date into the tracing buffer, so you > console should remain "clean". Once it records 300 of those "requests > for acpi_ev_notify_dispatch" it will stop recording. > After 4-5 minutes please do > cat /sys/kernel/debug/tracing/trace > file.txt > > compress it and send it. I set CONFIG_ACPI_DEBUG=y, applied the patch and rebuilt 5.6.4, but: $ cat /sys/kernel/debug/tracing/trace > trace.txt cat: /sys/kernel/debug/tracing/trace: No such file or directory Here are all the 5.6.4 config options with "TRACING" or "TRACE"; do I need to set any that aren't set? CONFIG_SYSCTL_EXCEPTION_TRACE=y CONFIG_TRACEPOINTS=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_PM_TRACE=y CONFIG_PM_TRACE_RTC=y CONFIG_HAVE_KPROBES_ON_FTRACE=y CONFIG_HAVE_ARCH_TRACEHOOK=y CONFIG_HAVE_RELIABLE_STACKTRACE=y # CONFIG_MAC80211_MESSAGE_TRACING is not set # CONFIG_DMA_FENCE_TRACE is not set CONFIG_IWLWIFI_DEVICE_TRACING=y # CONFIG_TRACE_SINK is not set # HW tracing support # end of HW tracing support CONFIG_STACKTRACE=y CONFIG_RCU_TRACE=y CONFIG_USER_STACKTRACE_SUPPORT=y CONFIG_NOP_TRACER=y CONFIG_HAVE_FUNCTION_TRACER=y CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y CONFIG_HAVE_DYNAMIC_FTRACE=y CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y CONFIG_HAVE_SYSCALL_TRACEPOINTS=y CONFIG_TRACE_CLOCK=y CONFIG_EVENT_TRACING=y CONFIG_CONTEXT_SWITCH_TRACER=y CONFIG_TRACING=y CONFIG_GENERIC_TRACER=y CONFIG_TRACING_SUPPORT=y CONFIG_FTRACE=y # CONFIG_BOOTTIME_TRACING is not set # CONFIG_FUNCTION_TRACER is not set # CONFIG_STACK_TRACER is not set # CONFIG_IRQSOFF_TRACER is not set # CONFIG_SCHED_TRACER is not set # CONFIG_HWLAT_TRACER is not set # CONFIG_MMIOTRACE is not set # CONFIG_FTRACE_SYSCALLS is not set # CONFIG_TRACER_SNAPSHOT is not set CONFIG_BLK_DEV_IO_TRACE=y # CONFIG_TRACE_EVENT_INJECT is not set # CONFIG_TRACEPOINT_BENCHMARK is not set # CONFIG_TRACE_EVAL_MAP_FILE is not set # CONFIG_FTRACE_STARTUP_TEST is not set CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_HAVE_MMIOTRACE_SUPPORT=y CONFIG_CC_HAS_SANCOV_TRACE_PC=y # CONFIG_BACKTRACE_SELF_TEST is not set Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-16 7:14 ` Stephen Berman @ 2020-06-16 7:38 ` Sebastian Andrzej Siewior 2020-06-16 8:13 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-16 7:38 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On 2020-06-16 09:14:37 [+0200], Stephen Berman wrote: > > I set CONFIG_ACPI_DEBUG=y, applied the patch and rebuilt 5.6.4, but: > > $ cat /sys/kernel/debug/tracing/trace > trace.txt > cat: /sys/kernel/debug/tracing/trace: No such file or directory > > Here are all the 5.6.4 config options with "TRACING" or "TRACE"; do I > need to set any that aren't set? I see that "Kernel hacking ---> Tracers" is enabled. You should have one tracer enabled for that trace_printk() to work. You have context switch tracer so it should be enough. You might need CONFIG_DEBUG_FS (but it is set in your 5.1 config) or it is just not mounted. So I have here: | debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime) | tracefs on /sys/kernel/debug/tracing type tracefs (rw,nosuid,nodev,noexec,relatime) Does it work for you if you mount it? > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-16 7:38 ` Sebastian Andrzej Siewior @ 2020-06-16 8:13 ` Stephen Berman 2020-06-16 15:55 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-16 8:13 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi [-- Attachment #1: Type: text/plain, Size: 1296 bytes --] On Tue, 16 Jun 2020 09:38:27 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-16 09:14:37 [+0200], Stephen Berman wrote: >> >> I set CONFIG_ACPI_DEBUG=y, applied the patch and rebuilt 5.6.4, but: >> >> $ cat /sys/kernel/debug/tracing/trace > trace.txt >> cat: /sys/kernel/debug/tracing/trace: No such file or directory >> >> Here are all the 5.6.4 config options with "TRACING" or "TRACE"; do I >> need to set any that aren't set? > > I see that "Kernel hacking ---> Tracers" is enabled. You should have > one tracer enabled for that trace_printk() to work. You have context > switch tracer so it should be enough. > > You might need CONFIG_DEBUG_FS (but it is set in your 5.1 config) or it > is just not mounted. So I have here: > > | debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime) > | tracefs on /sys/kernel/debug/tracing type tracefs (rw,nosuid,nodev,noexec,relatime) > > Does it work for you if you mount it? # mount -t debugfs debugfs /sys/kernel/debug # mount -t tracefs tracefs /sys/kernel/debug/tracing # mount ... debugfs on /sys/kernel/debug type debugfs (rw,relatime) tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime) Yes, thanks, that did it. Trace attached. Steve Berman [-- Attachment #2: trace.txt.gz --] [-- Type: application/octet-stream, Size: 10705 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-16 8:13 ` Stephen Berman @ 2020-06-16 15:55 ` Sebastian Andrzej Siewior 2020-06-16 20:28 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-16 15:55 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On 2020-06-16 10:13:27 [+0200], Stephen Berman wrote: > Yes, thanks, that did it. Trace attached. So TZ10 is a temperature sensor of some kind on your motherboard. In your v5.6 dmesg there is: | thermal LNXTHERM:00: registered as thermal_zone0 | ACPI: Thermal Zone [TZ10] (17 C) So. In /sys/class/thermal/thermal_zone0/device/path you should also see TZ10. And /sys/class/thermal/thermal_zone0/temp should show the actual value. This comes from the "thermal" module. Looking at the trace, might query the temperature every second which somehow results in "Dispatching Notify on". I don't understand how it gets from reading of the temperature to the notify part, maybe it is part of the ACPI… However. Could you please make sure that the thermal module is not loaded at system startup? Adding thermal.off=1 to the kernel commandline should do the trick. And you should see thermal control disabled in dmesg. That means your thermal_zone0 with TZ10 does not show up in /sys and nothing should schedule the work-items. This in turn should allow you to shutdown your system without the delay. If this works, could you please try to load the module with tzp=300? If you add this thermal.tzp=300 to the kernel commandline then it should do the trick. You can verify it by cat /sys/module/thermal/parameters/tzp This should change the polling interval from what ACPI says to 30secs. This should ensure that you don't have so many worker waiting. So you should also be able to shutdown the system. > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-16 15:55 ` Sebastian Andrzej Siewior @ 2020-06-16 20:28 ` Stephen Berman 2020-06-17 14:27 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-16 20:28 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Tue, 16 Jun 2020 17:55:01 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-16 10:13:27 [+0200], Stephen Berman wrote: >> Yes, thanks, that did it. Trace attached. > > So TZ10 is a temperature sensor of some kind on your motherboard. In > your v5.6 dmesg there is: > | thermal LNXTHERM:00: registered as thermal_zone0 > | ACPI: Thermal Zone [TZ10] (17 C) > > So. In /sys/class/thermal/thermal_zone0/device/path you should also see > TZ10. And /sys/class/thermal/thermal_zone0/temp should show the actual > value. > This comes from the "thermal" module. Yes, TZ10 was in the thermal_zone0/device/path and the value in thermal_zone0/temp was 16800. > Looking at the trace, might query the temperature every second which > somehow results in "Dispatching Notify on". I don't understand how it > gets from reading of the temperature to the notify part, maybe it is > part of the ACPI… > > However. Could you please make sure that the thermal module is not > loaded at system startup? Adding > thermal.off=1 > > to the kernel commandline should do the trick. And you should see > thermal control disabled > > in dmesg. Confirmed. And the value in thermal_zone0/temp was now 33000. > That means your thermal_zone0 with TZ10 does not show up in > /sys and nothing should schedule the work-items. This in turn should > allow you to shutdown your system without the delay. It did! > If this works, could you please try to load the module with tzp=300? > If you add this > thermal.tzp=300 > > to the kernel commandline then it should do the trick. You can verify it > by > cat /sys/module/thermal/parameters/tzp > > This should change the polling interval from what ACPI says to 30secs. > This should ensure that you don't have so many worker waiting. So you > should also be able to shutdown the system. Your assessment and predictions are right on the mark! I'm fine with the thermal.tzp=300 workaround, but it would be good to find out why this problem started with commit 6d25be57, if my git bisection was correct, or if it wasn't, then at least somewhere between 5.1.0 and 5.2.0. Or can you already deduce why? If not, I'd be more than happy to continue applying any patches or trying any suggestions you have, if you want to continue debugging this issue. In any case, thanks for pursuing it to this point. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-16 20:28 ` Stephen Berman @ 2020-06-17 14:27 ` Sebastian Andrzej Siewior 2020-06-17 21:09 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-17 14:27 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On 2020-06-16 22:28:43 [+0200], Stephen Berman wrote: > Your assessment and predictions are right on the mark! perfect. > I'm fine with the thermal.tzp=300 workaround, but it would be good to > find out why this problem started with commit 6d25be57, if my git > bisection was correct, or if it wasn't, then at least somewhere between > 5.1.0 and 5.2.0. Or can you already deduce why? If not, I'd be more > than happy to continue applying any patches or trying any suggestions > you have, if you want to continue debugging this issue. In any case, > thanks for pursuing it to this point. I have no idea why the commit in question should make any difference. Could please apply the tracing patch on v5.1 and send the trace? > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-17 14:27 ` Sebastian Andrzej Siewior @ 2020-06-17 21:09 ` Stephen Berman 2020-06-24 20:11 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-17 21:09 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi [-- Attachment #1: Type: text/plain, Size: 864 bytes --] On Wed, 17 Jun 2020 16:27:34 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-16 22:28:43 [+0200], Stephen Berman wrote: >> Your assessment and predictions are right on the mark! > perfect. > >> I'm fine with the thermal.tzp=300 workaround, but it would be good to >> find out why this problem started with commit 6d25be57, if my git >> bisection was correct, or if it wasn't, then at least somewhere between >> 5.1.0 and 5.2.0. Or can you already deduce why? If not, I'd be more >> than happy to continue applying any patches or trying any suggestions >> you have, if you want to continue debugging this issue. In any case, >> thanks for pursuing it to this point. > > I have no idea why the commit in question should make any difference. > Could please apply the tracing patch on v5.1 and send the trace? Attached. Steve Berman [-- Attachment #2: trace-5.1.0.txt.gz --] [-- Type: application/octet-stream, Size: 10005 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-17 21:09 ` Stephen Berman @ 2020-06-24 20:11 ` Sebastian Andrzej Siewior 2020-06-24 21:49 ` Stephen Berman 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-06-24 20:11 UTC (permalink / raw) To: Stephen Berman; +Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On 2020-06-17 23:09:44 [+0200], Stephen Berman wrote: > > Attached. I did not forget about this but had recently little time to look into this. > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-24 20:11 ` Sebastian Andrzej Siewior @ 2020-06-24 21:49 ` Stephen Berman 2020-07-14 13:44 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-06-24 21:49 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi On Wed, 24 Jun 2020 22:11:56 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-06-17 23:09:44 [+0200], Stephen Berman wrote: >> >> Attached. > > I did not forget about this but had recently little time to look into > this. No problem! Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-06-24 21:49 ` Stephen Berman @ 2020-07-14 13:44 ` Sebastian Andrzej Siewior 2020-07-14 13:54 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-07-14 13:44 UTC (permalink / raw) To: Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-acpi, Stephen Berman, devel On 2020-06-24 23:49:52 [+0200], Stephen Berman wrote: Let me summarize the thread here: On Stephen's system, ACPI informs the thermal zone driver to poll the temperature every second and the driver does so. The driver queries the temperature by invoking acpi_evaluate_integer() which invokes (at some point) acpi_ev_queue_notify_request(). This then invokes acpi_os_execute_deferred() via queue_work_on(, kacpi_notify_wq, ) acpi_os_execute_deferred() invokes acpi_ev_notify_dispatch() and this is no longer synchronised with the initial acpi_evaluate_integer() request. Before commit 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock") that function took on average 1.023993 seconds to complete. The interval when the thermal driver invokes acpi_evaluate_integer() isn't exactly 1 second but almost (it is not an absolute timer so). Still it looks that one function slowly overtakes the other. After 5 Minutes uptime there is: | kworker/0:1-12 [000] 312.315565: acpi_ev_queue_notify_request: Dispatching Notify on [TZ10] (Thermal) Value 0x81 (Information Change) Node ffff9935ab20f7a8 | kworker/0:1-12 [000] 312.315567: acpi_os_execute: Adding acpi_ev_notify_dispatch+0x0/0x5a ffff9935a6c64050 <ffff9935a7eb5e00> Enqueue worker with job ffff9935a7eb5e00 | kworker/0:1-12 [000] 312.315596: acpi_os_execute_deferred_notify: End ffff9935a7eb5c80 acpi_ev_notify_dispatch+0x0/0x5a(ffff9935a6c64c80) | kworker/0:1-12 [000] 312.315607: acpi_os_execute_deferred_notify: Start ffff9935a7eb5d80 acpi_ev_notify_dispatch+0x0/0x5a(ffff9935a6c64dc0) previous worker completed, another (already enqueued) started. | kworker/0:1-12 [000] 313.339564: acpi_ev_queue_notify_request: Dispatching Notify on [TZ10] (Thermal) Value 0x81 (Information Change) Node ffff9935ab20f7a8 | kworker/0:1-12 [000] 313.339566: acpi_os_execute: Adding acpi_ev_notify_dispatch+0x0/0x5a ffff9935a6c64f00 <ffff9935a7eb5c80> another one enqueued. | kworker/0:1-12 [000] 313.339595: acpi_os_execute_deferred_notify: End ffff9935a7eb5d80 acpi_ev_notify_dispatch+0x0/0x5a(ffff9935a6c64dc0) | kworker/0:1-12 [000] 313.339597: acpi_os_execute_deferred_notify: Start ffff9935a7eb5e00 acpi_ev_notify_dispatch+0x0/0x5a(ffff9935a6c64050) finally, job ffff9935a7eb5e00 started (one second after enqueue). | kworker/0:1-12 [000] 314.363571: acpi_ev_queue_notify_request: Dispatching Notify on [TZ10] (Thermal) Value 0x81 (Information Change) Node ffff9935ab20f7a8 | kworker/0:1-12 [000] 314.363574: acpi_os_execute: Adding acpi_ev_notify_dispatch+0x0/0x5a ffff9935a6c646e0 <ffff9935a7eb5d80> | kworker/0:1-12 [000] 314.363608: acpi_os_execute_deferred_notify: End ffff9935a7eb5e00 acpi_ev_notify_dispatch+0x0/0x5a(ffff9935a6c64050) and ended, two seconds after enqueue. Before it ended, the system enqueued two more jobs. And this just within 5 Minutes of uptime. The worker pile slowly up. According to Stephen, after previously mentioned commit the situation gets worse. According to tracing the execution time of acpi_ev_notify_dispatch() varies between 1.023990 (like before) and 2.048005 seconds. While I don't have an explanation for the changed behaviour (or see anything wrong the commit in question), the pile up of worker increased. Stephen noticed this because system shutdown flushes kacpi_notify_wq and this takes quite some time (with that amount of worker pending). I tried a few test cases to reproduce that behaviour but failed. However, even before that commit in question the situation is far from perfect: - There is the lack of synchronisation between thermal driver's requests and its tail (that acpi_ev_notify_dispatch() part). - Do we have to respect ACPI interval of one seconds. Is the hardware really so accurate that it can change noticeable in one second. - The requests are already back to back which keeps the CPU busy (maybe sched-switch will reveal that the CPU is idle most of the time). So... Is there a simple way to synchronize the ACPI part? The obvious thing would be flush_workqueue(kacpi_notify_wq); or acpi_os_wait_events_complete() in thermal_get_temp(). Would it make sense to ensure that the polling interval is no less than 10 seconds? > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-14 13:44 ` Sebastian Andrzej Siewior @ 2020-07-14 13:54 ` Rafael J. Wysocki 2020-07-14 14:11 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Rafael J. Wysocki @ 2020-07-14 13:54 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, Stephen Berman, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Tue, Jul 14, 2020 at 3:44 PM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:> > On 2020-06-24 23:49:52 [+0200], Stephen Berman wrote: > > Let me summarize the thread here: > > On Stephen's system, ACPI informs the thermal zone driver to poll the > temperature every second and the driver does so. > The driver queries the temperature by invoking acpi_evaluate_integer() > which invokes (at some point) acpi_ev_queue_notify_request(). Well, I don't quite see how acpi_ev_queue_notify_request() can be invoked from the acpi_evaluate_integer() code path. Do you have a call trace showing that? > This then invokes acpi_os_execute_deferred() via > queue_work_on(, kacpi_notify_wq, ) > > acpi_os_execute_deferred() invokes acpi_ev_notify_dispatch() and this is > no longer synchronised with the initial acpi_evaluate_integer() request. That indeed would be the case, but I first need to understand what's going on with the acpi_evaluate_integer() causing a Notify () to be queued up. Cheers! ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-14 13:54 ` Rafael J. Wysocki @ 2020-07-14 14:11 ` Sebastian Andrzej Siewior 2020-07-14 15:53 ` Rafael J. Wysocki 2020-07-19 10:07 ` Stephen Berman 0 siblings, 2 replies; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-07-14 14:11 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, Stephen Berman, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) [-- Attachment #1: Type: text/plain, Size: 1292 bytes --] On 2020-07-14 15:54:57 [+0200], Rafael J. Wysocki wrote: > On Tue, Jul 14, 2020 at 3:44 PM Sebastian Andrzej Siewior > <bigeasy@linutronix.de> wrote:> > > On 2020-06-24 23:49:52 [+0200], Stephen Berman wrote: > > > > Let me summarize the thread here: > > > > On Stephen's system, ACPI informs the thermal zone driver to poll the > > temperature every second and the driver does so. > > The driver queries the temperature by invoking acpi_evaluate_integer() > > which invokes (at some point) acpi_ev_queue_notify_request(). > > Well, I don't quite see how acpi_ev_queue_notify_request() can be > invoked from the acpi_evaluate_integer() code path. > > Do you have a call trace showing that? So the trace in https://lore.kernel.org/linux-acpi/87o8pjh1i0.fsf@gmx.net/ shows the pattern and we nailed it down that it comes from thermal_get_temp(). I assumed acpi_ex_opcode_2A_0T_0R() since the other candidate was acpi_ev_asynch_execute_gpe_method(). Stephen, the patch attached adds a WARN_ON() statement which will produce a stack trace (4 or so). Could please run 'dmesg' after a while and send it back. There should be a "WARNING in drivers/acpi/acpica/evmisc.c" statement or something along the lines. Rafael, are you also interested in an ACPI dump? > Cheers! Sebastian [-- Attachment #2: acpi_dbg.patch --] [-- Type: text/x-diff, Size: 4074 bytes --] diff --git a/drivers/acpi/acpica/evgpe.c b/drivers/acpi/acpica/evgpe.c index 3e39907fedd9f..816e23d103c63 100644 --- a/drivers/acpi/acpica/evgpe.c +++ b/drivers/acpi/acpica/evgpe.c @@ -477,6 +477,7 @@ static void ACPI_SYSTEM_XFACE acpi_ev_asynch_execute_gpe_method(void *context) * notifies on multiple device objects. */ notify = gpe_event_info->dispatch.notify_list; + trace_printk("notify %px\n", gpe_event_info); while (ACPI_SUCCESS(status) && notify) { status = acpi_ev_queue_notify_request(notify->device_node, @@ -484,6 +485,7 @@ static void ACPI_SYSTEM_XFACE acpi_ev_asynch_execute_gpe_method(void *context) notify = notify->next; } + trace_printk("Event notify %px done %x\n", gpe_event_info, status); break; @@ -491,6 +493,9 @@ static void ACPI_SYSTEM_XFACE acpi_ev_asynch_execute_gpe_method(void *context) /* Allocate the evaluation information block */ + trace_printk("Evaluate %4.4s\n", + acpi_ut_get_node_name(gpe_event_info->dispatch.method_node)); + info = ACPI_ALLOCATE_ZEROED(sizeof(struct acpi_evaluate_info)); if (!info) { status = AE_NO_MEMORY; diff --git a/drivers/acpi/acpica/evmisc.c b/drivers/acpi/acpica/evmisc.c index ce1eda6beb845..d8e3f426729db 100644 --- a/drivers/acpi/acpica/evmisc.c +++ b/drivers/acpi/acpica/evmisc.c @@ -72,6 +72,7 @@ acpi_ev_queue_notify_request(struct acpi_namespace_node *node, u32 notify_value) union acpi_generic_state *info; u8 handler_list_id = 0; acpi_status status = AE_OK; + static int num_req = 0; ACPI_FUNCTION_NAME(ev_queue_notify_request); @@ -129,19 +130,20 @@ acpi_ev_queue_notify_request(struct acpi_namespace_node *node, u32 notify_value) info->notify.handler_list_head = handler_list_head; info->notify.global = &acpi_gbl_global_notify[handler_list_id]; - ACPI_DEBUG_PRINT((ACPI_DB_INFO, - "Dispatching Notify on [%4.4s] (%s) Value 0x%2.2X (%s) Node %p\n", + trace_printk("Dispatching Notify on [%4.4s] (%s) Value 0x%2.2X (%s) Node %px\n", acpi_ut_get_node_name(node), acpi_ut_get_type_name(node->type), notify_value, acpi_ut_get_notify_name(notify_value, ACPI_TYPE_ANY), - node)); + node); + WARN_ON(num_req > 10 && num_req < 15); status = acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_ev_notify_dispatch, info); if (ACPI_FAILURE(status)) { acpi_ut_delete_generic_state(info); } - + if (num_req++ == 300) + tracing_off(); return (status); } diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 762c5d50b8fe2..67d5ac9aa433d 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -844,6 +844,16 @@ static void acpi_os_execute_deferred(struct work_struct *work) kfree(dpc); } +static void acpi_os_execute_deferred_notify(struct work_struct *work) +{ + struct acpi_os_dpc *dpc = container_of(work, struct acpi_os_dpc, work); + + trace_printk("Start %px %pS(%px)\n", dpc, dpc->function, dpc->context); + dpc->function(dpc->context); + trace_printk("End %px %pS(%px)\n", dpc, dpc->function, dpc->context); + kfree(dpc); +} + #ifdef CONFIG_ACPI_DEBUGGER static struct acpi_debugger acpi_debugger; static bool acpi_debugger_initialized; @@ -1096,7 +1106,8 @@ acpi_status acpi_os_execute(acpi_execute_type type, */ if (type == OSL_NOTIFY_HANDLER) { queue = kacpi_notify_wq; - INIT_WORK(&dpc->work, acpi_os_execute_deferred); + trace_printk("Adding %pS %px <%px>\n", function, context, dpc); + INIT_WORK(&dpc->work, acpi_os_execute_deferred_notify); } else if (type == OSL_GPE_HANDLER) { queue = kacpid_wq; INIT_WORK(&dpc->work, acpi_os_execute_deferred); diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c index 19067a5e52934..883eb37592639 100644 --- a/drivers/acpi/thermal.c +++ b/drivers/acpi/thermal.c @@ -191,7 +191,9 @@ static int acpi_thermal_get_temperature(struct acpi_thermal *tz) tz->last_temperature = tz->temperature; + printk("Query temp\n"); status = acpi_evaluate_integer(tz->device->handle, "_TMP", NULL, &tmp); + printk("Query temp done: %d\n", status); if (ACPI_FAILURE(status)) return -ENODEV; ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-14 14:11 ` Sebastian Andrzej Siewior @ 2020-07-14 15:53 ` Rafael J. Wysocki 2020-07-14 16:10 ` Sebastian Andrzej Siewior 2020-08-11 10:27 ` Sebastian Andrzej Siewior 2020-07-19 10:07 ` Stephen Berman 1 sibling, 2 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2020-07-14 15:53 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, Stephen Berman, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Tue, Jul 14, 2020 at 4:11 PM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > On 2020-07-14 15:54:57 [+0200], Rafael J. Wysocki wrote: > > On Tue, Jul 14, 2020 at 3:44 PM Sebastian Andrzej Siewior > > <bigeasy@linutronix.de> wrote:> > > > On 2020-06-24 23:49:52 [+0200], Stephen Berman wrote: > > > > > > Let me summarize the thread here: > > > > > > On Stephen's system, ACPI informs the thermal zone driver to poll the > > > temperature every second and the driver does so. > > > The driver queries the temperature by invoking acpi_evaluate_integer() > > > which invokes (at some point) acpi_ev_queue_notify_request(). > > > > Well, I don't quite see how acpi_ev_queue_notify_request() can be > > invoked from the acpi_evaluate_integer() code path. > > > > Do you have a call trace showing that? > > So the trace in > https://lore.kernel.org/linux-acpi/87o8pjh1i0.fsf@gmx.net/ > > shows the pattern and we nailed it down that it comes from > thermal_get_temp(). acpi_evaluate_integer() doesn't show up in the trace, though, AFAICS. > I assumed acpi_ex_opcode_2A_0T_0R() since the other > candidate was acpi_ev_asynch_execute_gpe_method(). Which probably is the case. Specifically acpi_ev_asynch_execute_gpe_method: Evaluate _L66 is likely to cause the Notify() to be dispatched. > Stephen, the patch attached adds a WARN_ON() statement which will > produce a stack trace (4 or so). Could please run 'dmesg' after a while > and send it back. There should be a > "WARNING in drivers/acpi/acpica/evmisc.c" > > statement or something along the lines. > > Rafael, are you also interested in an ACPI dump? That might help a bit. So what probably happens is that poking at the TZ causes a GPE to trigger and a Notify() to be dispatched which then goes into the workqueue for execution. Now, I'm not sure what happens to those Notify() items, though. They each should cause a handler (in the thermal driver) to be executed, but does that happen? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-14 15:53 ` Rafael J. Wysocki @ 2020-07-14 16:10 ` Sebastian Andrzej Siewior 2020-08-11 10:27 ` Sebastian Andrzej Siewior 1 sibling, 0 replies; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-07-14 16:10 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, Stephen Berman, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-07-14 17:53:15 [+0200], Rafael J. Wysocki wrote: > > shows the pattern and we nailed it down that it comes from > > thermal_get_temp(). > > acpi_evaluate_integer() doesn't show up in the trace, though, AFAICS. I deducted it. acpi_thermal_get_temperature() -> acpi_evaluate_integer() and the dmesg showed | ACPI: Invalid passive threshold | thermal LNXTHERM:00: registered as thermal_zone0 | ACPI: Thermal Zone [TZ10] (17 C) | thermal LNXTHERM:01: registered as thermal_zone1 | ACPI: Thermal Zone [TZ00] (28 C) and overriding thermal.tzp to a higher value made the issue go away. There is a complete dmesg https://lore.kernel.org/lkml/87v9l65d2y.fsf@gmx.net/ > > I assumed acpi_ex_opcode_2A_0T_0R() since the other > > candidate was acpi_ev_asynch_execute_gpe_method(). > > Which probably is the case. Specifically > > acpi_ev_asynch_execute_gpe_method: Evaluate _L66 > > is likely to cause the Notify() to be dispatched. Okay. I hope Stephen replies later with a backtrace. > > Stephen, the patch attached adds a WARN_ON() statement which will > > produce a stack trace (4 or so). Could please run 'dmesg' after a while > > and send it back. There should be a > > "WARNING in drivers/acpi/acpica/evmisc.c" > > > > statement or something along the lines. > > > > Rafael, are you also interested in an ACPI dump? > > That might help a bit. > > So what probably happens is that poking at the TZ causes a GPE to > trigger and a Notify() to be dispatched which then goes into the > workqueue for execution. > > Now, I'm not sure what happens to those Notify() items, though. They > each should cause a handler (in the thermal driver) to be executed, > but does that happen? Are you asking for a trace point in acpi_ev_notify_dispatch() to see what it actually calls as in info->notify.global->handler / handler_obj->notify.handler ? There is no handler in the thermal driver or do I just not recognize it? Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-14 15:53 ` Rafael J. Wysocki 2020-07-14 16:10 ` Sebastian Andrzej Siewior @ 2020-08-11 10:27 ` Sebastian Andrzej Siewior 2020-08-11 14:02 ` Rafael J. Wysocki 1 sibling, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-08-11 10:27 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, Stephen Berman, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-07-14 17:53:15 [+0200], Rafael J. Wysocki wrote: > acpi_evaluate_integer() doesn't show up in the trace, though, AFAICS. > > > I assumed acpi_ex_opcode_2A_0T_0R() since the other > > candidate was acpi_ev_asynch_execute_gpe_method(). > > Which probably is the case. Specifically > > acpi_ev_asynch_execute_gpe_method: Evaluate _L66 > > is likely to cause the Notify() to be dispatched. … > > Rafael, are you also interested in an ACPI dump? > > That might help a bit. > > So what probably happens is that poking at the TZ causes a GPE to > trigger and a Notify() to be dispatched which then goes into the > workqueue for execution. > > Now, I'm not sure what happens to those Notify() items, though. They > each should cause a handler (in the thermal driver) to be executed, > but does that happen? Stephen's trace contains a few backtraces, all of them look like this: | Call Trace: | acpi_ex_opcode_2A_0T_0R+0x93/0xdf | acpi_ds_exec_end_op+0x10d/0x701 | acpi_ps_parse_loop+0x7f2/0x8c3 | acpi_ps_parse_aml+0x1a5/0x540 | acpi_ps_execute_method+0x1fe/0x2ba | acpi_ns_evaluate+0x345/0x4e2 | acpi_evaluate_object+0x177/0x39f | acpi_evaluate_integer+0x4f/0x110 | acpi_thermal_get_temperature.part.0+0x45/0xc4 | thermal_get_temp.cold+0xc/0x2e | thermal_zone_get_temp+0x4c/0x70 | thermal_zone_device_update.part.0+0x2a/0x110 | acpi_thermal_notify+0xcf/0x140 | acpi_ev_notify_dispatch+0x45/0x5a | acpi_os_execute_deferred_notify+0x34/0x60 | process_one_work+0x1d2/0x3a0 | worker_thread+0x45/0x3c0 | kthread+0xf6/0x130 | ret_from_fork+0x35/0x40 so no GPE and it comes the notify callback while parsing the ACPI table. Any ideas? I guess acpi_ex_opcode_2A_0T_0R() uses the workqueue because it may sleep and it might be invoked from non-preemptible context. Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-08-11 10:27 ` Sebastian Andrzej Siewior @ 2020-08-11 14:02 ` Rafael J. Wysocki 0 siblings, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2020-08-11 14:02 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, Stephen Berman, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Tue, Aug 11, 2020 at 12:27 PM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > On 2020-07-14 17:53:15 [+0200], Rafael J. Wysocki wrote: > > acpi_evaluate_integer() doesn't show up in the trace, though, AFAICS. > > > > > I assumed acpi_ex_opcode_2A_0T_0R() since the other > > > candidate was acpi_ev_asynch_execute_gpe_method(). > > > > Which probably is the case. Specifically > > > > acpi_ev_asynch_execute_gpe_method: Evaluate _L66 > > > > is likely to cause the Notify() to be dispatched. > … > > > Rafael, are you also interested in an ACPI dump? > > > > That might help a bit. > > > > So what probably happens is that poking at the TZ causes a GPE to > > trigger and a Notify() to be dispatched which then goes into the > > workqueue for execution. > > > > Now, I'm not sure what happens to those Notify() items, though. They > > each should cause a handler (in the thermal driver) to be executed, > > but does that happen? > > Stephen's trace contains a few backtraces, all of them look like this: > > | Call Trace: > | acpi_ex_opcode_2A_0T_0R+0x93/0xdf > | acpi_ds_exec_end_op+0x10d/0x701 > | acpi_ps_parse_loop+0x7f2/0x8c3 > | acpi_ps_parse_aml+0x1a5/0x540 > | acpi_ps_execute_method+0x1fe/0x2ba > | acpi_ns_evaluate+0x345/0x4e2 > | acpi_evaluate_object+0x177/0x39f > | acpi_evaluate_integer+0x4f/0x110 > | acpi_thermal_get_temperature.part.0+0x45/0xc4 > | thermal_get_temp.cold+0xc/0x2e > | thermal_zone_get_temp+0x4c/0x70 > | thermal_zone_device_update.part.0+0x2a/0x110 > | acpi_thermal_notify+0xcf/0x140 > | acpi_ev_notify_dispatch+0x45/0x5a > | acpi_os_execute_deferred_notify+0x34/0x60 This is Notify () processing. The handler is acpi_thermal_notify() which calls acpi_thermal_check() which is just a wrapper around thermal_zone_device_update() doing update_temperature() and that calls thermal_zone_get_temp() (among other things) which invokes the ->get_temp() callback for the target thermal zone. In the ACPI thermal driver the ->get_temp callback is thermal_get_temp() which basically calls acpi_thermal_get_temperature() and that evaluates _TMP (for the target thermal zone). It looks like on the machine in question the _TMP method contains another Notify () targeting the same thermal zone which gets queued up via the acpi_ex_opcode_2A_0T_0R() at the top. Obviously that Notify () (when it gets executed) will cause acpi_thermal_notify() to be executed again and so on, ad infinitum unless the Notify () in _TMP is conditional on something. > | process_one_work+0x1d2/0x3a0 > | worker_thread+0x45/0x3c0 > | kthread+0xf6/0x130 > | ret_from_fork+0x35/0x40 > > so no GPE and it comes the notify callback while parsing the ACPI table. Right. > Any ideas? I guess acpi_ex_opcode_2A_0T_0R() uses the workqueue because > it may sleep and it might be invoked from non-preemptible context. No, it uses the workqueue because it queues up a Notify () which always goes through a workqueue (kacpi_notify_wq to be precise) and basically in order to avoid deadlocks (it runs under locks which may need to be acquired again to handle the notification). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-14 14:11 ` Sebastian Andrzej Siewior 2020-07-14 15:53 ` Rafael J. Wysocki @ 2020-07-19 10:07 ` Stephen Berman 2020-08-11 11:58 ` Stephen Berman 1 sibling, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-07-19 10:07 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) [-- Attachment #1: Type: text/plain, Size: 1495 bytes --] On Tue, 14 Jul 2020 16:11:35 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > On 2020-07-14 15:54:57 [+0200], Rafael J. Wysocki wrote: >> On Tue, Jul 14, 2020 at 3:44 PM Sebastian Andrzej Siewior >> <bigeasy@linutronix.de> wrote:> >> > On 2020-06-24 23:49:52 [+0200], Stephen Berman wrote: >> > >> > Let me summarize the thread here: >> > >> > On Stephen's system, ACPI informs the thermal zone driver to poll the >> > temperature every second and the driver does so. >> > The driver queries the temperature by invoking acpi_evaluate_integer() >> > which invokes (at some point) acpi_ev_queue_notify_request(). >> >> Well, I don't quite see how acpi_ev_queue_notify_request() can be >> invoked from the acpi_evaluate_integer() code path. >> >> Do you have a call trace showing that? > > So the trace in > https://lore.kernel.org/linux-acpi/87o8pjh1i0.fsf@gmx.net/ > > shows the pattern and we nailed it down that it comes from > thermal_get_temp(). I assumed acpi_ex_opcode_2A_0T_0R() since the other > candidate was acpi_ev_asynch_execute_gpe_method(). > > Stephen, the patch attached adds a WARN_ON() statement which will > produce a stack trace (4 or so). Could please run 'dmesg' after a while > and send it back. There should be a > "WARNING in drivers/acpi/acpica/evmisc.c" > > statement or something along the lines. I've now rebuilt 5.6.4 with the new patch; sorry for the delay. Output of dmesg attached. Steve Berman [-- Attachment #2: dmesg-5.6.4.gz --] [-- Type: application/octet-stream, Size: 37287 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-07-19 10:07 ` Stephen Berman @ 2020-08-11 11:58 ` Stephen Berman 2020-08-11 13:29 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2020-08-11 11:58 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Sun, 19 Jul 2020 12:07:14 +0200 Stephen Berman <stephen.berman@gmx.net> wrote: > On Tue, 14 Jul 2020 16:11:35 +0200 Sebastian Andrzej Siewior > <bigeasy@linutronix.de> wrote: > [...] >> Stephen, the patch attached adds a WARN_ON() statement which will >> produce a stack trace (4 or so). Could please run 'dmesg' after a while >> and send it back. There should be a >> "WARNING in drivers/acpi/acpica/evmisc.c" >> >> statement or something along the lines. > > I've now rebuilt 5.6.4 with the new patch; sorry for the delay. Output > of dmesg attached. FWIW, while you were away I got an additional datapoint: someone who saw my original post about this problem to one of the Linux From Scratch mailing lists wrote me that he had the same problem, also getting messages about the cdrom timing out ("sr 3:0:0:0" in his case); I told him about your workaround of adding 'thermal.tzp=300' to the kernel commandline, and he replied that this works for him too. And it turns out we have similar motherboards: I have a Gigabyte Z390 M Gaming Rev. 1001 board and he has Gigabyte Z390 Designare rev 1.0. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-08-11 11:58 ` Stephen Berman @ 2020-08-11 13:29 ` Sebastian Andrzej Siewior 2020-08-11 14:34 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-08-11 13:29 UTC (permalink / raw) To: Stephen Berman Cc: Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-08-11 13:58:39 [+0200], Stephen Berman wrote: > him about your workaround of adding 'thermal.tzp=300' to the kernel > commandline, and he replied that this works for him too. And it turns > out we have similar motherboards: I have a Gigabyte Z390 M Gaming > Rev. 1001 board and he has Gigabyte Z390 Designare rev 1.0. Yes. Based on latest dmesg, the ACPI tables contain code which schedules the worker and takes so long. It is possible / likely that his board contains the same tables which leads to the same effect. After all those two boards are very similar from the naming part :) Would you mind to dump the ACPI tables and send them? There might be some hints. It might be possible that a BIOS update fixes the problem but I would prefer very much to fix this in kernel to ensure that such a BIOS does not lead to this problem again. > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-08-11 13:29 ` Sebastian Andrzej Siewior @ 2020-08-11 14:34 ` Rafael J. Wysocki 2020-08-11 15:25 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Rafael J. Wysocki @ 2020-08-11 14:34 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Stephen Berman, Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Tue, Aug 11, 2020 at 3:29 PM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > On 2020-08-11 13:58:39 [+0200], Stephen Berman wrote: > > him about your workaround of adding 'thermal.tzp=300' to the kernel > > commandline, and he replied that this works for him too. And it turns > > out we have similar motherboards: I have a Gigabyte Z390 M Gaming > > Rev. 1001 board and he has Gigabyte Z390 Designare rev 1.0. > > Yes. Based on latest dmesg, the ACPI tables contain code which schedules > the worker and takes so long. It is possible / likely that his board > contains the same tables which leads to the same effect. After all those > two boards are very similar from the naming part :) > Would you mind to dump the ACPI tables and send them? There might be > some hints. Do we have a BZ for this? It would be useful to open one if not. > It might be possible that a BIOS update fixes the problem but I would > prefer very much to fix this in kernel to ensure that such a BIOS does > not lead to this problem again. I agree. It looks like one way to address this issue might be to add a rate limit for thermal notifications on a given zone. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-08-11 14:34 ` Rafael J. Wysocki @ 2020-08-11 15:25 ` Sebastian Andrzej Siewior [not found] ` <87ft8tayic.fsf@gmx.net> 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-08-11 15:25 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-08-11 16:34:09 [+0200], Rafael J. Wysocki wrote: > On Tue, Aug 11, 2020 at 3:29 PM Sebastian Andrzej Siewior > <bigeasy@linutronix.de> wrote: > > > > On 2020-08-11 13:58:39 [+0200], Stephen Berman wrote: > > > him about your workaround of adding 'thermal.tzp=300' to the kernel > > > commandline, and he replied that this works for him too. And it turns > > > out we have similar motherboards: I have a Gigabyte Z390 M Gaming > > > Rev. 1001 board and he has Gigabyte Z390 Designare rev 1.0. > > > > Yes. Based on latest dmesg, the ACPI tables contain code which schedules > > the worker and takes so long. It is possible / likely that his board > > contains the same tables which leads to the same effect. After all those > > two boards are very similar from the naming part :) > > Would you mind to dump the ACPI tables and send them? There might be > > some hints. > > Do we have a BZ for this? It would be useful to open one if not. no, it came via lkml and I looked at it since it was bisected to a workqueue commit with my signoff… Stephen, can you open a bug on https://bugzilla.kernel.org/? > > It might be possible that a BIOS update fixes the problem but I would > > prefer very much to fix this in kernel to ensure that such a BIOS does > > not lead to this problem again. > > I agree. > > It looks like one way to address this issue might be to add a rate > limit for thermal notifications on a given zone. So one thing is that ACPI says to poll every second and driver is doing it. This could be increased to something like 15 or 30 seconds as lower sane level. I don't think there is much value in polling this sensor every second. As workaound, Stephen is using `thermal.tzp=300' now. Would it make sense to flush the workqueue before checking the temperature? I have no idea what the ACPI is doing there but there is no upper limit on time how long in may take, right? Doing this inline (and avoiding the worker) is probably causing other trouble, right? Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <87ft8tayic.fsf@gmx.net>]
* Re: power-off delay/hang due to commit 6d25be57 (mainline) [not found] ` <87ft8tayic.fsf@gmx.net> @ 2020-08-11 18:49 ` Sebastian Andrzej Siewior 2020-10-06 21:49 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-08-11 18:49 UTC (permalink / raw) To: Stephen Berman Cc: Rafael J. Wysocki, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-08-11 19:22:19 [+0200], Stephen Berman wrote: > Attached. ssdt6.dsl: | ThermalZone (TZ10) | { … | Method (_TSP, 0, Serialized) // _TSP: Thermal Sampling Period | { | Return (0x0A) | } | | Method (_TZP, 0, Serialized) // _TZP: Thermal Zone Polling | { | Return (0x0A) | } | | Method (_TMP, 0, Serialized) // _TMP: Temperature | { | \GSA1.ZRC3 () | Sleep (0x03E8) | Notify (\_TZ.TZ10, 0x81) // Thermal Trip Point Change | Return (0x0B54) | } So if I read this correctly then TZ10 should be polled every second (check) and the temperature function contains a `sleep(1 second)' (which explains the mysteries delay) followed by the Notify() (which schedules the worker). Now I'm curious if Windows has the same ACPI tables and if so how they behave here. And what the actual plan here was. And where is the border when one is actual allowed to make fun of someone. So many questions. > Done as Bug 208877. Thank you. > Steve Berman Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-08-11 18:49 ` Sebastian Andrzej Siewior @ 2020-10-06 21:49 ` Sebastian Andrzej Siewior 2020-10-07 16:18 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-10-06 21:49 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Rafael J. Wysocki, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-08-11 20:49:05 [+0200], To Stephen Berman wrote: > On 2020-08-11 19:22:19 [+0200], Stephen Berman wrote: > > Attached. > > ssdt6.dsl: > | ThermalZone (TZ10) > | { > … > | Method (_TSP, 0, Serialized) // _TSP: Thermal Sampling Period > | { > | Return (0x0A) > | } > | > | Method (_TZP, 0, Serialized) // _TZP: Thermal Zone Polling > | { > | Return (0x0A) > | } > | > | Method (_TMP, 0, Serialized) // _TMP: Temperature > | { > | \GSA1.ZRC3 () > | Sleep (0x03E8) > | Notify (\_TZ.TZ10, 0x81) // Thermal Trip Point Change > | Return (0x0B54) > | } > > So if I read this correctly then TZ10 should be polled every second > (check) and the temperature function contains a `sleep(1 second)' (which > explains the mysteries delay) followed by the Notify() (which schedules > the worker). > > > Done as Bug 208877. Rafael, do you have any suggestions? Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-10-06 21:49 ` Sebastian Andrzej Siewior @ 2020-10-07 16:18 ` Rafael J. Wysocki 2020-10-26 17:20 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Rafael J. Wysocki @ 2020-10-07 16:18 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Rafael J. Wysocki Cc: Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 10/6/2020 11:49 PM, Sebastian Andrzej Siewior wrote: > On 2020-08-11 20:49:05 [+0200], To Stephen Berman wrote: >> On 2020-08-11 19:22:19 [+0200], Stephen Berman wrote: >>> Attached. >> ssdt6.dsl: >> | ThermalZone (TZ10) >> | { >> … >> | Method (_TSP, 0, Serialized) // _TSP: Thermal Sampling Period >> | { >> | Return (0x0A) >> | } >> | >> | Method (_TZP, 0, Serialized) // _TZP: Thermal Zone Polling >> | { >> | Return (0x0A) >> | } >> | >> | Method (_TMP, 0, Serialized) // _TMP: Temperature >> | { >> | \GSA1.ZRC3 () >> | Sleep (0x03E8) >> | Notify (\_TZ.TZ10, 0x81) // Thermal Trip Point Change >> | Return (0x0B54) >> | } >> >> So if I read this correctly then TZ10 should be polled every second >> (check) and the temperature function contains a `sleep(1 second)' (which >> explains the mysteries delay) followed by the Notify() (which schedules >> the worker). >> >>> Done as Bug 208877. > Rafael, do you have any suggestions? > I've lost track of this sorry. I have ideas, let me get back to this next week. Cheers! ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-10-07 16:18 ` Rafael J. Wysocki @ 2020-10-26 17:20 ` Sebastian Andrzej Siewior 2020-12-02 18:03 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-10-26 17:20 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-10-07 18:18:03 [+0200], Rafael J. Wysocki wrote: > On 10/6/2020 11:49 PM, Sebastian Andrzej Siewior wrote: > > On 2020-08-11 20:49:05 [+0200], To Stephen Berman wrote: > > > On 2020-08-11 19:22:19 [+0200], Stephen Berman wrote: > > > > Attached. > > > ssdt6.dsl: > > > | ThermalZone (TZ10) > > > | { > > > … > > > | Method (_TSP, 0, Serialized) // _TSP: Thermal Sampling Period > > > | { > > > | Return (0x0A) > > > | } > > > | > > > | Method (_TZP, 0, Serialized) // _TZP: Thermal Zone Polling > > > | { > > > | Return (0x0A) > > > | } > > > | > > > | Method (_TMP, 0, Serialized) // _TMP: Temperature > > > | { > > > | \GSA1.ZRC3 () > > > | Sleep (0x03E8) > > > | Notify (\_TZ.TZ10, 0x81) // Thermal Trip Point Change > > > | Return (0x0B54) > > > | } > > > > > > So if I read this correctly then TZ10 should be polled every second > > > (check) and the temperature function contains a `sleep(1 second)' (which > > > explains the mysteries delay) followed by the Notify() (which schedules > > > the worker). > > > > > > > Done as Bug 208877. > > Rafael, do you have any suggestions? > > I've lost track of this sorry. > > I have ideas, let me get back to this next week. :) > Cheers! Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-10-26 17:20 ` Sebastian Andrzej Siewior @ 2020-12-02 18:03 ` Sebastian Andrzej Siewior 2020-12-02 18:31 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Sebastian Andrzej Siewior @ 2020-12-02 18:03 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On 2020-10-26 18:20:59 [+0100], To Rafael J. Wysocki wrote: > > > > > Done as Bug 208877. > > > Rafael, do you have any suggestions? > > > > I've lost track of this sorry. > > > > I have ideas, let me get back to this next week. > > :) Rafael, any update? If you outline an idea or so then I may be able to form a patch out of it. Otherwise I have no idea how to fix this - other than telling the driver to not poll in smaller intervals than 30secs. > > Cheers! > Sebastian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-12-02 18:03 ` Sebastian Andrzej Siewior @ 2020-12-02 18:31 ` Rafael J. Wysocki 2020-12-02 19:13 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Rafael J. Wysocki @ 2020-12-02 18:31 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Rafael J. Wysocki, Rafael J. Wysocki, Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Wed, Dec 2, 2020 at 7:03 PM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > On 2020-10-26 18:20:59 [+0100], To Rafael J. Wysocki wrote: > > > > > > Done as Bug 208877. > > > > Rafael, do you have any suggestions? > > > > > > I've lost track of this sorry. > > > > > > I have ideas, let me get back to this next week. > > > > :) > > Rafael, any update? If you outline an idea or so then I may be able to > form a patch out of it. Otherwise I have no idea how to fix this - other > than telling the driver to not poll in smaller intervals than > 30secs. The idea, roughly speaking, is to limit the number of outstanding work items in the queue (basically, if there's a notification occurring before the previous one can be handled, there is no need to queue up another work item for it). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-12-02 18:31 ` Rafael J. Wysocki @ 2020-12-02 19:13 ` Rafael J. Wysocki 2020-12-31 20:46 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Rafael J. Wysocki @ 2020-12-02 19:13 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Rafael J. Wysocki, Rafael J. Wysocki, Stephen Berman, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA) On Wed, Dec 2, 2020 at 7:31 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Wed, Dec 2, 2020 at 7:03 PM Sebastian Andrzej Siewior > <bigeasy@linutronix.de> wrote: > > > > On 2020-10-26 18:20:59 [+0100], To Rafael J. Wysocki wrote: > > > > > > > Done as Bug 208877. > > > > > Rafael, do you have any suggestions? > > > > > > > > I've lost track of this sorry. > > > > > > > > I have ideas, let me get back to this next week. > > > > > > :) > > > > Rafael, any update? If you outline an idea or so then I may be able to > > form a patch out of it. Otherwise I have no idea how to fix this - other > > than telling the driver to not poll in smaller intervals than > > 30secs. > > The idea, roughly speaking, is to limit the number of outstanding work > items in the queue (basically, if there's a notification occurring > before the previous one can be handled, there is no need to queue up > another work item for it). That's easier said than done, though, because of the way the work item queue-up is hooked up into the ACPICA code. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-12-02 19:13 ` Rafael J. Wysocki @ 2020-12-31 20:46 ` Rafael J. Wysocki 2021-01-02 11:03 ` Rafael J. Wysocki 2021-01-04 15:38 ` Stephen Berman 0 siblings, 2 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2020-12-31 20:46 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Stephen Berman Cc: Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA), Rafael J. Wysocki On Wednesday, December 2, 2020 8:13:38 PM CET Rafael J. Wysocki wrote: > On Wed, Dec 2, 2020 at 7:31 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Wed, Dec 2, 2020 at 7:03 PM Sebastian Andrzej Siewior > > <bigeasy@linutronix.de> wrote: > > > > > > On 2020-10-26 18:20:59 [+0100], To Rafael J. Wysocki wrote: > > > > > > > > Done as Bug 208877. > > > > > > Rafael, do you have any suggestions? > > > > > > > > > > I've lost track of this sorry. > > > > > > > > > > I have ideas, let me get back to this next week. > > > > > > > > :) > > > > > > Rafael, any update? If you outline an idea or so then I may be able to > > > form a patch out of it. Otherwise I have no idea how to fix this - other > > > than telling the driver to not poll in smaller intervals than > > > 30secs. > > > > The idea, roughly speaking, is to limit the number of outstanding work > > items in the queue (basically, if there's a notification occurring > > before the previous one can be handled, there is no need to queue up > > another work item for it). > > That's easier said than done, though, because of the way the work item > queue-up is hooked up into the ACPICA code. So scratch this and it wouldn't work in general anyway AFAICS. ATM, I'm tempted to do something like the patch below (with the rationale that it shouldn't be necessary to read the temperature right after updating the trip points if polling is in use, because the next update through polling will cause it to be read anyway and it will trigger trip point actions as needed). Stephen, can you give it a go, please? --- drivers/acpi/thermal.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) Index: linux-pm/drivers/acpi/thermal.c =================================================================== --- linux-pm.orig/drivers/acpi/thermal.c +++ linux-pm/drivers/acpi/thermal.c @@ -911,24 +911,25 @@ static void acpi_thermal_notify(struct a switch (event) { case ACPI_THERMAL_NOTIFY_TEMPERATURE: acpi_thermal_check(tz); - break; + return; case ACPI_THERMAL_NOTIFY_THRESHOLDS: acpi_thermal_trips_update(tz, ACPI_TRIPS_REFRESH_THRESHOLDS); - acpi_thermal_check(tz); - acpi_bus_generate_netlink_event(device->pnp.device_class, - dev_name(&device->dev), event, 0); break; case ACPI_THERMAL_NOTIFY_DEVICES: acpi_thermal_trips_update(tz, ACPI_TRIPS_REFRESH_DEVICES); - acpi_thermal_check(tz); - acpi_bus_generate_netlink_event(device->pnp.device_class, - dev_name(&device->dev), event, 0); break; default: ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Unsupported event [0x%x]\n", event)); - break; + return; } + + /* Trigger an update of the thermal zone unless polling is in use. */ + if (!tz->polling_frequency) + acpi_thermal_check(tz); + + acpi_bus_generate_netlink_event(device->pnp.device_class, + dev_name(&device->dev), event, 0); } /* ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-12-31 20:46 ` Rafael J. Wysocki @ 2021-01-02 11:03 ` Rafael J. Wysocki 2021-01-04 15:38 ` Stephen Berman 1 sibling, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2021-01-02 11:03 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Stephen Berman Cc: Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA), Rafael J. Wysocki On Thursday, December 31, 2020 9:46:11 PM CET Rafael J. Wysocki wrote: > On Wednesday, December 2, 2020 8:13:38 PM CET Rafael J. Wysocki wrote: > > On Wed, Dec 2, 2020 at 7:31 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Wed, Dec 2, 2020 at 7:03 PM Sebastian Andrzej Siewior > > > <bigeasy@linutronix.de> wrote: > > > > > > > > On 2020-10-26 18:20:59 [+0100], To Rafael J. Wysocki wrote: > > > > > > > > > Done as Bug 208877. > > > > > > > Rafael, do you have any suggestions? > > > > > > > > > > > > I've lost track of this sorry. > > > > > > > > > > > > I have ideas, let me get back to this next week. > > > > > > > > > > :) > > > > > > > > Rafael, any update? If you outline an idea or so then I may be able to > > > > form a patch out of it. Otherwise I have no idea how to fix this - other > > > > than telling the driver to not poll in smaller intervals than > > > > 30secs. > > > > > > The idea, roughly speaking, is to limit the number of outstanding work > > > items in the queue (basically, if there's a notification occurring > > > before the previous one can be handled, there is no need to queue up > > > another work item for it). > > > > That's easier said than done, though, because of the way the work item > > queue-up is hooked up into the ACPICA code. > > So scratch this and it wouldn't work in general anyway AFAICS. > > ATM, I'm tempted to do something like the patch below (with the rationale > that it shouldn't be necessary to read the temperature right after updating > the trip points if polling is in use, because the next update through polling > will cause it to be read anyway and it will trigger trip point actions as > needed). There is one more way to address this, probably better: instead of checking the temperature right away in acpi_thermal_notify(), queue that on acpi_thermal_pm_queue and so only if another thermal check is not pending. This way there will be at most one temperature check coming from acpi_thermal_notify() queued up at any time which should prevent the build-up of work items from taking place. So something like this: --- drivers/acpi/thermal.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) Index: linux-pm/drivers/acpi/thermal.c =================================================================== --- linux-pm.orig/drivers/acpi/thermal.c +++ linux-pm/drivers/acpi/thermal.c @@ -900,6 +900,12 @@ static void acpi_thermal_unregister_ther Driver Interface -------------------------------------------------------------------------- */ +static void acpi_queue_thermal_check(struct acpi_thermal *tz) +{ + if (!work_pending(&tz->thermal_check_work)) + queue_work(acpi_thermal_pm_queue, &tz->thermal_check_work); +} + static void acpi_thermal_notify(struct acpi_device *device, u32 event) { struct acpi_thermal *tz = acpi_driver_data(device); @@ -910,17 +916,17 @@ static void acpi_thermal_notify(struct a switch (event) { case ACPI_THERMAL_NOTIFY_TEMPERATURE: - acpi_thermal_check(tz); + acpi_queue_thermal_check(tz); break; case ACPI_THERMAL_NOTIFY_THRESHOLDS: acpi_thermal_trips_update(tz, ACPI_TRIPS_REFRESH_THRESHOLDS); - acpi_thermal_check(tz); + acpi_queue_thermal_check(tz); acpi_bus_generate_netlink_event(device->pnp.device_class, dev_name(&device->dev), event, 0); break; case ACPI_THERMAL_NOTIFY_DEVICES: acpi_thermal_trips_update(tz, ACPI_TRIPS_REFRESH_DEVICES); - acpi_thermal_check(tz); + acpi_queue_thermal_check(tz); acpi_bus_generate_netlink_event(device->pnp.device_class, dev_name(&device->dev), event, 0); break; @@ -1117,7 +1123,7 @@ static int acpi_thermal_resume(struct de tz->state.active |= tz->trips.active[i].flags.enabled; } - queue_work(acpi_thermal_pm_queue, &tz->thermal_check_work); + acpi_queue_thermal_check(tz); return AE_OK; } ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2020-12-31 20:46 ` Rafael J. Wysocki 2021-01-02 11:03 ` Rafael J. Wysocki @ 2021-01-04 15:38 ` Stephen Berman 2021-01-24 13:49 ` Stephen Berman 1 sibling, 1 reply; 42+ messages in thread From: Stephen Berman @ 2021-01-04 15:38 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Sebastian Andrzej Siewior, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA), Rafael J. Wysocki On Thu, 31 Dec 2020 21:46:11 +0100 "Rafael J. Wysocki" <rjw@rjwysocki.net> wrote: > ATM, I'm tempted to do something like the patch below (with the rationale > that it shouldn't be necessary to read the temperature right after updating > the trip points if polling is in use, because the next update through polling > will cause it to be read anyway and it will trigger trip point actions as > needed). > > Stephen, can you give it a go, please? On Sat, 02 Jan 2021 12:03:17 +0100 "Rafael J. Wysocki" <rjw@rjwysocki.net> wrote: > There is one more way to address this, probably better: instead of checking the > temperature right away in acpi_thermal_notify(), queue that on acpi_thermal_pm_queue > and so only if another thermal check is not pending. > > This way there will be at most one temperature check coming from > acpi_thermal_notify() queued up at any time which should prevent the > build-up of work items from taking place. > > So something like this: Thanks for the patches. I'll try them as soon as I can. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2021-01-04 15:38 ` Stephen Berman @ 2021-01-24 13:49 ` Stephen Berman 2021-01-25 16:25 ` Rafael J. Wysocki 0 siblings, 1 reply; 42+ messages in thread From: Stephen Berman @ 2021-01-24 13:49 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Sebastian Andrzej Siewior, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA), Rafael J. Wysocki On Mon, 04 Jan 2021 16:38:43 +0100 Stephen Berman <stephen.berman@gmx.net> wrote: > On Thu, 31 Dec 2020 21:46:11 +0100 "Rafael J. Wysocki" <rjw@rjwysocki.net> wrote: > >> ATM, I'm tempted to do something like the patch below (with the rationale >> that it shouldn't be necessary to read the temperature right after updating >> the trip points if polling is in use, because the next update through polling >> will cause it to be read anyway and it will trigger trip point actions as >> needed). >> >> Stephen, can you give it a go, please? > > On Sat, 02 Jan 2021 12:03:17 +0100 "Rafael J. Wysocki" <rjw@rjwysocki.net> wrote: > >> There is one more way to address this, probably better: instead of checking the >> temperature right away in acpi_thermal_notify(), queue that on >> acpi_thermal_pm_queue >> and so only if another thermal check is not pending. >> >> This way there will be at most one temperature check coming from >> acpi_thermal_notify() queued up at any time which should prevent the >> build-up of work items from taking place. >> >> So something like this: > > Thanks for the patches. I'll try them as soon as I can. FTR, since this is the thread I started for this bug, I've confirmed in https://lore.kernel.org/lkml/87y2gi78sg.fsf@gmx.net/T/#t that the latest patch fixes the bug. Steve Berman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: power-off delay/hang due to commit 6d25be57 (mainline) 2021-01-24 13:49 ` Stephen Berman @ 2021-01-25 16:25 ` Rafael J. Wysocki 0 siblings, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2021-01-25 16:25 UTC (permalink / raw) To: Stephen Berman Cc: Rafael J. Wysocki, Sebastian Andrzej Siewior, Zhang Rui, Robert Moore, Erik Kaneda, Len Brown, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, ACPI Devel Maling List, open list:ACPI COMPONENT ARCHITECTURE (ACPICA), Rafael J. Wysocki On Sun, Jan 24, 2021 at 2:49 PM Stephen Berman <stephen.berman@gmx.net> wrote: > > On Mon, 04 Jan 2021 16:38:43 +0100 Stephen Berman <stephen.berman@gmx.net> wrote: > > > On Thu, 31 Dec 2020 21:46:11 +0100 "Rafael J. Wysocki" <rjw@rjwysocki.net> wrote: > > > >> ATM, I'm tempted to do something like the patch below (with the rationale > >> that it shouldn't be necessary to read the temperature right after updating > >> the trip points if polling is in use, because the next update through polling > >> will cause it to be read anyway and it will trigger trip point actions as > >> needed). > >> > >> Stephen, can you give it a go, please? > > > > On Sat, 02 Jan 2021 12:03:17 +0100 "Rafael J. Wysocki" <rjw@rjwysocki.net> wrote: > > > >> There is one more way to address this, probably better: instead of checking the > >> temperature right away in acpi_thermal_notify(), queue that on > >> acpi_thermal_pm_queue > >> and so only if another thermal check is not pending. > >> > >> This way there will be at most one temperature check coming from > >> acpi_thermal_notify() queued up at any time which should prevent the > >> build-up of work items from taking place. > >> > >> So something like this: > > > > Thanks for the patches. I'll try them as soon as I can. > > FTR, since this is the thread I started for this bug, I've confirmed in > https://lore.kernel.org/lkml/87y2gi78sg.fsf@gmx.net/T/#t that the latest > patch fixes the bug. OK, thanks! The patch has been applied as 5.11-rc material. ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2021-01-25 16:26 UTC | newest] Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <87v9l65d2y.fsf@gmx.net> [not found] ` <20200513220428.4nksinis2qs5dtmh@linutronix.de> [not found] ` <87mu6aurfn.fsf@gmx.net> [not found] ` <20200522164012.ynyvrjompv42jtmx@linutronix.de> [not found] ` <87y2owwo2o.fsf@rub.de> [not found] ` <20200609202339.cgy57twm2zdtjhje@linutronix.de> [not found] ` <87tuzjcovq.fsf@gmx.net> [not found] ` <20200610102514.4vdzu5u7d6vnpicn@linutronix.de> [not found] ` <87imfyh6yx.fsf@gmx.net> [not found] ` <87wo4dligz.fsf@gmx.net> 2020-06-12 11:01 ` power-off delay/hang due to commit 6d25be57 (mainline) Sebastian Andrzej Siewior 2020-06-14 12:12 ` Stephen Berman 2020-06-14 17:10 ` Sebastian Andrzej Siewior 2020-06-15 7:58 ` Stephen Berman 2020-06-15 14:51 ` Sebastian Andrzej Siewior 2020-06-15 15:41 ` Stephen Berman 2020-06-15 15:58 ` Sebastian Andrzej Siewior 2020-06-15 16:19 ` Stephen Berman 2020-06-15 16:32 ` Sebastian Andrzej Siewior 2020-06-16 7:14 ` Stephen Berman 2020-06-16 7:38 ` Sebastian Andrzej Siewior 2020-06-16 8:13 ` Stephen Berman 2020-06-16 15:55 ` Sebastian Andrzej Siewior 2020-06-16 20:28 ` Stephen Berman 2020-06-17 14:27 ` Sebastian Andrzej Siewior 2020-06-17 21:09 ` Stephen Berman 2020-06-24 20:11 ` Sebastian Andrzej Siewior 2020-06-24 21:49 ` Stephen Berman 2020-07-14 13:44 ` Sebastian Andrzej Siewior 2020-07-14 13:54 ` Rafael J. Wysocki 2020-07-14 14:11 ` Sebastian Andrzej Siewior 2020-07-14 15:53 ` Rafael J. Wysocki 2020-07-14 16:10 ` Sebastian Andrzej Siewior 2020-08-11 10:27 ` Sebastian Andrzej Siewior 2020-08-11 14:02 ` Rafael J. Wysocki 2020-07-19 10:07 ` Stephen Berman 2020-08-11 11:58 ` Stephen Berman 2020-08-11 13:29 ` Sebastian Andrzej Siewior 2020-08-11 14:34 ` Rafael J. Wysocki 2020-08-11 15:25 ` Sebastian Andrzej Siewior [not found] ` <87ft8tayic.fsf@gmx.net> 2020-08-11 18:49 ` Sebastian Andrzej Siewior 2020-10-06 21:49 ` Sebastian Andrzej Siewior 2020-10-07 16:18 ` Rafael J. Wysocki 2020-10-26 17:20 ` Sebastian Andrzej Siewior 2020-12-02 18:03 ` Sebastian Andrzej Siewior 2020-12-02 18:31 ` Rafael J. Wysocki 2020-12-02 19:13 ` Rafael J. Wysocki 2020-12-31 20:46 ` Rafael J. Wysocki 2021-01-02 11:03 ` Rafael J. Wysocki 2021-01-04 15:38 ` Stephen Berman 2021-01-24 13:49 ` Stephen Berman 2021-01-25 16:25 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).