* [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug @ 2020-08-27 6:47 Kajol Jain 2020-09-02 15:05 ` Arnaldo Carvalho de Melo 2020-10-08 12:25 ` kajoljain 0 siblings, 2 replies; 5+ messages in thread From: Kajol Jain @ 2020-08-27 6:47 UTC (permalink / raw) To: acme, peterz Cc: jolsa, linux-kernel, linux-perf-users, maddy, mingo, mark.rutland, alexander.shishkin, namhyung, daniel, brho, srikar, kjain Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") added assignment of ret value as -EAGAIN in case function call to 'smp_call_function_single' fails. For non-zero ret value, it did 'ret = !ret ? data.ret : -EAGAIN;', which always assign -EAGAIN to ret and make second if condition useless. In scenarios like when executing a perf stat with --per-thread option, and if any of the monitoring cpu goes offline, the 'smp_call_function_single' function could return -ENXIO, and with the above check, task_function_call hung and increases CPU usage (because of repeated 'smp_call_function_single()') Recration scenario: # perf stat -a --per-thread && (offline a CPU ) Patch here removes the tertiary condition added as part of that commit and added a check for NULL and -EAGAIN. Fixes: 2ed6edd33a21("perf: Add cond_resched() to task_function_call()") Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Barret Rhoden <brho@google.com> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- kernel/events/core.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Changelog: - Remove RFC tag - Resolve some nits issues like space after if and added -ENXIO in comment msg of function 'task_function_call' as suggested by Barret Rhoden. Link to the RFC: https://lkml.org/lkml/2020/8/26/896 diff --git a/kernel/events/core.c b/kernel/events/core.c index 5bfe8e3c6e44..cef646084198 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -99,7 +99,7 @@ static void remote_function(void *data) * retry due to any failures in smp_call_function_single(), such as if the * task_cpu() goes offline concurrently. * - * returns @func return value or -ESRCH when the process isn't running + * returns @func return value or -ESRCH or -ENXIO when the process isn't running */ static int task_function_call(struct task_struct *p, remote_function_f func, void *info) @@ -115,7 +115,8 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info) for (;;) { ret = smp_call_function_single(task_cpu(p), remote_function, &data, 1); - ret = !ret ? data.ret : -EAGAIN; + if (!ret) + ret = data.ret; if (ret != -EAGAIN) break; -- 2.26.2 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug 2020-08-27 6:47 [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug Kajol Jain @ 2020-09-02 15:05 ` Arnaldo Carvalho de Melo 2020-09-02 15:57 ` kajoljain 2020-10-08 12:25 ` kajoljain 1 sibling, 1 reply; 5+ messages in thread From: Arnaldo Carvalho de Melo @ 2020-09-02 15:05 UTC (permalink / raw) To: Peter Zijlstra, Kajol Jain Cc: jolsa, linux-kernel, linux-perf-users, maddy, mingo, mark.rutland, alexander.shishkin, namhyung, daniel, brho, srikar Em Thu, Aug 27, 2020 at 12:17:32PM +0530, Kajol Jain escreveu: > Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") > added assignment of ret value as -EAGAIN in case function > call to 'smp_call_function_single' fails. > For non-zero ret value, it did > 'ret = !ret ? data.ret : -EAGAIN;', which always > assign -EAGAIN to ret and make second if condition useless. > > In scenarios like when executing a perf stat with --per-thread option, and > if any of the monitoring cpu goes offline, the 'smp_call_function_single' > function could return -ENXIO, and with the above check, > task_function_call hung and increases CPU > usage (because of repeated 'smp_call_function_single()') > > Recration scenario: > # perf stat -a --per-thread && (offline a CPU ) > > Patch here removes the tertiary condition added as part of that > commit and added a check for NULL and -EAGAIN. I reproduced this issue with v5.9-rc3, now have to reboot for a conf call, will try to test the patch afterwards, [65108.467416] IRQ 165: no longer affine to CPU23 [65108.468495] smpboot: CPU 23 is now offline [65129.003879] Uhhuh. NMI received for unknown reason 1c on CPU 20. [65129.003880] Do you have a strange power saving mode enabled? [65129.003880] Dazed and confused, but trying to continue [65156.155539] Uhhuh. NMI received for unknown reason 1c on CPU 2. [65156.155539] Do you have a strange power saving mode enabled? [65156.155540] Dazed and confused, but trying to continue [65161.985284] Uhhuh. NMI received for unknown reason 1c on CPU 21. [65161.985285] Do you have a strange power saving mode enabled? [65161.985285] Dazed and confused, but trying to continue [65183.154444] Uhhuh. NMI received for unknown reason 1c on CPU 1. [65183.154445] Do you have a strange power saving mode enabled? [65183.154446] Dazed and confused, but trying to continue [65189.724797] Uhhuh. NMI received for unknown reason 0c on CPU 4. [65189.724798] Do you have a strange power saving mode enabled? [65189.724799] Dazed and confused, but trying to continue [65196.259918] Uhhuh. NMI received for unknown reason 1c on CPU 11. [65196.259918] Do you have a strange power saving mode enabled? [65196.259918] Dazed and confused, but trying to continue [65234.794490] Uhhuh. NMI received for unknown reason 0c on CPU 5. [65234.794491] Do you have a strange power saving mode enabled? [65234.794491] Dazed and confused, but trying to continue [65454.559831] Uhhuh. NMI received for unknown reason 1c on CPU 19. [65454.559832] Do you have a strange power saving mode enabled? [65454.559832] Dazed and confused, but trying to continue [65529.657789] Uhhuh. NMI received for unknown reason 1c on CPU 3. [65529.657790] Do you have a strange power saving mode enabled? [65529.657790] Dazed and confused, but trying to continue [acme@five perf]$ Things seems to be working again after bringing that CPU back online: [root@five ~]# perf top --stdio -C 0-22 Error: The sys_perf_event_open() syscall returned with 16 (Device or resource busy) for event (cycles). /bin/dmesg | grep -i perf may provide additional information. [root@five ~]# perf stat -e cycles sleep 1 Error: The sys_perf_event_open() syscall returned with 16 (Device or resource busy) for event (cycles). /bin/dmesg | grep -i perf may provide additional information. [root@five ~]# perf record -e cycles sleep 1 Error: The sys_perf_event_open() syscall returned with 16 (Device or resource busy) for event (cycles). /bin/dmesg | grep -i perf may provide additional information. [root@five ~]# echo 1 > /sys/devices/system/cpu/cpu23/online [root@five ~]# perf record -e cycles sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.039 MB perf.data (7 samples) ] [root@five ~]# perf stat -e cycles sleep 1 Performance counter stats for 'sleep 1': 842,743 cycles 1.000903853 seconds time elapsed 0.000902000 seconds user 0.000000000 seconds sys [root@five ~]# perf stat -e cycles sleep 1 - Arnaldo > Fixes: 2ed6edd33a21("perf: Add cond_resched() to task_function_call()") > Signed-off-by: Kajol Jain <kjain@linux.ibm.com> > Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > Reviewed-by: Barret Rhoden <brho@google.com> > Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > --- > kernel/events/core.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > Changelog: > - Remove RFC tag > - Resolve some nits issues like space after if and > added -ENXIO in comment msg of function 'task_function_call' > as suggested by Barret Rhoden. > > Link to the RFC: https://lkml.org/lkml/2020/8/26/896 > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 5bfe8e3c6e44..cef646084198 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -99,7 +99,7 @@ static void remote_function(void *data) > * retry due to any failures in smp_call_function_single(), such as if the > * task_cpu() goes offline concurrently. > * > - * returns @func return value or -ESRCH when the process isn't running > + * returns @func return value or -ESRCH or -ENXIO when the process isn't running > */ > static int > task_function_call(struct task_struct *p, remote_function_f func, void *info) > @@ -115,7 +115,8 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info) > for (;;) { > ret = smp_call_function_single(task_cpu(p), remote_function, > &data, 1); > - ret = !ret ? data.ret : -EAGAIN; > + if (!ret) > + ret = data.ret; > > if (ret != -EAGAIN) > break; > -- > 2.26.2 > -- - Arnaldo ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug 2020-09-02 15:05 ` Arnaldo Carvalho de Melo @ 2020-09-02 15:57 ` kajoljain 0 siblings, 0 replies; 5+ messages in thread From: kajoljain @ 2020-09-02 15:57 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Peter Zijlstra Cc: jolsa, linux-kernel, linux-perf-users, maddy, mingo, mark.rutland, alexander.shishkin, namhyung, daniel, brho, srikar On 9/2/20 8:35 PM, Arnaldo Carvalho de Melo wrote: > Em Thu, Aug 27, 2020 at 12:17:32PM +0530, Kajol Jain escreveu: >> Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") >> added assignment of ret value as -EAGAIN in case function >> call to 'smp_call_function_single' fails. >> For non-zero ret value, it did >> 'ret = !ret ? data.ret : -EAGAIN;', which always >> assign -EAGAIN to ret and make second if condition useless. >> >> In scenarios like when executing a perf stat with --per-thread option, and >> if any of the monitoring cpu goes offline, the 'smp_call_function_single' >> function could return -ENXIO, and with the above check, >> task_function_call hung and increases CPU >> usage (because of repeated 'smp_call_function_single()') >> >> Recration scenario: >> # perf stat -a --per-thread && (offline a CPU ) >> >> Patch here removes the tertiary condition added as part of that >> commit and added a check for NULL and -EAGAIN. > > I reproduced this issue with v5.9-rc3, now have to reboot for a conf > call, will try to test the patch afterwards, > > [65108.467416] IRQ 165: no longer affine to CPU23 > [65108.468495] smpboot: CPU 23 is now offline > [65129.003879] Uhhuh. NMI received for unknown reason 1c on CPU 20. > [65129.003880] Do you have a strange power saving mode enabled? > [65129.003880] Dazed and confused, but trying to continue > [65156.155539] Uhhuh. NMI received for unknown reason 1c on CPU 2. > [65156.155539] Do you have a strange power saving mode enabled? > [65156.155540] Dazed and confused, but trying to continue > [65161.985284] Uhhuh. NMI received for unknown reason 1c on CPU 21. > [65161.985285] Do you have a strange power saving mode enabled? > [65161.985285] Dazed and confused, but trying to continue > [65183.154444] Uhhuh. NMI received for unknown reason 1c on CPU 1. > [65183.154445] Do you have a strange power saving mode enabled? > [65183.154446] Dazed and confused, but trying to continue > [65189.724797] Uhhuh. NMI received for unknown reason 0c on CPU 4. > [65189.724798] Do you have a strange power saving mode enabled? > [65189.724799] Dazed and confused, but trying to continue > [65196.259918] Uhhuh. NMI received for unknown reason 1c on CPU 11. > [65196.259918] Do you have a strange power saving mode enabled? > [65196.259918] Dazed and confused, but trying to continue > [65234.794490] Uhhuh. NMI received for unknown reason 0c on CPU 5. > [65234.794491] Do you have a strange power saving mode enabled? > [65234.794491] Dazed and confused, but trying to continue > [65454.559831] Uhhuh. NMI received for unknown reason 1c on CPU 19. > [65454.559832] Do you have a strange power saving mode enabled? > [65454.559832] Dazed and confused, but trying to continue > [65529.657789] Uhhuh. NMI received for unknown reason 1c on CPU 3. > [65529.657790] Do you have a strange power saving mode enabled? > [65529.657790] Dazed and confused, but trying to continue > [acme@five perf]$ > > > Things seems to be working again after bringing that CPU back online: Hi Arnaldo, You are right, once we bring back the CPU again, things will start working as our 'smp_call_function_single' will not fail and we will come out of the loop. But till then, task_function_call will be hung. Thanks, Kajol Jain > > [root@five ~]# perf top --stdio -C 0-22 > Error: > The sys_perf_event_open() syscall returned with 16 (Device or resource busy) for event (cycles). > /bin/dmesg | grep -i perf may provide additional information. > > [root@five ~]# perf stat -e cycles sleep 1 > Error: > The sys_perf_event_open() syscall returned with 16 (Device or resource busy) for event (cycles). > /bin/dmesg | grep -i perf may provide additional information. > > [root@five ~]# perf record -e cycles sleep 1 > Error: > The sys_perf_event_open() syscall returned with 16 (Device or resource busy) for event (cycles). > /bin/dmesg | grep -i perf may provide additional information. > > [root@five ~]# echo 1 > /sys/devices/system/cpu/cpu23/online > [root@five ~]# perf record -e cycles sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.039 MB perf.data (7 samples) ] > [root@five ~]# perf stat -e cycles sleep 1 > > Performance counter stats for 'sleep 1': > > 842,743 cycles > > 1.000903853 seconds time elapsed > > 0.000902000 seconds user > 0.000000000 seconds sys > > > [root@five ~]# perf stat -e cycles sleep 1 > > > - Arnaldo > > >> Fixes: 2ed6edd33a21("perf: Add cond_resched() to task_function_call()") >> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> >> Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> >> Reviewed-by: Barret Rhoden <brho@google.com> >> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> >> --- >> kernel/events/core.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> Changelog: >> - Remove RFC tag >> - Resolve some nits issues like space after if and >> added -ENXIO in comment msg of function 'task_function_call' >> as suggested by Barret Rhoden. >> >> Link to the RFC: https://lkml.org/lkml/2020/8/26/896 >> >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 5bfe8e3c6e44..cef646084198 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -99,7 +99,7 @@ static void remote_function(void *data) >> * retry due to any failures in smp_call_function_single(), such as if the >> * task_cpu() goes offline concurrently. >> * >> - * returns @func return value or -ESRCH when the process isn't running >> + * returns @func return value or -ESRCH or -ENXIO when the process isn't running >> */ >> static int >> task_function_call(struct task_struct *p, remote_function_f func, void *info) >> @@ -115,7 +115,8 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info) >> for (;;) { >> ret = smp_call_function_single(task_cpu(p), remote_function, >> &data, 1); >> - ret = !ret ? data.ret : -EAGAIN; >> + if (!ret) >> + ret = data.ret; >> >> if (ret != -EAGAIN) >> break; >> -- >> 2.26.2 >> > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug 2020-08-27 6:47 [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug Kajol Jain 2020-09-02 15:05 ` Arnaldo Carvalho de Melo @ 2020-10-08 12:25 ` kajoljain 2020-10-08 13:15 ` Peter Zijlstra 1 sibling, 1 reply; 5+ messages in thread From: kajoljain @ 2020-10-08 12:25 UTC (permalink / raw) To: acme, peterz Cc: jolsa, linux-kernel, linux-perf-users, maddy, mingo, mark.rutland, alexander.shishkin, namhyung, daniel, brho, srikar On 8/27/20 12:17 PM, Kajol Jain wrote: > Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") > added assignment of ret value as -EAGAIN in case function > call to 'smp_call_function_single' fails. > For non-zero ret value, it did > 'ret = !ret ? data.ret : -EAGAIN;', which always > assign -EAGAIN to ret and make second if condition useless. > > In scenarios like when executing a perf stat with --per-thread option, and > if any of the monitoring cpu goes offline, the 'smp_call_function_single' > function could return -ENXIO, and with the above check, > task_function_call hung and increases CPU > usage (because of repeated 'smp_call_function_single()') > > Recration scenario: > # perf stat -a --per-thread && (offline a CPU ) > > Patch here removes the tertiary condition added as part of that > commit and added a check for NULL and -EAGAIN. Hi Peter, Please let me know if you have any comment on this patch. Thanks, Kajol Jain > > Fixes: 2ed6edd33a21("perf: Add cond_resched() to task_function_call()") > Signed-off-by: Kajol Jain <kjain@linux.ibm.com> > Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > Reviewed-by: Barret Rhoden <brho@google.com> > Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > --- > kernel/events/core.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > Changelog: > - Remove RFC tag > - Resolve some nits issues like space after if and > added -ENXIO in comment msg of function 'task_function_call' > as suggested by Barret Rhoden. > > Link to the RFC: https://lkml.org/lkml/2020/8/26/896 > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 5bfe8e3c6e44..cef646084198 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -99,7 +99,7 @@ static void remote_function(void *data) > * retry due to any failures in smp_call_function_single(), such as if the > * task_cpu() goes offline concurrently. > * > - * returns @func return value or -ESRCH when the process isn't running > + * returns @func return value or -ESRCH or -ENXIO when the process isn't running > */ > static int > task_function_call(struct task_struct *p, remote_function_f func, void *info) > @@ -115,7 +115,8 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info) > for (;;) { > ret = smp_call_function_single(task_cpu(p), remote_function, > &data, 1); > - ret = !ret ? data.ret : -EAGAIN; > + if (!ret) > + ret = data.ret; > > if (ret != -EAGAIN) > break; > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug 2020-10-08 12:25 ` kajoljain @ 2020-10-08 13:15 ` Peter Zijlstra 0 siblings, 0 replies; 5+ messages in thread From: Peter Zijlstra @ 2020-10-08 13:15 UTC (permalink / raw) To: kajoljain Cc: acme, jolsa, linux-kernel, linux-perf-users, maddy, mingo, mark.rutland, alexander.shishkin, namhyung, daniel, brho, srikar On Thu, Oct 08, 2020 at 05:55:35PM +0530, kajoljain wrote: > > > On 8/27/20 12:17 PM, Kajol Jain wrote: > > Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") > > added assignment of ret value as -EAGAIN in case function > > call to 'smp_call_function_single' fails. > > For non-zero ret value, it did > > 'ret = !ret ? data.ret : -EAGAIN;', which always > > assign -EAGAIN to ret and make second if condition useless. > > > > In scenarios like when executing a perf stat with --per-thread option, and > > if any of the monitoring cpu goes offline, the 'smp_call_function_single' > > function could return -ENXIO, and with the above check, > > task_function_call hung and increases CPU > > usage (because of repeated 'smp_call_function_single()') > > > > Recration scenario: > > # perf stat -a --per-thread && (offline a CPU ) > > > > Patch here removes the tertiary condition added as part of that > > commit and added a check for NULL and -EAGAIN. > > Hi Peter, > Please let me know if you have any comment on this patch. Yes, sorry. I've got it now. Thanks! --- Subject: perf: Fix task_function_call() error handling From: Kajol Jain <kjain@linux.ibm.com> Date: Thu, 27 Aug 2020 12:17:32 +0530 From: Kajol Jain <kjain@linux.ibm.com> The error handling introduced by commit: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") looses any return value from smp_call_function_single() that is not {0, -EINVAL}. This is a problem because it will return -EXNIO when the target CPU is offline. Worse, in that case it'll turn into an infinite loop. Fixes: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Barret Rhoden <brho@google.com> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/20200827064732.20860-1-kjain@linux.ibm.com --- kernel/events/core.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -99,7 +99,7 @@ static void remote_function(void *data) * retry due to any failures in smp_call_function_single(), such as if the * task_cpu() goes offline concurrently. * - * returns @func return value or -ESRCH when the process isn't running + * returns @func return value or -ESRCH or -ENXIO when the process isn't running */ static int task_function_call(struct task_struct *p, remote_function_f func, void *info) @@ -115,7 +115,8 @@ task_function_call(struct task_struct *p for (;;) { ret = smp_call_function_single(task_cpu(p), remote_function, &data, 1); - ret = !ret ? data.ret : -EAGAIN; + if (!ret) + ret = data.ret; if (ret != -EAGAIN) break; ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-10-08 13:15 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-08-27 6:47 [PATCH] perf/core: Fix hung issue on perf stat command during cpu hotplug Kajol Jain 2020-09-02 15:05 ` Arnaldo Carvalho de Melo 2020-09-02 15:57 ` kajoljain 2020-10-08 12:25 ` kajoljain 2020-10-08 13:15 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).