* NOHZ: local_softirq_pending
@ 2009-06-14 15:28 Iratxo Pichel Ortiz
2009-06-14 19:50 ` Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-14 15:28 UTC (permalink / raw)
To: linux-rt-users; +Cc: Iratxo Pichel Ortiz
Hi there,
I hope that someone in this list could help me trace down this issue I
am seeing.
I am using 2.6.30-rt16 RT Kernel on a IXP425 system. The system is
running an experimental comm stack that has been working quite well with
2.6.26-rt2 for a very long time (> 100days).
When the system boots, every 180s aprox, I see this traces in the dmesg:
42949422.930000] NOHZ: local_softirq_pending 02
[42949422.940000] NOHZ: local_softirq_pending 02
[42949422.950000] NOHZ: local_softirq_pending 02
[42949422.960000] ISR HIHG LATENCY 30073
[42949596.920000] NOHZ: local_softirq_pending 02
[42949596.930000] NOHZ: local_softirq_pending 02
[42949596.940000] NOHZ: local_softirq_pending 02
[42949596.950000] NOHZ: local_softirq_pending 02
[42949596.960000] ISR HIHG LATENCY 40066
The "ISR HIGH LATENCY" is small latency measurement that I have done in
the ISR function of the driver. Hw is ticking the system every 2.5ms and
the usual run-time of the ISR handler is well under 200us. Every aprox
180seconds, suddenly this run-time get as high as 50ms.
It is very highly probable that the problem is inside my driver, because
if I dont load the modules, this "NOHZ..." trace is not shown. Anyway, I
would like to know if this "issue" is triggered by some known cause, or
at least any idea where to start looking at.
Please, CC me as I am not subscribed to the list.
Thank you very much in advance and best regards!
Iratxo.
Another run:
[ 96.950000] NOHZ: local_softirq_pending 02
[ 96.960000] NOHZ: local_softirq_pending 02
[ 96.970000] NOHZ: local_softirq_pending 02
[ 97.000000] ISR HIGH LATENCY 50068
[ 270.000000] ISR HIGH LATENCY 50048
[ 463.000000] ISR HIGH LATENCY 50049
[ 643.000000] ISR HIGH LATENCY 50047
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-14 15:28 NOHZ: local_softirq_pending Iratxo Pichel Ortiz
@ 2009-06-14 19:50 ` Iratxo Pichel Ortiz
2009-06-15 13:19 ` Thomas Gleixner
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-14 19:50 UTC (permalink / raw)
To: Iratxo Pichel Ortiz; +Cc: linux-rt-users
I have some more info about this issue, just in case someone can lead
me to the light. I have changed my .config file, disabling hi-prec
timers and setting HZ to 1000. The problem seems to be gone until I
get a big load to the box. When I ping the board from my PC with a lot
of traffic, the hi-latency messages start again.
Thanks in advance!
Iratxo.
On Sun, Jun 14, 2009 at 5:28 PM, Iratxo Pichel
Ortiz<ipichel@albentia.com> wrote:
> Hi there,
>
> I hope that someone in this list could help me trace down this issue I am
> seeing.
>
> I am using 2.6.30-rt16 RT Kernel on a IXP425 system. The system is running
> an experimental comm stack that has been working quite well with 2.6.26-rt2
> for a very long time (> 100days).
>
> When the system boots, every 180s aprox, I see this traces in the dmesg:
> 42949422.930000] NOHZ: local_softirq_pending 02
> [42949422.940000] NOHZ: local_softirq_pending 02
> [42949422.950000] NOHZ: local_softirq_pending 02
> [42949422.960000] ISR HIHG LATENCY 30073
> [42949596.920000] NOHZ: local_softirq_pending 02
> [42949596.930000] NOHZ: local_softirq_pending 02
> [42949596.940000] NOHZ: local_softirq_pending 02
> [42949596.950000] NOHZ: local_softirq_pending 02
> [42949596.960000] ISR HIHG LATENCY 40066
>
> The "ISR HIGH LATENCY" is small latency measurement that I have done in the
> ISR function of the driver. Hw is ticking the system every 2.5ms and the
> usual run-time of the ISR handler is well under 200us. Every aprox
> 180seconds, suddenly this run-time get as high as 50ms.
>
> It is very highly probable that the problem is inside my driver, because if
> I dont load the modules, this "NOHZ..." trace is not shown. Anyway, I would
> like to know if this "issue" is triggered by some known cause, or at least
> any idea where to start looking at.
>
> Please, CC me as I am not subscribed to the list.
>
> Thank you very much in advance and best regards!
>
> Iratxo.
>
>
> Another run:
> [ 96.950000] NOHZ: local_softirq_pending 02
> [ 96.960000] NOHZ: local_softirq_pending 02
> [ 96.970000] NOHZ: local_softirq_pending 02
> [ 97.000000] ISR HIGH LATENCY 50068
> [ 270.000000] ISR HIGH LATENCY 50048
> [ 463.000000] ISR HIGH LATENCY 50049
> [ 643.000000] ISR HIGH LATENCY 50047
>
> --
> Iratxo Pichel Ortiz
> Software Development Manager
>
> Albentia Systems S.A.
> http://www.albentia.com
>
> Tel: +34 914400567
> Cel: +34 663808405
> Fax: +34 914400569
>
> C\Margarita Salas 22
> Parque Tecnológico de Leganés
> Leganés (28918)
> Madrid
> Spain
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-14 19:50 ` Iratxo Pichel Ortiz
@ 2009-06-15 13:19 ` Thomas Gleixner
2009-06-15 14:38 ` Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-15 13:19 UTC (permalink / raw)
To: Iratxo Pichel Ortiz; +Cc: Iratxo Pichel Ortiz, linux-rt-users
Iratxo,
please do not top post.
On Sun, 14 Jun 2009, Iratxo Pichel Ortiz wrote:
> > I am using 2.6.30-rt16 RT Kernel on a IXP425 system. The system is running
2.6.29-rt16, right ?
> > an experimental comm stack that has been working quite well with 2.6.26-rt2
> > for a very long time (> 100days).
> >
> > When the system boots, every 180s aprox, I see this traces in the dmesg:
> > 42949422.930000] NOHZ: local_softirq_pending 02
Hmm, so something is marking the timer softirq pending, but does not
wake the timer softirq.
> > [42949422.940000] NOHZ: local_softirq_pending 02
> > [42949422.950000] NOHZ: local_softirq_pending 02
> > [42949422.960000] ISR HIHG LATENCY 30073
> > [42949596.920000] NOHZ: local_softirq_pending 02
> > [42949596.930000] NOHZ: local_softirq_pending 02
> > [42949596.940000] NOHZ: local_softirq_pending 02
> > [42949596.950000] NOHZ: local_softirq_pending 02
So this repeats itself roughly every 3min and stays that way for a
couple of jiffies. Does your code fiddle with timers ?
> > [42949596.960000] ISR HIHG LATENCY 40066
The latency is exaclty the number of NOHZ messages * 10ms. I assume
you have HZ=100. Looks like your code is waiting for a timer, which
does not fire due to the local_softirq_pending 02 problem.
Does it work when you disable CONFIG_NOHZ ?
Thanks,
tglx
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-15 13:19 ` Thomas Gleixner
@ 2009-06-15 14:38 ` Iratxo Pichel Ortiz
[not found] ` <8e6b7a710906160101x6a8ae9d5qa7638627f513278@mail.gmail.com>
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-15 14:38 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: Iratxo Pichel Ortiz, linux-rt-users, Noelia Morón
Thomas,
Thomas Gleixner wrote:
> Iratxo,
>
> please do not top post.
>
Thanks for you answer and sorry for my top-posting.
> 2.6.29-rt16, right ?
>
I have also tried rt18, but as I had thought seeing change-logs, the
behavior is exactly the same.
>
> Hmm, so something is marking the timer softirq pending, but does not
> wake the timer softirq.
>
>
>>> [42949422.940000] NOHZ: local_softirq_pending 02
>>> [42949422.950000] NOHZ: local_softirq_pending 02
>>> [42949422.960000] ISR HIHG LATENCY 30073
>>> [42949596.920000] NOHZ: local_softirq_pending 02
>>> [42949596.930000] NOHZ: local_softirq_pending 02
>>> [42949596.940000] NOHZ: local_softirq_pending 02
>>> [42949596.950000] NOHZ: local_softirq_pending 02
>>>
>
> So this repeats itself roughly every 3min and stays that way for a
> couple of jiffies. Does your code fiddle with timers ?
>
>
There are some timers in the system, but I believe that those are quite
out of the main path, but there could be some problem with delayed
works, as they use also timers there.
And, well, I have managed to avoid this 180s dependency, and make it
happen only under very heavy load. To do do this I have changed HZ from
100 to 1000. Once I do this the problem arises only when the load, for
example heavy pinging the ethernet interface.
>>> [42949596.960000] ISR HIHG LATENCY 40066
>>>
>
> The latency is exaclty the number of NOHZ messages * 10ms. I assume
> you have HZ=100. Looks like your code is waiting for a timer, which
> does not fire due to the local_softirq_pending 02 problem.
>
You are right, I hadn't noticed this. This pattern also repeats with HZ
1000, (see dmesg excerpt below), but I only see the first 10 NOHZ
because of the ratelimit in that function. Is this right? At least the
time in the printks summed all-together sum up the aprox 50msec delay.
In this case is not the ISR that gets delayed, but a RT task, it depends
on the moment.
[4294750.246000] NOHZ: local_softirq_pending 06
[4294750.246000] NOHZ: local_softirq_pending 06
[4294750.247000] NOHZ: local_softirq_pending 06
[4294750.248000] NOHZ: local_softirq_pending 06
[4294750.249000] NOHZ: local_softirq_pending 06
[4294750.250000] NOHZ: local_softirq_pending 06
[4294750.251000] NOHZ: local_softirq_pending 06
[4294750.252000] NOHZ: local_softirq_pending 06
[4294750.253000] NOHZ: local_softirq_pending 06
[4294750.254000] NOHZ: local_softirq_pending 06
[4294750.296000] SCHED Max time reached 51368
[4294750.296000] SCHED Hi time reached 51368
Do you know what could be causing this issue. I have managed to repeat
this traces (NOHZ...) without using my code, using a workqueue and in
the work just by doing something like:
work_func() {
mdelay(10);
msleep(10);
queue_work(myqueue, mywork);
}
And then by heavy loading the box from the outside.
> Does it work when you disable CONFIG_NOHZ ?
>
I will try this and let the list know.
> Thanks,
>
> tglx
>
>
Thanks a really lot.
BR,
Iratxo.
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending - More info without need for specific code.
[not found] ` <8e6b7a710906160101x6a8ae9d5qa7638627f513278@mail.gmail.com>
@ 2009-06-16 8:26 ` Iratxo Pichel Ortiz
[not found] ` <4A376450.5020209@albentia.com>
1 sibling, 0 replies; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16 8:26 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Iratxo Pichel Ortiz, linux-rt-users, Noelia Morón,
'Rodrigo Partearroyo'
Thomas,
Iratxo Pichel Ortiz wrote:
>>> 2.6.29-rt16, right ?
>>>
>> I have also tried rt18, but as I had thought seeing change-logs, the
>> behavior is exactly the same.
>>
You were right here :) It was 29.
>> Do you know what could be causing this issue. I have managed to repeat this
>> traces (NOHZ...) without using my code, using a workqueue and in the work
>> just by doing something like:
>>
>> work_func() {
>> mdelay(10);
>> msleep(10);
>>
>> queue_work(myqueue, mywork);
>> }
>>
>> And then by heavy loading the box from the outside.
>>
I have written a very small module that causes the
"local_softirq_pending" under not some load. Please find code at the end
of this email. Here is pasted some traces of dmesg (I have increased the
ratelimit of the "NOHZ: local..." trace to 250.
The only strange thing here is that I am calling "set_workqueue_prio" (I
have hacked source to export this symbol), and I am starting to think
that this could not be a good idea. Any hints about this?
[ 648.954000] NOHZ: local_softirq_pending
0e
[ 648.955000] NOHZ: local_softirq_pending
0e
[ 648.956000] NOHZ: local_softirq_pending
0e
>>> Does it work when you disable CONFIG_NOHZ ?
>>>
Still pending to test.
>> I will try this and let the list know.
>>
>>> Thanks,
>>>
>>> tglx
>>>
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
#include <linux/delay.h>
#define LATENCY_PRIO 98
static struct workqueue_struct* wq;
static struct work_struct w;
void LatencyzerWork(struct work_struct* work)
{
static int count;
/* Busy and normal waiting */
mdelay(10);
msleep(1);
/* Requeue work */
if(count++ < 400000)
queue_work(wq, &w);
else
printk("LAT: Count reached %d, stopped latencying\n", count);
}
int __init LatencyzerInit(void)
{
/* Prepare workqueue */
printk("LAT: Initializing latencyzer\n");
if(!(wq = create_singlethread_workqueue("LATENCY"))) {
printk("LAT: Error creating task\n");
return -1;
}
/* Change workqueue prio */
set_workqueue_prio(wq, SCHED_FIFO, LATENCY_PRIO, 1);
printk("LAT: Task prio set to %u\n", LATENCY_PRIO);
/* Run first iteration */
INIT_WORK(&w, LatencyzerWork);
queue_work(wq, &w);
return 0;
}
void __exit LatencyzerExit(void)
{
printk("Latencyzer exiting...\n");
/* Done here */
cancel_work_sync(&w);
destroy_workqueue(wq);
return;
}
module_init(LatencyzerInit);
module_exit(LatencyzerExit);
MODULE_LICENSE("GPL");
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending - More info without need for specific code.
[not found] ` <4A376450.5020209@albentia.com>
@ 2009-06-16 9:26 ` Iratxo Pichel Ortiz
2009-06-16 14:45 ` NOHZ: local_softirq_pending Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16 9:26 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-rt-users, Iratxo Pichel Ortiz, Noelia Morón,
'Rodrigo Partearroyo'
Thomas,
More info below, I hope it helps.
Iratxo Pichel Ortiz wrote:
> Thomas,
>
> Iratxo Pichel Ortiz wrote:
>>>> 2.6.29-rt16, right ?
>>>>
>>> I have also tried rt18, but as I had thought seeing change-logs, the
>>> behavior is exactly the same.
>>>
> You were right here :) It was 29.
>>> Do you know what could be causing this issue. I have managed to
>>> repeat this
>>> traces (NOHZ...) without using my code, using a workqueue and in the
>>> work
>>> just by doing something like:
>>>
>>> work_func() {
>>> mdelay(10);
>>> msleep(10);
>>>
>>> queue_work(myqueue, mywork);
>>> }
>>>
>>> And then by heavy loading the box from the outside.
>>>
> I have written a very small module that causes the
> "local_softirq_pending" under not some load. Please find code at the end
> of this email. Here is pasted some traces of dmesg (I have increased the
> ratelimit of the "NOHZ: local..." trace to 250.
>
> The only strange thing here is that I am calling "set_workqueue_prio" (I
> have hacked source to export this symbol), and I am starting to think
> that this could not be a good idea. Any hints about this?
>
> [ 648.954000] NOHZ: local_softirq_pending
> 0e
>
> [ 648.955000] NOHZ: local_softirq_pending
> 0e
>
> [ 648.956000] NOHZ: local_softirq_pending
> 0e
I have changed the implementation of the module test to use kthreads
instead of workqueues. The behavior is exactly the same. I have tried
with prios from 1 to 99. Please find the code below as before. I have
also atached the differente softirqs codes that had been pending in some
of the tests.
>
>
>>>> Does it work when you disable CONFIG_NOHZ ?
>>>>
> Still pending to test.
>>> I will try this and let the list know.
>>>
>>>> Thanks,
>>>>
>>>> tglx
>>>>
Thanks a lot again!
Iratxo.
---- DMESG ----
[4294833.294000] NOHZ: local_softirq_pending
06
[4294833.295000] NOHZ: local_softirq_pending
06
[4294834.246000] NOHZ: local_softirq_pending
02
[4294834.247000] NOHZ: local_softirq_pending
02
[4294834.248000] NOHZ: local_softirq_pending 02
---- Test module code ----
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>
#define LATENCY_PRIO 1
struct task_struct *t;
int LatencyzerWork(void* data)
{
static int count;
while(1) {
/* Busy and normal waiting */
mdelay(10);
msleep(1);
/* Requeue work */
if(count++ > 400000)
break;
}
printk("LAT: Count reached %d, stopped latencying\n", count);
return 0;
}
int __init LatencyzerInit(void)
{
struct sched_param schedp;
/* Prepare Thread */
printk("LAT: Initializing latencyzer\n");
if(!(t = kthread_create(LatencyzerWork, NULL, "LATENCY"))) {
printk("LAT: Error creating thread\n");
return -1;
}
/* Change task prio */
schedp.sched_priority = LATENCY_PRIO;
sched_setscheduler(t, SCHED_FIFO, &schedp);
printk("LAT: Task prio set to %u\n", LATENCY_PRIO);
/* Run thread */
wake_up_process(t);
return 0;
}
void __exit LatencyzerExit(void)
{
printk("LAT: Latencyzer exiting...");
/* Done here */
kthread_stop(t);
printk(" DONE\n");
return;
}
module_init(LatencyzerInit);
module_exit(LatencyzerExit);
MODULE_LICENSE("GPL");
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-16 9:26 ` Iratxo Pichel Ortiz
@ 2009-06-16 14:45 ` Iratxo Pichel Ortiz
2009-06-16 18:39 ` Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16 14:45 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo',
Iratxo Pichel Ortiz
Please find below more detailed infor regarding this "NOHZ:
local_softirq_pending".
Iratxo Pichel Ortiz wrote:
> Thomas,
>
> More info below, I hope it helps.
>
> Iratxo Pichel Ortiz wrote:
>> Thomas,
>>
>> Iratxo Pichel Ortiz wrote:
>>
>>>> Do you know what could be causing this issue. I have managed to
>>>> repeat this
>>>> traces (NOHZ...) without using my code, using a workqueue and in
>>>> the work
>>>> just by doing something like:
>>>>
>>>> work_func() {
>>>> mdelay(10);
>>>> msleep(10);
>>>>
>>>> queue_work(myqueue, mywork);
>>>> }
>>>>
>>>> And then by heavy loading the box from the outside.
>>>>
>> I have written a very small module that causes the
>> "local_softirq_pending" under not some load. Please find code at the end
>> of this email. Here is pasted some traces of dmesg (I have increased the
>> ratelimit of the "NOHZ: local..." trace to 250.
>>
>> The only strange thing here is that I am calling "set_workqueue_prio" (I
>> have hacked source to export this symbol), and I am starting to think
>> that this could not be a good idea. Any hints about this?
>>
>> [ 648.954000] NOHZ: local_softirq_pending
>> 0e
>>
>> [ 648.955000] NOHZ: local_softirq_pending
>> 0e
>>
>> [ 648.956000] NOHZ: local_softirq_pending
>> 0e
> I have changed the implementation of the module test to use kthreads
> instead of workqueues. The behavior is exactly the same. I have tried
> with prios from 1 to 99. Please find the code below as before. I have
> also atached the differente softirqs codes that had been pending in some
> of the tests.
I have even tried this without any system-loader module. Just by booting
the kernel and pinging the box very heavily, there are a lot of NOHZ...
traces in dmesg. Indeed they follow a very strange pattern that I cannot
match without any part of the kernel. The pattern is the following (NOHZ
and HZ=1000):
[4294715.247000] NOHZ: local_softirq_pending
06
[4294715.248000] NOHZ: local_softirq_pending
06
[4294715.249000] NOHZ: local_softirq_pending
06
... It repeats every jiffy ...
[4294715.290000] NOHZ: local_softirq_pending
06
[4294715.291000] NOHZ: local_softirq_pending
06
[4294715.292000] NOHZ: local_softirq_pending 06
And then back again at some seconds later:
[4294723.246000] NOHZ: local_softirq_pending
0e
[4294723.246000] NOHZ: local_softirq_pending
0e
[4294723.247000] NOHZ: local_softirq_pending
0e
... It repeats every jiffy ...
[4294723.293000] NOHZ: local_softirq_pending
10e
[4294723.294000] NOHZ: local_softirq_pending
10e
[4294723.295000] NOHZ: local_softirq_pending 10e
The patter always starts about X.246/X.248 and always stops at X.295, so
I believe that this points to some subsystem, but dont know where to
look for.
So it seems that my "RT" tasks is delayed, as you said in your original
mail, when the 02 SIRQ is delayed, but the rest of the time is correctly
running. This problem appears to be a Kernel or RT patch issue, so
please let me know which tests would you like me to do, I have a couple
of boxes here and some time to build and test kernels. Alternatively, if
you would like me to look at any part of the system, let me know and I
will try my best.
>>
>>
>>>>> Does it work when you disable CONFIG_NOHZ ?
>>>>>
>> Still pending to test.
>>>> I will try this and let the list know.
>>>>
>>>>> Thanks,
>>>>>
>>>>> tglx
>>>>>
Thanks,
Iratxo.
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-16 14:45 ` NOHZ: local_softirq_pending Iratxo Pichel Ortiz
@ 2009-06-16 18:39 ` Iratxo Pichel Ortiz
2009-06-17 22:03 ` Thomas Gleixner
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16 18:39 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
Thomas,
Please find below experiment results of disabling CONFIG_NOHZ.
Iratxo Pichel Ortiz wrote:
> Please find below more detailed infor regarding this "NOHZ:
> local_softirq_pending".
>
> Iratxo Pichel Ortiz wrote:
>> Thomas,
>>
>> More info below, I hope it helps.
>>
>> Iratxo Pichel Ortiz wrote:
>>> Thomas,
>>>
>>> Iratxo Pichel Ortiz wrote:
>>>
>>>>> Do you know what could be causing this issue. I have managed to
>>>>> repeat this
>>>>> traces (NOHZ...) without using my code, using a workqueue and in
>>>>> the work
>>>>> just by doing something like:
>>>>>
>>>>> work_func() {
>>>>> mdelay(10);
>>>>> msleep(10);
>>>>>
>>>>> queue_work(myqueue, mywork);
>>>>> }
>>>>>
>>>>> And then by heavy loading the box from the outside.
>>>>>
>>> I have written a very small module that causes the
>>> "local_softirq_pending" under not some load. Please find code at the
>>> end
>>> of this email. Here is pasted some traces of dmesg (I have increased
>>> the
>>> ratelimit of the "NOHZ: local..." trace to 250.
>>>
>>> The only strange thing here is that I am calling
>>> "set_workqueue_prio" (I
>>> have hacked source to export this symbol), and I am starting to think
>>> that this could not be a good idea. Any hints about this?
>>>
>>> [ 648.954000] NOHZ: local_softirq_pending
>>> 0e
>>>
>>> [ 648.955000] NOHZ: local_softirq_pending
>>> 0e
>>>
>>> [ 648.956000] NOHZ: local_softirq_pending
>>> 0e
>> I have changed the implementation of the module test to use kthreads
>> instead of workqueues. The behavior is exactly the same. I have tried
>> with prios from 1 to 99. Please find the code below as before. I have
>> also atached the differente softirqs codes that had been pending in some
>> of the tests.
> I have even tried this without any system-loader module. Just by
> booting the kernel and pinging the box very heavily, there are a lot
> of NOHZ... traces in dmesg. Indeed they follow a very strange pattern
> that I cannot match without any part of the kernel. The pattern is the
> following (NOHZ and HZ=1000):
> [...]
> So it seems that my "RT" tasks is delayed, as you said in your
> original mail, when the 02 SIRQ is delayed, but the rest of the time
> is correctly running. This problem appears to be a Kernel or RT patch
> issue, so please let me know which tests would you like me to do, I
> have a couple of boxes here and some time to build and test kernels.
> Alternatively, if you would like me to look at any part of the system,
> let me know and I will try my best.
>
>>>
>>>
>>>>>> Does it work when you disable CONFIG_NOHZ ?
>>>>>>
>>> Still pending to test.
I have tried disabling the CONFIG_NOHZ kernel option. Of course the
trace is gone, but the weird behavior is still there. When I run my
software without load from the network, the main task of the system
experiences runtimes of about 700us. When I load the system, there are
latencies of 50700us, so the 50ms delay is again there, and again the
time when the task finishes is always X.296, 1 jiffy after the "NOHZ:
pending..." was shown with CONFIG_NOHZ enabled.
Please let me know what tests would you like me to do.
>>>>> I will try this and let the list know.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> tglx
>>>>>>
> Thanks,
>
> Iratxo.
>
Thanks,
Iratxo.
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-16 18:39 ` Iratxo Pichel Ortiz
@ 2009-06-17 22:03 ` Thomas Gleixner
2009-06-17 22:35 ` Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-17 22:03 UTC (permalink / raw)
To: Iratxo Pichel Ortiz
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
On Tue, 16 Jun 2009, Iratxo Pichel Ortiz wrote:
> I have tried disabling the CONFIG_NOHZ kernel option. Of course the trace is
> gone, but the weird behavior is still there. When I run my software without
> load from the network, the main task of the system experiences runtimes of
> about 700us. When I load the system, there are latencies of 50700us, so the
> 50ms delay is again there, and again the time when the task finishes is always
> X.296, 1 jiffy after the "NOHZ: pending..." was shown with CONFIG_NOHZ
> enabled.
What kind of system is this ?
Can you please check whether you can reproduce with NOHZ=y and function
tracer enabled ?
That would be great.
Thanks,
tglx
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-17 22:03 ` Thomas Gleixner
@ 2009-06-17 22:35 ` Iratxo Pichel Ortiz
2009-06-17 23:46 ` Thomas Gleixner
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-17 22:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
Thomas Gleixner wrote:
> On Tue, 16 Jun 2009, Iratxo Pichel Ortiz wrote:
>
>> I have tried disabling the CONFIG_NOHZ kernel option. Of course the trace is
>> gone, but the weird behavior is still there. When I run my software without
>> load from the network, the main task of the system experiences runtimes of
>> about 700us. When I load the system, there are latencies of 50700us, so the
>> 50ms delay is again there, and again the time when the task finishes is always
>> X.296, 1 jiffy after the "NOHZ: pending..." was shown with CONFIG_NOHZ
>> enabled.
>>
>
> What kind of system is this ?
>
It is an IXP425 custom board, mainly equivalent to the Intel's IXDP425
development board, so it is ARM (xscale) architecture.
The problem is perfectly repeatable without the need to have other code
loaded in the system than the standard kernel, but I first detected it
with a custom development I am working on, that run a RT task (prio 98)
every 2.5ms to schedule a radio link.
To detect the problem, with my attached .config, I just need to
massively ping the ethernet interface and I start to see "NOHZ:
pending..." every second or couple of seconds from jiffie XXX950 to
XXX999 (HZ 1000). The SOFTIRQ pendings that I have seen pending are
almost all of them: RCU, NETTX, NETRX, TIMER...
Having tested this I first thought that it could be related with
ixp4xx_eth driver but I used a different interface and driver and the
problem is also there. I have also, at least apparently, discarded the
bridging code and vlan code, that I was using at first.
> Can you please check whether you can reproduce with NOHZ=y and function
> tracer enabled ?
>
>
I have already tested with NOHZ, and the behavior is exactly the same,
there is a delay of about 50ms that always finishes at trace with
timestamp XXX.296 seconds buf, of course, the "NOHZ: ..." trace is not
shown.
I have not tried this with the function tracer enabled, I will build the
kernel again and enable this tracer. Which function should I trace? Any
hint about this? I have a couple of systems running so I can make any
trace you would like. I will try to figure out something and give
feedback later.
I have also tested the system with the latest 2.29.5-rt21 kernel and RT
patch.
Thanks a lot,
Iratxo.
> That would be great.
>
> Thanks,
>
> tglx
>
>
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-17 22:35 ` Iratxo Pichel Ortiz
@ 2009-06-17 23:46 ` Thomas Gleixner
2009-06-18 0:33 ` Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-17 23:46 UTC (permalink / raw)
To: Iratxo Pichel Ortiz
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
On Thu, 18 Jun 2009, Iratxo Pichel Ortiz wrote:
> I have not tried this with the function tracer enabled, I will build the
> kernel again and enable this tracer. Which function should I trace? Any hint
> about this? I have a couple of systems running so I can make any trace you
> would like. I will try to figure out something and give feedback later.
The point is not which function to trace. I don't know. The point is
whether the problem is reproducible with function tracer and nohz
enabled.
Please apply following patch and enable NOHZ and function tracer in
.config. After bootup enable the function tracer in debugfs and start
your load test.
Once the NOHZ.... message appears the function tracer is frozen and
you can read the trace from mountpoint_of_debugfs/tracing/trace.
Please bzip2 it and upload it somewhere as it's probably too large for
the mailinglist.
Thanks,
tglx
-------------->
Subject: debug-nohz-issue.patch
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 18 Jun 2009 01:41:01 +0200
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
kernel/time/tick-sched.c | 2 ++
1 file changed, 2 insertions(+)
Index: linux-2.6-tip/kernel/time/tick-sched.c
===================================================================
--- linux-2.6-tip.orig/kernel/time/tick-sched.c
+++ linux-2.6-tip/kernel/time/tick-sched.c
@@ -250,6 +250,8 @@ void tick_nohz_stop_sched_tick(int inidl
if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
static int ratelimit;
+ tracing_off();
+
if (ratelimit < 10) {
printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
local_softirq_pending());
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-17 23:46 ` Thomas Gleixner
@ 2009-06-18 0:33 ` Iratxo Pichel Ortiz
2009-06-18 8:57 ` Thomas Gleixner
0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-18 0:33 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
Thomas Gleixner wrote:
> Please bzip2 it and upload it somewhere as it's probably too large for
> the mailinglist.
>
Please find 3 realizations at:
http://www.proyectoradio.com/webdocs/rtkernel/trace.bz2
http://www.proyectoradio.com/webdocs/rtkernel/trace2.bz2
http://www.proyectoradio.com/webdocs/rtkernel/trace3.bz2
Just in case they get lost sometime in the future, at the end of this
email is the result of "tail -30 trace3", hopefully it is enough of that
file.
Having read the three of them, regardless of the last process before the
"NOHZ...", there is always a call to the FPE emulator. I am going to do
some tests changing the FPE emulator, in case it helps.
Thanks a lot,
Iratxo.
> Thanks,
>
> tglx
> -------------->
> Subject: debug-nohz-issue.patch
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 18 Jun 2009 01:41:01 +0200
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> kernel/time/tick-sched.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux-2.6-tip/kernel/time/tick-sched.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/time/tick-sched.c
> +++ linux-2.6-tip/kernel/time/tick-sched.c
> @@ -250,6 +250,8 @@ void tick_nohz_stop_sched_tick(int inidl
> if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> static int ratelimit;
>
> + tracing_off();
> +
> if (ratelimit < 10) {
> printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
> local_softirq_pending());
>
>
sh-1162 [000] 4154504576.502616: n_tty_set_room <-n_tty_read
sh-1162 [000] 4154504576.502616: schedule_timeout
<-n_tty_read
sh-1162 [000] 4154504576.502616: schedule <-schedule_timeout
sh-1162 [000] 4154504576.502616: __schedule <-schedule
sh-1162 [000] 4154504576.502616: rcu_qsctr_inc <-__schedule
sh-1162 [000] 4154504576.502616: deactivate_task <-__schedule
sh-1162 [000] 4154504576.502616: dequeue_task
<-deactivate_task
sh-1162 [000] 4154504576.502616: update_avg <-dequeue_task
sh-1162 [000] 4154504576.502616: dequeue_task_fair
<-dequeue_task
sh-1162 [000] 4154504576.502616: update_curr
<-dequeue_task_fair
sh-1162 [000] 4154504576.502616: clear_buddies
<-dequeue_task_fair
sh-1162 [000] 4154504576.502616: __clear_buddies
<-clear_buddies
sh-1162 [000] 4154504576.502616: update_min_vruntime
<-dequeue_task_fair
sh-1162 [000] 4154504576.502616: put_prev_task_fair
<-__schedule
sh-1162 [000] 4154504576.502616: check_spread
<-put_prev_task_fair
sh-1162 [000] 4154504576.502616: pick_next_task_rt
<-__schedule
sh-1162 [000] 4154504576.502616: pick_next_task_fair
<-__schedule
sh-1162 [000] 4154504576.502616: pick_next_task_idle
<-__schedule
sh-1162 [000] 4154504576.502616: calc_load_account_active
<-pick_next_task_idle
sh-1162 [000] 4154504576.502616: __rcu_read_lock
<-__atomic_notifier_call_chain
sh-1162 [000] 4154504576.502616: nwfpe_notify
<-notifier_call_chain
sh-1162 [000] 4154504576.502616: dsp_do <-notifier_call_chain
sh-1162 [000] 4154504576.502616: __rcu_read_unlock
<-__atomic_notifier_call_chain
<idle>-0 [000] 4154504576.502616: finish_task_switch
<-__schedule
<idle>-0 [000] 4154504576.502616:
tick_nohz_stop_sched_tick <-cpu_idle
<idle>-0 [000] 4154504576.502616: ktime_get
<-tick_nohz_stop_sched_tick
<idle>-0 [000] 4154504576.502616: ktime_get_ts <-ktime_get
<idle>-0 [000] 4154504576.502616: getnstimeofday
<-ktime_get_ts
<idle>-0 [000] 4154504576.502616: ixp4xx_get_cycles
<-getnstimeofday
<idle>-0 [000] 4154504576.502616: set_normalized_timespec
<-ktime_get_ts
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-18 0:33 ` Iratxo Pichel Ortiz
@ 2009-06-18 8:57 ` Thomas Gleixner
2009-06-18 16:52 ` Iratxo Pichel Ortiz
0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-18 8:57 UTC (permalink / raw)
To: Iratxo Pichel Ortiz
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
Iratxo,
On Thu, 18 Jun 2009, Iratxo Pichel Ortiz wrote:
> Having read the three of them, regardless of the last process before the
> "NOHZ...", there is always a call to the FPE emulator. I am going to do some
> tests changing the FPE emulator, in case it helps.
No, what happens is that the -RT starvation protector kicks in. That's
a mechanism which checks whether the CPU is monopolized by rt tasks
for longer than 950ms. It then blocks execution of rt tasks for
50ms. The silly thing is that it goes idle, but that's a different
problem.
It's default on and can be disabled via:
echo 0 > /proc/sys/kernel/rt_runtime_us
Sorry that I did not notice that earlier, but when that triggers, then
something is really wrong. There should never be a situation where -rt
tasks (neither irq threads nor softirqs) should monopolize the CPU.
Thanks,
tglx
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: NOHZ: local_softirq_pending
2009-06-18 8:57 ` Thomas Gleixner
@ 2009-06-18 16:52 ` Iratxo Pichel Ortiz
0 siblings, 0 replies; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-18 16:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'
Thomas,
Thomas Gleixner wrote:
> No, what happens is that the -RT starvation protector kicks in. That's
> a mechanism which checks whether the CPU is monopolized by rt tasks
> for longer than 950ms. It then blocks execution of rt tasks for
> 50ms. The silly thing is that it goes idle, but that's a different
> problem.
>
> It's default on and can be disabled via:
>
> echo 0 > /proc/sys/kernel/rt_runtime_us
>
> Sorry that I did not notice that earlier, but when that triggers, then
> something is really wrong. There should never be a situation where -rt
> tasks (neither irq threads nor softirqs) should monopolize the CPU.
>
No problem at all! Please let me know if the following assumption is
right: This protection mechanism is dependent on the available CPU
power. I mean, if the system has low CPU power, and if, for example, the
network has very very high load, the total time taken to process all the
data will be long. Generally much longer than in more powerful systems,
so this protection arises. Is this right or it should not happen by
design of the softirqs/RT tasks?
Anyway, seems to be correctly working. I still have to check why with
HZ=100 it was periodically happening this events even with low load, but
that could be very probably related to my software.
Again, thank you very much.
Best regards,
Iratxo.
> Thanks,
>
> tglx
>
>
--
Iratxo Pichel Ortiz
Software Development Manager
Albentia Systems S.A.
http://www.albentia.com
Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569
C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2009-06-18 16:52 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-14 15:28 NOHZ: local_softirq_pending Iratxo Pichel Ortiz
2009-06-14 19:50 ` Iratxo Pichel Ortiz
2009-06-15 13:19 ` Thomas Gleixner
2009-06-15 14:38 ` Iratxo Pichel Ortiz
[not found] ` <8e6b7a710906160101x6a8ae9d5qa7638627f513278@mail.gmail.com>
2009-06-16 8:26 ` NOHZ: local_softirq_pending - More info without need for specific code Iratxo Pichel Ortiz
[not found] ` <4A376450.5020209@albentia.com>
2009-06-16 9:26 ` Iratxo Pichel Ortiz
2009-06-16 14:45 ` NOHZ: local_softirq_pending Iratxo Pichel Ortiz
2009-06-16 18:39 ` Iratxo Pichel Ortiz
2009-06-17 22:03 ` Thomas Gleixner
2009-06-17 22:35 ` Iratxo Pichel Ortiz
2009-06-17 23:46 ` Thomas Gleixner
2009-06-18 0:33 ` Iratxo Pichel Ortiz
2009-06-18 8:57 ` Thomas Gleixner
2009-06-18 16:52 ` Iratxo Pichel Ortiz
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.