All of lore.kernel.org
 help / color / mirror / Atom feed
* NOHZ: local_softirq_pending
@ 2009-06-14 15:28 Iratxo Pichel Ortiz
  2009-06-14 19:50 ` Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-14 15:28 UTC (permalink / raw)
  To: linux-rt-users; +Cc: Iratxo Pichel Ortiz

Hi there,

I hope that someone in this list could help me trace down this issue I 
am seeing.

I am using 2.6.30-rt16 RT Kernel on a IXP425 system. The system is 
running an experimental comm stack that has been working quite well with 
2.6.26-rt2 for a very long time (> 100days).

When the system boots, every 180s aprox, I see this traces in the dmesg:
42949422.930000] NOHZ: local_softirq_pending 02
[42949422.940000] NOHZ: local_softirq_pending 02
[42949422.950000] NOHZ: local_softirq_pending 02
[42949422.960000] ISR HIHG LATENCY 30073
[42949596.920000] NOHZ: local_softirq_pending 02
[42949596.930000] NOHZ: local_softirq_pending 02
[42949596.940000] NOHZ: local_softirq_pending 02
[42949596.950000] NOHZ: local_softirq_pending 02
[42949596.960000] ISR HIHG LATENCY 40066

The "ISR HIGH LATENCY" is small latency measurement that I have done in 
the ISR function of the driver. Hw is ticking the system every 2.5ms and 
the usual run-time of the ISR handler is well under 200us. Every aprox 
180seconds, suddenly this run-time get as high as 50ms.

It is very highly probable that the problem is inside my driver, because 
if I dont load the modules, this "NOHZ..." trace is not shown. Anyway, I 
would like to know if this "issue" is triggered by some known cause, or 
at least any idea where to start looking at.

Please, CC me as I am not subscribed to the list.

Thank you very much in advance and best regards!

    Iratxo.


Another run:
[   96.950000] NOHZ: local_softirq_pending 02
[   96.960000] NOHZ: local_softirq_pending 02
[   96.970000] NOHZ: local_softirq_pending 02
[   97.000000] ISR HIGH LATENCY 50068
[  270.000000] ISR HIGH LATENCY 50048
[  463.000000] ISR HIGH LATENCY 50049
[  643.000000] ISR HIGH LATENCY 50047

-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-14 15:28 NOHZ: local_softirq_pending Iratxo Pichel Ortiz
@ 2009-06-14 19:50 ` Iratxo Pichel Ortiz
  2009-06-15 13:19   ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-14 19:50 UTC (permalink / raw)
  To: Iratxo Pichel Ortiz; +Cc: linux-rt-users

I have some more info about this issue, just in case someone can lead
me to the light. I have changed my .config file, disabling hi-prec
timers and setting HZ to 1000. The problem seems to be gone until I
get a big load to the box. When I ping the board from my PC with a lot
of traffic, the hi-latency messages start again.

Thanks in advance!

    Iratxo.

On Sun, Jun 14, 2009 at 5:28 PM, Iratxo Pichel
Ortiz<ipichel@albentia.com> wrote:
> Hi there,
>
> I hope that someone in this list could help me trace down this issue I am
> seeing.
>
> I am using 2.6.30-rt16 RT Kernel on a IXP425 system. The system is running
> an experimental comm stack that has been working quite well with 2.6.26-rt2
> for a very long time (> 100days).
>
> When the system boots, every 180s aprox, I see this traces in the dmesg:
> 42949422.930000] NOHZ: local_softirq_pending 02
> [42949422.940000] NOHZ: local_softirq_pending 02
> [42949422.950000] NOHZ: local_softirq_pending 02
> [42949422.960000] ISR HIHG LATENCY 30073
> [42949596.920000] NOHZ: local_softirq_pending 02
> [42949596.930000] NOHZ: local_softirq_pending 02
> [42949596.940000] NOHZ: local_softirq_pending 02
> [42949596.950000] NOHZ: local_softirq_pending 02
> [42949596.960000] ISR HIHG LATENCY 40066
>
> The "ISR HIGH LATENCY" is small latency measurement that I have done in the
> ISR function of the driver. Hw is ticking the system every 2.5ms and the
> usual run-time of the ISR handler is well under 200us. Every aprox
> 180seconds, suddenly this run-time get as high as 50ms.
>
> It is very highly probable that the problem is inside my driver, because if
> I dont load the modules, this "NOHZ..." trace is not shown. Anyway, I would
> like to know if this "issue" is triggered by some known cause, or at least
> any idea where to start looking at.
>
> Please, CC me as I am not subscribed to the list.
>
> Thank you very much in advance and best regards!
>
>   Iratxo.
>
>
> Another run:
> [   96.950000] NOHZ: local_softirq_pending 02
> [   96.960000] NOHZ: local_softirq_pending 02
> [   96.970000] NOHZ: local_softirq_pending 02
> [   97.000000] ISR HIGH LATENCY 50068
> [  270.000000] ISR HIGH LATENCY 50048
> [  463.000000] ISR HIGH LATENCY 50049
> [  643.000000] ISR HIGH LATENCY 50047
>
> --
> Iratxo Pichel Ortiz
> Software Development Manager
>
> Albentia Systems S.A.
> http://www.albentia.com
>
> Tel: +34 914400567
> Cel: +34 663808405
> Fax: +34 914400569
>
> C\Margarita Salas 22
> Parque Tecnológico de Leganés
> Leganés (28918)
> Madrid
> Spain
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-14 19:50 ` Iratxo Pichel Ortiz
@ 2009-06-15 13:19   ` Thomas Gleixner
  2009-06-15 14:38     ` Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-15 13:19 UTC (permalink / raw)
  To: Iratxo Pichel Ortiz; +Cc: Iratxo Pichel Ortiz, linux-rt-users

Iratxo,

please do not top post.

On Sun, 14 Jun 2009, Iratxo Pichel Ortiz wrote:

> > I am using 2.6.30-rt16 RT Kernel on a IXP425 system. The system is running

  2.6.29-rt16, right ?

> > an experimental comm stack that has been working quite well with 2.6.26-rt2
> > for a very long time (> 100days).
> >
> > When the system boots, every 180s aprox, I see this traces in the dmesg:
> > 42949422.930000] NOHZ: local_softirq_pending 02

Hmm, so something is marking the timer softirq pending, but does not
wake the timer softirq.

> > [42949422.940000] NOHZ: local_softirq_pending 02
> > [42949422.950000] NOHZ: local_softirq_pending 02
> > [42949422.960000] ISR HIHG LATENCY 30073
> > [42949596.920000] NOHZ: local_softirq_pending 02
> > [42949596.930000] NOHZ: local_softirq_pending 02
> > [42949596.940000] NOHZ: local_softirq_pending 02
> > [42949596.950000] NOHZ: local_softirq_pending 02

So this repeats itself roughly every 3min and stays that way for a
couple of jiffies. Does your code fiddle with timers ?

> > [42949596.960000] ISR HIHG LATENCY 40066

The latency is exaclty the number of NOHZ messages * 10ms. I assume
you have HZ=100. Looks like your code is waiting for a timer, which
does not fire due to the local_softirq_pending 02 problem.

Does it work when you disable CONFIG_NOHZ ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-15 13:19   ` Thomas Gleixner
@ 2009-06-15 14:38     ` Iratxo Pichel Ortiz
       [not found]       ` <8e6b7a710906160101x6a8ae9d5qa7638627f513278@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-15 14:38 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Iratxo Pichel Ortiz, linux-rt-users, Noelia Morón

Thomas,

Thomas Gleixner wrote:
> Iratxo,
>
> please do not top post.
>   
Thanks for you answer and sorry for my top-posting.
>   2.6.29-rt16, right ?
>   
I have also tried rt18, but as I had thought seeing change-logs, the 
behavior is exactly the same.
>
> Hmm, so something is marking the timer softirq pending, but does not
> wake the timer softirq.
>
>   
>>> [42949422.940000] NOHZ: local_softirq_pending 02
>>> [42949422.950000] NOHZ: local_softirq_pending 02
>>> [42949422.960000] ISR HIHG LATENCY 30073
>>> [42949596.920000] NOHZ: local_softirq_pending 02
>>> [42949596.930000] NOHZ: local_softirq_pending 02
>>> [42949596.940000] NOHZ: local_softirq_pending 02
>>> [42949596.950000] NOHZ: local_softirq_pending 02
>>>       
>
> So this repeats itself roughly every 3min and stays that way for a
> couple of jiffies. Does your code fiddle with timers ?
>
>   

There are some timers in the system, but I believe that those are quite 
out of the main path, but there could be some problem with delayed 
works, as they use also timers there.

And, well, I have managed to avoid this 180s dependency, and make it 
happen only under very heavy load. To do do this I have changed HZ from 
100 to 1000. Once I do this the problem arises only when the load, for 
example heavy pinging the ethernet interface.
>>> [42949596.960000] ISR HIHG LATENCY 40066
>>>       
>
> The latency is exaclty the number of NOHZ messages * 10ms. I assume
> you have HZ=100. Looks like your code is waiting for a timer, which
> does not fire due to the local_softirq_pending 02 problem.
>   
You are right, I hadn't noticed this. This pattern also repeats with HZ 
1000, (see dmesg excerpt below), but I only see the first 10 NOHZ 
because of the ratelimit in that function. Is this right? At least the 
time in the printks summed all-together sum up the aprox 50msec delay. 
In this case is not the ISR that gets delayed, but a RT task, it depends 
on the moment.

[4294750.246000] NOHZ: local_softirq_pending 06
[4294750.246000] NOHZ: local_softirq_pending 06
[4294750.247000] NOHZ: local_softirq_pending 06
[4294750.248000] NOHZ: local_softirq_pending 06
[4294750.249000] NOHZ: local_softirq_pending 06
[4294750.250000] NOHZ: local_softirq_pending 06
[4294750.251000] NOHZ: local_softirq_pending 06
[4294750.252000] NOHZ: local_softirq_pending 06
[4294750.253000] NOHZ: local_softirq_pending 06
[4294750.254000] NOHZ: local_softirq_pending 06
[4294750.296000] SCHED Max time reached 51368
[4294750.296000] SCHED Hi time reached 51368

Do you know what could be causing this issue. I have managed to repeat 
this traces (NOHZ...) without using my code, using a workqueue and in 
the work just by doing something like:

work_func() {
    mdelay(10);
    msleep(10);

   queue_work(myqueue, mywork);
}

And then by heavy loading the box from the outside.
> Does it work when you disable CONFIG_NOHZ ?
>   
I will try this and let the list know.
> Thanks,
>
> 	tglx
>
>   
Thanks a really lot.

BR,

    Iratxo.


-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending - More info without need for specific code.
       [not found]       ` <8e6b7a710906160101x6a8ae9d5qa7638627f513278@mail.gmail.com>
@ 2009-06-16  8:26         ` Iratxo Pichel Ortiz
       [not found]         ` <4A376450.5020209@albentia.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16  8:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Iratxo Pichel Ortiz, linux-rt-users, Noelia Morón,
	'Rodrigo Partearroyo'

Thomas,

Iratxo Pichel Ortiz wrote:
>>>  2.6.29-rt16, right ?
>>>       
>> I have also tried rt18, but as I had thought seeing change-logs, the
>> behavior is exactly the same.
>>     
You were right here :) It was 29.
>> Do you know what could be causing this issue. I have managed to repeat this
>> traces (NOHZ...) without using my code, using a workqueue and in the work
>> just by doing something like:
>>
>> work_func() {
>>   mdelay(10);
>>   msleep(10);
>>
>>  queue_work(myqueue, mywork);
>> }
>>
>> And then by heavy loading the box from the outside.
>>     
I have written a very small module that causes the 
"local_softirq_pending" under not some load. Please find code at the end 
of this email. Here is pasted some traces of dmesg (I have increased the 
ratelimit of the "NOHZ: local..." trace to 250.

The only strange thing here is that I am calling "set_workqueue_prio" (I 
have hacked source to export this symbol), and I am starting to think 
that this could not be a good idea. Any hints about this?

[  648.954000] NOHZ: local_softirq_pending 
0e                                                                                                                    

[  648.955000] NOHZ: local_softirq_pending 
0e                                                                                                                    

[  648.956000] NOHZ: local_softirq_pending 
0e                                                                                                                    

   
>>> Does it work when you disable CONFIG_NOHZ ?
>>>       
Still pending to test.
>> I will try this and let the list know.
>>     
>>> Thanks,
>>>
>>>        tglx
>>>       
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
#include <linux/delay.h>

#define LATENCY_PRIO        98

static struct workqueue_struct* wq;
static struct work_struct        w;

void LatencyzerWork(struct work_struct* work)
{
    static int count;

    /* Busy and normal waiting */
    mdelay(10);
    msleep(1);

    /* Requeue work */
    if(count++ < 400000)
        queue_work(wq, &w);
    else
        printk("LAT: Count reached %d, stopped latencying\n", count);
}

int __init LatencyzerInit(void)
{
    /* Prepare workqueue */
    printk("LAT: Initializing latencyzer\n");
    if(!(wq = create_singlethread_workqueue("LATENCY"))) {
        printk("LAT: Error creating task\n");
        return -1;
    }

    /* Change workqueue prio */
    set_workqueue_prio(wq, SCHED_FIFO, LATENCY_PRIO, 1);
    printk("LAT: Task prio set to %u\n", LATENCY_PRIO);

    /* Run first iteration */
    INIT_WORK(&w, LatencyzerWork);
    queue_work(wq, &w);
    return 0;
}

void __exit LatencyzerExit(void)
{
    printk("Latencyzer exiting...\n");

    /* Done here */
    cancel_work_sync(&w);
    destroy_workqueue(wq);

    return;
}

module_init(LatencyzerInit);
module_exit(LatencyzerExit);
MODULE_LICENSE("GPL");

-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending - More info without need for specific code.
       [not found]         ` <4A376450.5020209@albentia.com>
@ 2009-06-16  9:26           ` Iratxo Pichel Ortiz
  2009-06-16 14:45             ` NOHZ: local_softirq_pending Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16  9:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-rt-users, Iratxo Pichel Ortiz, Noelia Morón,
	'Rodrigo Partearroyo'

Thomas,

More info below, I hope it helps.

Iratxo Pichel Ortiz wrote:
> Thomas,
>
> Iratxo Pichel Ortiz wrote:
>>>>  2.6.29-rt16, right ?
>>>>       
>>> I have also tried rt18, but as I had thought seeing change-logs, the
>>> behavior is exactly the same.
>>>     
> You were right here :) It was 29.
>>> Do you know what could be causing this issue. I have managed to 
>>> repeat this
>>> traces (NOHZ...) without using my code, using a workqueue and in the 
>>> work
>>> just by doing something like:
>>>
>>> work_func() {
>>>   mdelay(10);
>>>   msleep(10);
>>>
>>>  queue_work(myqueue, mywork);
>>> }
>>>
>>> And then by heavy loading the box from the outside.
>>>     
> I have written a very small module that causes the
> "local_softirq_pending" under not some load. Please find code at the end
> of this email. Here is pasted some traces of dmesg (I have increased the
> ratelimit of the "NOHZ: local..." trace to 250.
>
> The only strange thing here is that I am calling "set_workqueue_prio" (I
> have hacked source to export this symbol), and I am starting to think
> that this could not be a good idea. Any hints about this?
>
> [  648.954000] NOHZ: local_softirq_pending
> 0e
>
> [  648.955000] NOHZ: local_softirq_pending
> 0e
>
> [  648.956000] NOHZ: local_softirq_pending
> 0e
I have changed the implementation of the module test to use kthreads
instead of workqueues. The behavior is exactly the same. I have tried
with prios from 1 to 99. Please find the code below as before. I have
also atached the differente softirqs codes that had been pending in some
of the tests.
>
>
>>>> Does it work when you disable CONFIG_NOHZ ?
>>>>       
> Still pending to test.
>>> I will try this and let the list know.
>>>    
>>>> Thanks,
>>>>
>>>>        tglx
>>>>     
Thanks a lot again!

    Iratxo.

---- DMESG ----
[4294833.294000] NOHZ: local_softirq_pending
06
[4294833.295000] NOHZ: local_softirq_pending
06
[4294834.246000] NOHZ: local_softirq_pending
02
[4294834.247000] NOHZ: local_softirq_pending
02
[4294834.248000] NOHZ: local_softirq_pending 02

---- Test module code ----

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>

#define LATENCY_PRIO    1

struct task_struct *t;

int LatencyzerWork(void* data)
{
    static int count;

    while(1) {
        /* Busy and normal waiting */
        mdelay(10);
        msleep(1);

    /* Requeue work */
    if(count++ > 400000)
        break;
    }

    printk("LAT: Count reached %d, stopped latencying\n", count);
    return 0;
}

int __init LatencyzerInit(void)
{
    struct sched_param schedp;

    /* Prepare Thread */
    printk("LAT: Initializing latencyzer\n");
    if(!(t = kthread_create(LatencyzerWork, NULL, "LATENCY"))) {
        printk("LAT: Error creating thread\n");
        return -1;
    }

    /* Change task prio */
    schedp.sched_priority = LATENCY_PRIO;
    sched_setscheduler(t, SCHED_FIFO, &schedp);
    printk("LAT: Task prio set to %u\n", LATENCY_PRIO);

    /* Run thread */
    wake_up_process(t);
    return 0;
}

void __exit LatencyzerExit(void)
{
    printk("LAT: Latencyzer exiting...");

    /* Done here */
    kthread_stop(t);
    printk(" DONE\n");

    return;
}

module_init(LatencyzerInit);
module_exit(LatencyzerExit);
MODULE_LICENSE("GPL");


-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain


--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-16  9:26           ` Iratxo Pichel Ortiz
@ 2009-06-16 14:45             ` Iratxo Pichel Ortiz
  2009-06-16 18:39               ` Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16 14:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo',
	Iratxo Pichel Ortiz

Please find below more detailed infor regarding this "NOHZ: 
local_softirq_pending".

Iratxo Pichel Ortiz wrote:
> Thomas,
>
> More info below, I hope it helps.
>
> Iratxo Pichel Ortiz wrote:
>> Thomas,
>>
>> Iratxo Pichel Ortiz wrote:
>>
>>>> Do you know what could be causing this issue. I have managed to 
>>>> repeat this
>>>> traces (NOHZ...) without using my code, using a workqueue and in 
>>>> the work
>>>> just by doing something like:
>>>>
>>>> work_func() {
>>>>   mdelay(10);
>>>>   msleep(10);
>>>>
>>>>  queue_work(myqueue, mywork);
>>>> }
>>>>
>>>> And then by heavy loading the box from the outside.
>>>>     
>> I have written a very small module that causes the
>> "local_softirq_pending" under not some load. Please find code at the end
>> of this email. Here is pasted some traces of dmesg (I have increased the
>> ratelimit of the "NOHZ: local..." trace to 250.
>>
>> The only strange thing here is that I am calling "set_workqueue_prio" (I
>> have hacked source to export this symbol), and I am starting to think
>> that this could not be a good idea. Any hints about this?
>>
>> [  648.954000] NOHZ: local_softirq_pending
>> 0e
>>
>> [  648.955000] NOHZ: local_softirq_pending
>> 0e
>>
>> [  648.956000] NOHZ: local_softirq_pending
>> 0e
> I have changed the implementation of the module test to use kthreads
> instead of workqueues. The behavior is exactly the same. I have tried
> with prios from 1 to 99. Please find the code below as before. I have
> also atached the differente softirqs codes that had been pending in some
> of the tests.
I have even tried this without any system-loader module. Just by booting 
the kernel and pinging the box very heavily, there are a lot of NOHZ... 
traces in dmesg. Indeed they follow a very strange pattern that I cannot 
match without any part of the kernel. The pattern is the following (NOHZ 
and HZ=1000):
[4294715.247000] NOHZ: local_softirq_pending 
06                                              
[4294715.248000] NOHZ: local_softirq_pending 
06                                              
[4294715.249000] NOHZ: local_softirq_pending 
06                                              
... It repeats every jiffy ...                                      
[4294715.290000] NOHZ: local_softirq_pending 
06                                              
[4294715.291000] NOHZ: local_softirq_pending 
06                                              
[4294715.292000] NOHZ: local_softirq_pending 06      

And then back again at some seconds later:
[4294723.246000] NOHZ: local_softirq_pending 
0e                                              
[4294723.246000] NOHZ: local_softirq_pending 
0e                                              
[4294723.247000] NOHZ: local_softirq_pending 
0e                                              
... It repeats every jiffy ... 
                                                             
[4294723.293000] NOHZ: local_softirq_pending 
10e                                             
[4294723.294000] NOHZ: local_softirq_pending 
10e                                             
[4294723.295000] NOHZ: local_softirq_pending 10e       

The patter always starts about X.246/X.248 and always stops at X.295, so 
I believe that this points to some subsystem, but dont know where to 
look for.

So it seems that my "RT" tasks is delayed, as you said in your original 
mail, when the 02 SIRQ is delayed, but the rest of the time is correctly 
running. This problem appears to be a Kernel or RT patch issue, so 
please let me know which tests would you like me to do, I have a couple 
of boxes here and some time to build and test kernels. Alternatively, if 
you would like me to look at any part of the system, let me know and I 
will try my best.

>>
>>
>>>>> Does it work when you disable CONFIG_NOHZ ?
>>>>>       
>> Still pending to test.
>>>> I will try this and let the list know.
>>>>   
>>>>> Thanks,
>>>>>
>>>>>        tglx
>>>>>     
Thanks,

    Iratxo.

-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-16 14:45             ` NOHZ: local_softirq_pending Iratxo Pichel Ortiz
@ 2009-06-16 18:39               ` Iratxo Pichel Ortiz
  2009-06-17 22:03                 ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-16 18:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

Thomas,

Please find below experiment results of disabling CONFIG_NOHZ.

Iratxo Pichel Ortiz wrote:
> Please find below more detailed infor regarding this "NOHZ: 
> local_softirq_pending".
>
> Iratxo Pichel Ortiz wrote:
>> Thomas,
>>
>> More info below, I hope it helps.
>>
>> Iratxo Pichel Ortiz wrote:
>>> Thomas,
>>>
>>> Iratxo Pichel Ortiz wrote:
>>>
>>>>> Do you know what could be causing this issue. I have managed to 
>>>>> repeat this
>>>>> traces (NOHZ...) without using my code, using a workqueue and in 
>>>>> the work
>>>>> just by doing something like:
>>>>>
>>>>> work_func() {
>>>>>   mdelay(10);
>>>>>   msleep(10);
>>>>>
>>>>>  queue_work(myqueue, mywork);
>>>>> }
>>>>>
>>>>> And then by heavy loading the box from the outside.
>>>>>     
>>> I have written a very small module that causes the
>>> "local_softirq_pending" under not some load. Please find code at the 
>>> end
>>> of this email. Here is pasted some traces of dmesg (I have increased 
>>> the
>>> ratelimit of the "NOHZ: local..." trace to 250.
>>>
>>> The only strange thing here is that I am calling 
>>> "set_workqueue_prio" (I
>>> have hacked source to export this symbol), and I am starting to think
>>> that this could not be a good idea. Any hints about this?
>>>
>>> [  648.954000] NOHZ: local_softirq_pending
>>> 0e
>>>
>>> [  648.955000] NOHZ: local_softirq_pending
>>> 0e
>>>
>>> [  648.956000] NOHZ: local_softirq_pending
>>> 0e
>> I have changed the implementation of the module test to use kthreads
>> instead of workqueues. The behavior is exactly the same. I have tried
>> with prios from 1 to 99. Please find the code below as before. I have
>> also atached the differente softirqs codes that had been pending in some
>> of the tests.
> I have even tried this without any system-loader module. Just by 
> booting the kernel and pinging the box very heavily, there are a lot 
> of NOHZ... traces in dmesg. Indeed they follow a very strange pattern 
> that I cannot match without any part of the kernel. The pattern is the 
> following (NOHZ and HZ=1000):
> [...]
> So it seems that my "RT" tasks is delayed, as you said in your 
> original mail, when the 02 SIRQ is delayed, but the rest of the time 
> is correctly running. This problem appears to be a Kernel or RT patch 
> issue, so please let me know which tests would you like me to do, I 
> have a couple of boxes here and some time to build and test kernels. 
> Alternatively, if you would like me to look at any part of the system, 
> let me know and I will try my best.
>
>>>
>>>
>>>>>> Does it work when you disable CONFIG_NOHZ ?
>>>>>>       
>>> Still pending to test.
I have tried disabling the CONFIG_NOHZ kernel option. Of course the 
trace is gone, but the weird behavior is still there. When I run my 
software without load from the network, the main task of the system 
experiences runtimes of about 700us. When I load the system, there are 
latencies of 50700us, so the 50ms delay is again there, and again the 
time when the task finishes is always X.296, 1 jiffy after the "NOHZ: 
pending..." was shown with CONFIG_NOHZ enabled.

Please let me know what tests would you like me to do.
>>>>> I will try this and let the list know.
>>>>>  
>>>>>> Thanks,
>>>>>>
>>>>>>        tglx
>>>>>>     
> Thanks,
>
>    Iratxo.
>
Thanks,

    Iratxo.



-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-16 18:39               ` Iratxo Pichel Ortiz
@ 2009-06-17 22:03                 ` Thomas Gleixner
  2009-06-17 22:35                   ` Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-17 22:03 UTC (permalink / raw)
  To: Iratxo Pichel Ortiz
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

On Tue, 16 Jun 2009, Iratxo Pichel Ortiz wrote:
> I have tried disabling the CONFIG_NOHZ kernel option. Of course the trace is
> gone, but the weird behavior is still there. When I run my software without
> load from the network, the main task of the system experiences runtimes of
> about 700us. When I load the system, there are latencies of 50700us, so the
> 50ms delay is again there, and again the time when the task finishes is always
> X.296, 1 jiffy after the "NOHZ: pending..." was shown with CONFIG_NOHZ
> enabled.

What kind of system is this ? 

Can you please check whether you can reproduce with NOHZ=y and function
tracer enabled ?

That would be great.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-17 22:03                 ` Thomas Gleixner
@ 2009-06-17 22:35                   ` Iratxo Pichel Ortiz
  2009-06-17 23:46                     ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-17 22:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

Thomas Gleixner wrote:
> On Tue, 16 Jun 2009, Iratxo Pichel Ortiz wrote:
>   
>> I have tried disabling the CONFIG_NOHZ kernel option. Of course the trace is
>> gone, but the weird behavior is still there. When I run my software without
>> load from the network, the main task of the system experiences runtimes of
>> about 700us. When I load the system, there are latencies of 50700us, so the
>> 50ms delay is again there, and again the time when the task finishes is always
>> X.296, 1 jiffy after the "NOHZ: pending..." was shown with CONFIG_NOHZ
>> enabled.
>>     
>
> What kind of system is this ? 
>   
It is an IXP425 custom board, mainly equivalent to the Intel's IXDP425 
development board, so it is ARM (xscale) architecture.

The problem is perfectly repeatable without the need to have other code 
loaded in the system than the standard kernel, but I first detected it 
with a custom development I am working on, that run a RT task (prio 98) 
every 2.5ms to schedule a radio link.

To detect the problem, with my attached .config, I just need to 
massively ping the ethernet interface and I start to see "NOHZ: 
pending..." every second or couple of seconds from jiffie XXX950 to 
XXX999 (HZ 1000). The SOFTIRQ pendings that I have seen pending are 
almost all of them: RCU, NETTX, NETRX, TIMER...

Having tested this I first thought that it could be related with 
ixp4xx_eth driver but I used a different interface and driver and the 
problem is also there. I have also, at least apparently, discarded the 
bridging code and vlan code, that I was using at first.
> Can you please check whether you can reproduce with NOHZ=y and function
> tracer enabled ?
>
>   
I have already tested with NOHZ, and the behavior is exactly the same, 
there is a delay of about 50ms that always finishes at trace  with 
timestamp XXX.296 seconds buf, of course, the "NOHZ: ..." trace is not 
shown.

I have not tried this with the function tracer enabled, I will build the 
kernel again and enable this tracer. Which function should I trace? Any 
hint about this? I have a couple of systems running so I can make any 
trace you would like. I will try to figure out something and give 
feedback later.

I have also tested the system with the latest 2.29.5-rt21 kernel and RT 
patch.

Thanks a lot,

    Iratxo.
> That would be great.
>
> Thanks,
>
> 	tglx
>
>   


-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-17 22:35                   ` Iratxo Pichel Ortiz
@ 2009-06-17 23:46                     ` Thomas Gleixner
  2009-06-18  0:33                       ` Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-17 23:46 UTC (permalink / raw)
  To: Iratxo Pichel Ortiz
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

On Thu, 18 Jun 2009, Iratxo Pichel Ortiz wrote:
> I have not tried this with the function tracer enabled, I will build the
> kernel again and enable this tracer. Which function should I trace? Any hint
> about this? I have a couple of systems running so I can make any trace you
> would like. I will try to figure out something and give feedback later.

The point is not which function to trace. I don't know. The point is
whether the problem is reproducible with function tracer and nohz
enabled.

Please apply following patch and enable NOHZ and function tracer in
.config. After bootup enable the function tracer in debugfs and start
your load test. 

Once the NOHZ.... message appears the function tracer is frozen and
you can read the trace from mountpoint_of_debugfs/tracing/trace.

Please bzip2 it and upload it somewhere as it's probably too large for
the mailinglist.

Thanks,

	tglx
-------------->
Subject: debug-nohz-issue.patch
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 18 Jun 2009 01:41:01 +0200

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6-tip/kernel/time/tick-sched.c
===================================================================
--- linux-2.6-tip.orig/kernel/time/tick-sched.c
+++ linux-2.6-tip/kernel/time/tick-sched.c
@@ -250,6 +250,8 @@ void tick_nohz_stop_sched_tick(int inidl
 	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
 		static int ratelimit;
 
+		tracing_off();
+
 		if (ratelimit < 10) {
 			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
 			       local_softirq_pending());

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-17 23:46                     ` Thomas Gleixner
@ 2009-06-18  0:33                       ` Iratxo Pichel Ortiz
  2009-06-18  8:57                         ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-18  0:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

Thomas Gleixner wrote:
> Please bzip2 it and upload it somewhere as it's probably too large for
> the mailinglist.
>   
Please find 3 realizations at:
http://www.proyectoradio.com/webdocs/rtkernel/trace.bz2
http://www.proyectoradio.com/webdocs/rtkernel/trace2.bz2
http://www.proyectoradio.com/webdocs/rtkernel/trace3.bz2

Just in case they get lost sometime in the future, at the end of this 
email is the result of "tail -30 trace3", hopefully it is enough of that 
file.

Having read the three of them, regardless of the last process before the 
"NOHZ...", there is always a call to the FPE emulator. I am going to do 
some tests changing the FPE emulator, in case it helps.

Thanks a lot,

    Iratxo.
> Thanks,
>
> 	tglx
> -------------->
> Subject: debug-nohz-issue.patch
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 18 Jun 2009 01:41:01 +0200
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/time/tick-sched.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> Index: linux-2.6-tip/kernel/time/tick-sched.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/time/tick-sched.c
> +++ linux-2.6-tip/kernel/time/tick-sched.c
> @@ -250,6 +250,8 @@ void tick_nohz_stop_sched_tick(int inidl
>  	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>  		static int ratelimit;
>  
> +		tracing_off();
> +
>  		if (ratelimit < 10) {
>  			printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
>  			       local_softirq_pending());
>
>   

              sh-1162  [000] 4154504576.502616: n_tty_set_room <-n_tty_read
              sh-1162  [000] 4154504576.502616: schedule_timeout 
<-n_tty_read
              sh-1162  [000] 4154504576.502616: schedule <-schedule_timeout
              sh-1162  [000] 4154504576.502616: __schedule <-schedule
              sh-1162  [000] 4154504576.502616: rcu_qsctr_inc <-__schedule
              sh-1162  [000] 4154504576.502616: deactivate_task <-__schedule
              sh-1162  [000] 4154504576.502616: dequeue_task 
<-deactivate_task
              sh-1162  [000] 4154504576.502616: update_avg <-dequeue_task
              sh-1162  [000] 4154504576.502616: dequeue_task_fair 
<-dequeue_task
              sh-1162  [000] 4154504576.502616: update_curr 
<-dequeue_task_fair
              sh-1162  [000] 4154504576.502616: clear_buddies 
<-dequeue_task_fair
              sh-1162  [000] 4154504576.502616: __clear_buddies 
<-clear_buddies
              sh-1162  [000] 4154504576.502616: update_min_vruntime 
<-dequeue_task_fair
              sh-1162  [000] 4154504576.502616: put_prev_task_fair 
<-__schedule
              sh-1162  [000] 4154504576.502616: check_spread 
<-put_prev_task_fair
              sh-1162  [000] 4154504576.502616: pick_next_task_rt 
<-__schedule
              sh-1162  [000] 4154504576.502616: pick_next_task_fair 
<-__schedule
              sh-1162  [000] 4154504576.502616: pick_next_task_idle 
<-__schedule
              sh-1162  [000] 4154504576.502616: calc_load_account_active 
<-pick_next_task_idle
              sh-1162  [000] 4154504576.502616: __rcu_read_lock 
<-__atomic_notifier_call_chain
              sh-1162  [000] 4154504576.502616: nwfpe_notify 
<-notifier_call_chain
              sh-1162  [000] 4154504576.502616: dsp_do <-notifier_call_chain
              sh-1162  [000] 4154504576.502616: __rcu_read_unlock 
<-__atomic_notifier_call_chain
          <idle>-0     [000] 4154504576.502616: finish_task_switch 
<-__schedule
          <idle>-0     [000] 4154504576.502616: 
tick_nohz_stop_sched_tick <-cpu_idle
          <idle>-0     [000] 4154504576.502616: ktime_get 
<-tick_nohz_stop_sched_tick
          <idle>-0     [000] 4154504576.502616: ktime_get_ts <-ktime_get
          <idle>-0     [000] 4154504576.502616: getnstimeofday 
<-ktime_get_ts
          <idle>-0     [000] 4154504576.502616: ixp4xx_get_cycles 
<-getnstimeofday
          <idle>-0     [000] 4154504576.502616: set_normalized_timespec 
<-ktime_get_ts

-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-18  0:33                       ` Iratxo Pichel Ortiz
@ 2009-06-18  8:57                         ` Thomas Gleixner
  2009-06-18 16:52                           ` Iratxo Pichel Ortiz
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2009-06-18  8:57 UTC (permalink / raw)
  To: Iratxo Pichel Ortiz
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

Iratxo,

On Thu, 18 Jun 2009, Iratxo Pichel Ortiz wrote:
> Having read the three of them, regardless of the last process before the
> "NOHZ...", there is always a call to the FPE emulator. I am going to do some
> tests changing the FPE emulator, in case it helps.

No, what happens is that the -RT starvation protector kicks in. That's
a mechanism which checks whether the CPU is monopolized by rt tasks
for longer than 950ms. It then blocks execution of rt tasks for
50ms. The silly thing is that it goes idle, but that's a different
problem.

It's default on and can be disabled via:

echo 0 > /proc/sys/kernel/rt_runtime_us

Sorry that I did not notice that earlier, but when that triggers, then
something is really wrong. There should never be a situation where -rt
tasks (neither irq threads nor softirqs) should monopolize the CPU.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: NOHZ: local_softirq_pending
  2009-06-18  8:57                         ` Thomas Gleixner
@ 2009-06-18 16:52                           ` Iratxo Pichel Ortiz
  0 siblings, 0 replies; 14+ messages in thread
From: Iratxo Pichel Ortiz @ 2009-06-18 16:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-rt-users, Noelia Morón, 'Rodrigo Partearroyo'

Thomas,

Thomas Gleixner wrote:
> No, what happens is that the -RT starvation protector kicks in. That's
> a mechanism which checks whether the CPU is monopolized by rt tasks
> for longer than 950ms. It then blocks execution of rt tasks for
> 50ms. The silly thing is that it goes idle, but that's a different
> problem.
>   
> It's default on and can be disabled via:
>
> echo 0 > /proc/sys/kernel/rt_runtime_us
>
> Sorry that I did not notice that earlier, but when that triggers, then
> something is really wrong. There should never be a situation where -rt
> tasks (neither irq threads nor softirqs) should monopolize the CPU.
>   
No problem at all! Please let me know if the following assumption is 
right: This protection mechanism is dependent on the available CPU 
power. I mean, if the system has low CPU power, and if, for example, the 
network has very very high load, the total time taken to process all the 
data will be long. Generally much longer than in more powerful systems, 
so this protection arises. Is this right or it should not happen by 
design of the softirqs/RT tasks?

Anyway, seems to be correctly working. I still have to check why with  
HZ=100 it was periodically happening this events even with low load, but 
that could be very probably related to my software.

Again, thank you very much.

Best regards,

    Iratxo.
> Thanks,
>
> 	tglx
>
>   


-- 
Iratxo Pichel Ortiz
Software Development Manager

Albentia Systems S.A.
http://www.albentia.com

Tel: +34 914400567
Cel: +34 663808405
Fax: +34 914400569

C\Margarita Salas 22
Parque Tecnológico de Leganés
Leganés (28918)
Madrid
Spain


--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-06-18 16:52 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-14 15:28 NOHZ: local_softirq_pending Iratxo Pichel Ortiz
2009-06-14 19:50 ` Iratxo Pichel Ortiz
2009-06-15 13:19   ` Thomas Gleixner
2009-06-15 14:38     ` Iratxo Pichel Ortiz
     [not found]       ` <8e6b7a710906160101x6a8ae9d5qa7638627f513278@mail.gmail.com>
2009-06-16  8:26         ` NOHZ: local_softirq_pending - More info without need for specific code Iratxo Pichel Ortiz
     [not found]         ` <4A376450.5020209@albentia.com>
2009-06-16  9:26           ` Iratxo Pichel Ortiz
2009-06-16 14:45             ` NOHZ: local_softirq_pending Iratxo Pichel Ortiz
2009-06-16 18:39               ` Iratxo Pichel Ortiz
2009-06-17 22:03                 ` Thomas Gleixner
2009-06-17 22:35                   ` Iratxo Pichel Ortiz
2009-06-17 23:46                     ` Thomas Gleixner
2009-06-18  0:33                       ` Iratxo Pichel Ortiz
2009-06-18  8:57                         ` Thomas Gleixner
2009-06-18 16:52                           ` Iratxo Pichel Ortiz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.