linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [2.6.21.1] soft lockup when removing netconsole module
@ 2007-05-26 15:40 Folkert van Heusden
  2007-05-26 15:53 ` Parag Warudkar
  2007-05-29  7:56 ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Folkert van Heusden @ 2007-05-26 15:40 UTC (permalink / raw)
  To: linux-kernel

Hi,

When trying to remove the netconsole module, I got the following kernel
output after a while (couple of minutes iirc):

[525720.117293] BUG: soft lockup detected on CPU#1!
[525720.117353]  [<c1004d53>] show_trace_log_lvl+0x1a/0x30
[525720.117439]  [<c1004d7b>] show_trace+0x12/0x14
[525720.117526]  [<c1004e75>] dump_stack+0x16/0x18
[525720.117613]  [<c104dd5b>] softlockup_tick+0xa6/0xc2
[525720.117694]  [<c1026855>] run_local_timers+0x12/0x14
[525720.117738]  [<c1026669>] update_process_times+0x72/0xa1
[525720.117744]  [<c1038673>] tick_sched_timer+0x53/0xb6
[525720.117748]  [<c1033d62>] hrtimer_interrupt+0x189/0x1e3
[525720.117753]  [<c100e9e2>] local_apic_timer_interrupt+0x55/0x5b
[525720.117761]  [<c100ea12>] smp_apic_timer_interrupt+0x2a/0x39
[525720.117766]  [<c1004a3f>] apic_timer_interrupt+0x33/0x38
[525720.117770]  [<c120f4b1>] mutex_lock+0x8/0xa
[525720.117775]  [<c102d2f0>] flush_workqueue+0x2f/0x8f
[525720.117780]  [<c102d7a0>] cancel_rearming_delayed_workqueue+0x29/0x2b
[525720.117785]  [<c102d7b1>] cancel_rearming_delayed_work+0xf/0x11
[525720.117790]  [<c11be143>] netpoll_cleanup+0x75/0xa5
[525720.117794]  [<f893712d>] cleanup_netconsole+0x17/0x1a [netconsole]
[525720.117804]  [<c1041f11>] sys_delete_module+0x12f/0x14f
[525720.117809]  [<c1003f74>] syscall_call+0x7/0xb
[525720.117812]  =======================

Also the rmmod hangs and would not exit even with kill -9. It also
sucks up 100% cpu.


Folkert van Heusden

-- 
MultiTail es una herramienta flexibele para consiguir archivos de log,
y para ejecutar ordenes. Filtrar, añadir colores, merger y vista de
las differencias.  http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 15:40 [2.6.21.1] soft lockup when removing netconsole module Folkert van Heusden
@ 2007-05-26 15:53 ` Parag Warudkar
  2007-05-26 16:12   ` Thomas Gleixner
  2007-05-29  7:56 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Parag Warudkar @ 2007-05-26 15:53 UTC (permalink / raw)
  To: Folkert van Heusden; +Cc: linux-kernel, tglx

Seems to be hrtimers related - CC'ed Thomas.

Parag

Folkert van Heusden wrote:
> Hi,
> 
> When trying to remove the netconsole module, I got the following kernel
> output after a while (couple of minutes iirc):
> 
> [525720.117293] BUG: soft lockup detected on CPU#1!
> [525720.117353]  [<c1004d53>] show_trace_log_lvl+0x1a/0x30
> [525720.117439]  [<c1004d7b>] show_trace+0x12/0x14
> [525720.117526]  [<c1004e75>] dump_stack+0x16/0x18
> [525720.117613]  [<c104dd5b>] softlockup_tick+0xa6/0xc2
> [525720.117694]  [<c1026855>] run_local_timers+0x12/0x14
> [525720.117738]  [<c1026669>] update_process_times+0x72/0xa1
> [525720.117744]  [<c1038673>] tick_sched_timer+0x53/0xb6
> [525720.117748]  [<c1033d62>] hrtimer_interrupt+0x189/0x1e3
> [525720.117753]  [<c100e9e2>] local_apic_timer_interrupt+0x55/0x5b
> [525720.117761]  [<c100ea12>] smp_apic_timer_interrupt+0x2a/0x39
> [525720.117766]  [<c1004a3f>] apic_timer_interrupt+0x33/0x38
> [525720.117770]  [<c120f4b1>] mutex_lock+0x8/0xa
> [525720.117775]  [<c102d2f0>] flush_workqueue+0x2f/0x8f
> [525720.117780]  [<c102d7a0>] cancel_rearming_delayed_workqueue+0x29/0x2b
> [525720.117785]  [<c102d7b1>] cancel_rearming_delayed_work+0xf/0x11
> [525720.117790]  [<c11be143>] netpoll_cleanup+0x75/0xa5
> [525720.117794]  [<f893712d>] cleanup_netconsole+0x17/0x1a [netconsole]
> [525720.117804]  [<c1041f11>] sys_delete_module+0x12f/0x14f
> [525720.117809]  [<c1003f74>] syscall_call+0x7/0xb
> [525720.117812]  =======================
> 
> Also the rmmod hangs and would not exit even with kill -9. It also
> sucks up 100% cpu.
> 
> 
> Folkert van Heusden
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 15:53 ` Parag Warudkar
@ 2007-05-26 16:12   ` Thomas Gleixner
  2007-05-26 16:17     ` Folkert van Heusden
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2007-05-26 16:12 UTC (permalink / raw)
  To: Parag Warudkar; +Cc: Folkert van Heusden, linux-kernel

On Sat, 2007-05-26 at 11:53 -0400, Parag Warudkar wrote:
> Seems to be hrtimers related - CC'ed Thomas.

I doubt that. The tick interrupt just finds out, that the machine is
stuck in rmmod.

> > Also the rmmod hangs and would not exit even with kill -9. It also
> > sucks up 100% cpu.

Can you please enable CONFIG_PROVE_LOCKING ?

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 16:12   ` Thomas Gleixner
@ 2007-05-26 16:17     ` Folkert van Heusden
  2007-05-26 16:35       ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Folkert van Heusden @ 2007-05-26 16:17 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Parag Warudkar, linux-kernel

> > Seems to be hrtimers related - CC'ed Thomas.
> I doubt that. The tick interrupt just finds out, that the machine is
> stuck in rmmod.
> > > Also the rmmod hangs and would not exit even with kill -9. It also
> > > sucks up 100% cpu.
> Can you please enable CONFIG_PROVE_LOCKING ?

Luckily I already had :-)
Here is its output:

[   44.500182] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[   44.500228] ... MAX_LOCKDEP_SUBCLASSES:    8
[   44.500266] ... MAX_LOCK_DEPTH:          30
[   44.500305] ... MAX_LOCKDEP_KEYS:        2048
[   44.500343] ... CLASSHASH_SIZE:           1024
[   44.500383] ... MAX_LOCKDEP_ENTRIES:     8192
[   44.500421] ... MAX_LOCKDEP_CHAINS:      16384
[   44.500459] ... CHAINHASH_SIZE:          8192
[   44.500498]  memory used by lock dependency info: 1096 kB
[   44.500537]  per task-struct memory footprint: 1200 bytes
[   44.500576] ------------------------
[   44.500614] | Locking API testsuite:
[   44.500651] ----------------------------------------------------------------------------
[   44.500697]                                  | spin |wlock |rlock |mutex | wsem | rsem |
[   44.500743]   --------------------------------------------------------------------------
[   44.500794]                      A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.501739]                  A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.502675]              A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.503617]              A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.504557]          A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.505509]          A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.506469]          A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.507424]                     double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.508327]                   initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.509245]                  bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
[   44.510197]   --------------------------------------------------------------------------
[   44.510243]               recursive read-lock:             |  ok  |             |  ok  |
[   44.510656]            recursive read-lock #2:             |  ok  |             |  ok  |
[   44.511069]             mixed read-write-lock:             |  ok  |             |  ok  |
[   44.511507]             mixed write-read-lock:             |  ok  |             |  ok  |
[   44.511968]   --------------------------------------------------------------------------
[   44.512015]      hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[   44.512501]      soft-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
[   44.512998]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[   44.513484]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
[   44.513970]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
[   44.514456]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
[   44.514941]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
[   44.515425]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
[   44.515910]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
[   44.516396]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
[   44.516882]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
[   44.517380]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
[   44.517924]     hard-safe-A + unsafe-B #1/132:  ok  |  ok  |  ok  |
[   44.518417]     soft-safe-A + unsafe-B #1/132:  ok  |  ok  |  ok  |
[   44.518909]     hard-safe-A + unsafe-B #1/213:  ok  |  ok  |  ok  |
[   44.519402]     soft-safe-A + unsafe-B #1/213:  ok  |  ok  |  ok  |
[   44.519896]     hard-safe-A + unsafe-B #1/231:  ok  |  ok  |  ok  |
[   44.520388]     soft-safe-A + unsafe-B #1/231:  ok  |  ok  |  ok  |
[   44.520881]     hard-safe-A + unsafe-B #1/312:  ok  |  ok  |  ok  |
[   44.521388]     soft-safe-A + unsafe-B #1/312:  ok  |  ok  |  ok  |
[   44.521886]     hard-safe-A + unsafe-B #1/321:  ok  |  ok  |  ok  |
[   44.522389]     soft-safe-A + unsafe-B #1/321:  ok  |  ok  |  ok  |
[   44.522907]     hard-safe-A + unsafe-B #2/123:  ok  |  ok  |  ok  |
[   44.523402]     soft-safe-A + unsafe-B #2/123:  ok  |  ok  |  ok  |
[   44.523894]     hard-safe-A + unsafe-B #2/132:  ok  |  ok  |  ok  |
[   44.524387]     soft-safe-A + unsafe-B #2/132:  ok  |  ok  |  ok  |
[   44.524880]     hard-safe-A + unsafe-B #2/213:  ok  |  ok  |  ok  |
[   44.525373]     soft-safe-A + unsafe-B #2/213:  ok  |  ok  |  ok  |
[   44.525866]     hard-safe-A + unsafe-B #2/231:  ok  |  ok  |  ok  |
[   44.526359]     soft-safe-A + unsafe-B #2/231:  ok  |  ok  |  ok  |
[   44.526850]     hard-safe-A + unsafe-B #2/312:  ok  |  ok  |  ok  |
[   44.527342]     soft-safe-A + unsafe-B #2/312:  ok  |  ok  |  ok  |
[   44.527842]     hard-safe-A + unsafe-B #2/321:  ok  |  ok  |  ok  |
[   44.528382]     soft-safe-A + unsafe-B #2/321:  ok  |  ok  |  ok  |
[   44.529025]       hard-irq lock-inversion/123:  ok  |  ok  |  ok  |
[   44.529516]       soft-irq lock-inversion/123:  ok  |  ok  |  ok  |
[   44.530010]       hard-irq lock-inversion/132:  ok  |  ok  |  ok  |
[   44.530501]       soft-irq lock-inversion/132:  ok  |  ok  |  ok  |
[   44.530995]       hard-irq lock-inversion/213:  ok  |  ok  |  ok  |
[   44.531488]       soft-irq lock-inversion/213:  ok  |  ok  |  ok  |
[   44.531980]       hard-irq lock-inversion/231:  ok  |  ok  |  ok  |
[   44.532470]       soft-irq lock-inversion/231:  ok  |  ok  |  ok  |
[   44.532962]       hard-irq lock-inversion/312:  ok  |  ok  |  ok  |
[   44.533478]       soft-irq lock-inversion/312:  ok  |  ok  |  ok  |
[   44.533998]       hard-irq lock-inversion/321:  ok  |  ok  |  ok  |
[   44.534515]       soft-irq lock-inversion/321:  ok  |  ok  |  ok  |
[   44.535017]       hard-irq read-recursion/123:  ok  |
[   44.535227]       soft-irq read-recursion/123:  ok  |
[   44.535440]       hard-irq read-recursion/132:  ok  |
[   44.535652]       soft-irq read-recursion/132:  ok  |
[   44.535865]       hard-irq read-recursion/213:  ok  |
[   44.536079]       soft-irq read-recursion/213:  ok  |
[   44.536291]       hard-irq read-recursion/231:  ok  |
[   44.536504]       soft-irq read-recursion/231:  ok  |
[   44.536742]       hard-irq read-recursion/312:  ok  |
[   44.536979]       soft-irq read-recursion/312:  ok  |
[   44.537216]       hard-irq read-recursion/321:  ok  |
[   44.537454]       soft-irq read-recursion/321:  ok  |
[   44.537693] -------------------------------------------------------
[   44.537733] Good, all 218 testcases passed! |
[   44.537774] ---------------------------------


Folkert van Heusden

-- 
MultiTail ist eine flexible Applikation um Logfiles und Kommando
Eingaben zu überprüfen. Inkl. Filter, Farben, Zusammenführen,
Ansichten etc. http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 16:17     ` Folkert van Heusden
@ 2007-05-26 16:35       ` Thomas Gleixner
  2007-05-26 16:49         ` Folkert van Heusden
  2007-05-27 20:32         ` Matt Mackall
  0 siblings, 2 replies; 18+ messages in thread
From: Thomas Gleixner @ 2007-05-26 16:35 UTC (permalink / raw)
  To: Folkert van Heusden; +Cc: Parag Warudkar, linux-kernel

On Sat, 2007-05-26 at 18:17 +0200, Folkert van Heusden wrote:
> > > Seems to be hrtimers related - CC'ed Thomas.
> > I doubt that. The tick interrupt just finds out, that the machine is
> > stuck in rmmod.
> > > > Also the rmmod hangs and would not exit even with kill -9. It also
> > > > sucks up 100% cpu.
> > Can you please enable CONFIG_PROVE_LOCKING ?
>
> Luckily I already had :-)
> Here is its output:

So there is no output from lockdep, during the rmmod hang ? The rmmod
sucks 100% cpu, so it busy loops at some place. Looking at the stack
trace this seems to be in:


void cancel_rearming_delayed_workqueue(struct workqueue_struct *wq,                                                                                                                 
                                       struct delayed_work *dwork)                                                                                                                  
{                                                                                                                                                                                   
        while (!cancel_delayed_work(dwork))                                                                                                                                         
                flush_workqueue(wq);                                                                                                                                                
}

which is not really related to timers :)

	tglx



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 16:35       ` Thomas Gleixner
@ 2007-05-26 16:49         ` Folkert van Heusden
  2007-05-26 17:06           ` Thomas Gleixner
  2007-05-27 20:32         ` Matt Mackall
  1 sibling, 1 reply; 18+ messages in thread
From: Folkert van Heusden @ 2007-05-26 16:49 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Parag Warudkar, linux-kernel

> > > > Seems to be hrtimers related - CC'ed Thomas.
> > > I doubt that. The tick interrupt just finds out, that the machine is
> > > stuck in rmmod.
> > > > > Also the rmmod hangs and would not exit even with kill -9. It also
> > > > > sucks up 100% cpu.
> > > Can you please enable CONFIG_PROVE_LOCKING ?
> > Luckily I already had :-)
> > Here is its output:
> So there is no output from lockdep, during the rmmod hang ? The rmmod

Nope, just other subsystems (nfs, usb) minutes before that.

> sucks 100% cpu, so it busy loops at some place. Looking at the stack
> trace this seems to be in:
> void cancel_rearming_delayed_workqueue(struct workqueue_struct *wq,
>                                        struct delayed_work *dwork)
> {
>         while (!cancel_delayed_work(dwork))
>                 flush_workqueue(wq);
> }
> which is not really related to timers :)

Ok.


Folkert van Heusden

-- 

Multitail - gibkaja utilita po sledovaniju log-fajlov i vyvoda
kommand. Fil'trovanie, raskrašivanie, slijanie, vizual'noe sravnenie,
i t.d.  http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 16:49         ` Folkert van Heusden
@ 2007-05-26 17:06           ` Thomas Gleixner
  2007-05-26 17:12             ` Folkert van Heusden
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2007-05-26 17:06 UTC (permalink / raw)
  To: Folkert van Heusden; +Cc: Parag Warudkar, linux-kernel

On Sat, 2007-05-26 at 18:49 +0200, Folkert van Heusden wrote:
> > > > > Seems to be hrtimers related - CC'ed Thomas.
> > > > I doubt that. The tick interrupt just finds out, that the machine is
> > > > stuck in rmmod.
> > > > > > Also the rmmod hangs and would not exit even with kill -9. It also
> > > > > > sucks up 100% cpu.
> > > > Can you please enable CONFIG_PROVE_LOCKING ?
> > > Luckily I already had :-)
> > > Here is its output:
> > So there is no output from lockdep, during the rmmod hang ? The rmmod
> 
> Nope, just other subsystems (nfs, usb) minutes before that.

Hmm. Can you provide the output of that ? Might be some stale stuff
leading to the hang later.

	tglx



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 17:06           ` Thomas Gleixner
@ 2007-05-26 17:12             ` Folkert van Heusden
  0 siblings, 0 replies; 18+ messages in thread
From: Folkert van Heusden @ 2007-05-26 17:12 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Parag Warudkar, linux-kernel

> > > > > > Seems to be hrtimers related - CC'ed Thomas.
> > > > > I doubt that. The tick interrupt just finds out, that the machine is
> > > > > stuck in rmmod.
> > > > > > > Also the rmmod hangs and would not exit even with kill -9. It also
> > > > > > > sucks up 100% cpu.
> > > > > Can you please enable CONFIG_PROVE_LOCKING ?
> > > > Luckily I already had :-)
> > > > Here is its output:
> > > So there is no output from lockdep, during the rmmod hang ? The rmmod
> > Nope, just other subsystems (nfs, usb) minutes before that.
> Hmm. Can you provide the output of that ? Might be some stale stuff
> leading to the hang later.

It is output like:
[  843.122782] usb 7-2: usbfs: USBDEVFS_CONTROL failed cmd newhidups rqt 161 rq 1 len 10 ret -110
[ 1292.121131] usb 7-2: usbfs: USBDEVFS_CONTROL failed cmd newhidups rqt 161 rq 1 len 10 ret -110

and about the nfs server temporarily not available. Please note that I
did not do the rmmod from a nfs mountpoint.


Folkert van Heusden

-- 

Multitail - gibkaja utilita po sledovaniju log-fajlov i vyvoda
kommand. Fil'trovanie, raskrašivanie, slijanie, vizual'noe sravnenie,
i t.d.  http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 16:35       ` Thomas Gleixner
  2007-05-26 16:49         ` Folkert van Heusden
@ 2007-05-27 20:32         ` Matt Mackall
  1 sibling, 0 replies; 18+ messages in thread
From: Matt Mackall @ 2007-05-27 20:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Folkert van Heusden, Parag Warudkar, linux-kernel, Stephen Hemminger

On Sat, May 26, 2007 at 06:35:24PM +0200, Thomas Gleixner wrote:
> On Sat, 2007-05-26 at 18:17 +0200, Folkert van Heusden wrote:
> > > > Seems to be hrtimers related - CC'ed Thomas.
> > > I doubt that. The tick interrupt just finds out, that the machine is
> > > stuck in rmmod.
> > > > > Also the rmmod hangs and would not exit even with kill -9. It also
> > > > > sucks up 100% cpu.
> > > Can you please enable CONFIG_PROVE_LOCKING ?
> >
> > Luckily I already had :-)
> > Here is its output:
> 
> So there is no output from lockdep, during the rmmod hang ? The rmmod
> sucks 100% cpu, so it busy loops at some place. Looking at the stack
> trace this seems to be in:
> 
> 
> void cancel_rearming_delayed_workqueue(struct workqueue_struct *wq,                                                                                                                 
>                                        struct delayed_work *dwork)                                                                                                                  
> {                                                                                                                                                                                   
>         while (!cancel_delayed_work(dwork))                                                                                                                                         
>                 flush_workqueue(wq);                                                                                                                                                
> }
> 
> which is not really related to timers :)

Looks like Steve added the delayed work bits, though I never saw any
of his patches.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-26 15:40 [2.6.21.1] soft lockup when removing netconsole module Folkert van Heusden
  2007-05-26 15:53 ` Parag Warudkar
@ 2007-05-29  7:56 ` Andrew Morton
  2007-05-30 13:28   ` [PATCH] " Jason Wessel
  2007-06-12 11:02   ` Jarek Poplawski
  1 sibling, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2007-05-29  7:56 UTC (permalink / raw)
  To: Folkert van Heusden
  Cc: linux-kernel, Jarek Poplawski, Jason Wessel, Thomas Gleixner, stable

On Sat, 26 May 2007 17:40:12 +0200 Folkert van Heusden <folkert@vanheusden.com> wrote:

> When trying to remove the netconsole module, I got the following kernel
> output after a while (couple of minutes iirc):
> 
> [525720.117293] BUG: soft lockup detected on CPU#1!
> [525720.117353]  [<c1004d53>] show_trace_log_lvl+0x1a/0x30
> [525720.117439]  [<c1004d7b>] show_trace+0x12/0x14
> [525720.117526]  [<c1004e75>] dump_stack+0x16/0x18
> [525720.117613]  [<c104dd5b>] softlockup_tick+0xa6/0xc2
> [525720.117694]  [<c1026855>] run_local_timers+0x12/0x14
> [525720.117738]  [<c1026669>] update_process_times+0x72/0xa1
> [525720.117744]  [<c1038673>] tick_sched_timer+0x53/0xb6
> [525720.117748]  [<c1033d62>] hrtimer_interrupt+0x189/0x1e3
> [525720.117753]  [<c100e9e2>] local_apic_timer_interrupt+0x55/0x5b
> [525720.117761]  [<c100ea12>] smp_apic_timer_interrupt+0x2a/0x39
> [525720.117766]  [<c1004a3f>] apic_timer_interrupt+0x33/0x38
> [525720.117770]  [<c120f4b1>] mutex_lock+0x8/0xa
> [525720.117775]  [<c102d2f0>] flush_workqueue+0x2f/0x8f
> [525720.117780]  [<c102d7a0>] cancel_rearming_delayed_workqueue+0x29/0x2b
> [525720.117785]  [<c102d7b1>] cancel_rearming_delayed_work+0xf/0x11
> [525720.117790]  [<c11be143>] netpoll_cleanup+0x75/0xa5
> [525720.117794]  [<f893712d>] cleanup_netconsole+0x17/0x1a [netconsole]
> [525720.117804]  [<c1041f11>] sys_delete_module+0x12f/0x14f
> [525720.117809]  [<c1003f74>] syscall_call+0x7/0xb
> [525720.117812]  =======================
> 
> Also the rmmod hangs and would not exit even with kill -9. It also
> sucks up 100% cpu.

Jason recently posted a mystery patch without telling us what problem it
fixed.

It looks like you just found it: cancel_rearming_delayed_work() will hang
if the work isn't actually pending.  Please test this:


From: Jason Wessel <jason.wessel@windriver.com>

Do not call cancel_rearming_delayed_work() if there is no
pending work.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/core/netpoll.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff -puN net/core/netpoll.c~a net/core/netpoll.c
--- a/net/core/netpoll.c~a
+++ a/net/core/netpoll.c
@@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
 			if (atomic_dec_and_test(&npinfo->refcnt)) {
 				skb_queue_purge(&npinfo->arp_tx);
 				skb_queue_purge(&npinfo->txq);
-				cancel_rearming_delayed_work(&npinfo->tx_work);
-				flush_scheduled_work();
+				if (delayed_work_pending(&npinfo->tx_work)) {
+					cancel_rearming_delayed_work(&npinfo->tx_work);
+					flush_scheduled_work();
+				}
 
 				kfree(npinfo);
 			}
_


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-29  7:56 ` Andrew Morton
@ 2007-05-30 13:28   ` Jason Wessel
  2007-05-30 20:38     ` Folkert van Heusden
  2007-06-12 11:02   ` Jarek Poplawski
  1 sibling, 1 reply; 18+ messages in thread
From: Jason Wessel @ 2007-05-30 13:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Folkert van Heusden, linux-kernel, Jarek Poplawski,
	Thomas Gleixner, stable

[-- Attachment #1: Type: text/plain, Size: 341 bytes --]


Attached is a patch to fix the soft lockup problem when removing the 
netconsole module.

It looks different than the original patch I posted because the context 
had to change to maintain 80 column code.  For reference the original 
e-mail + patch was "[BUG] 2.6.21 hang in 
cancel_rearming_delayed_workqueue()" sent on 5/25/07.

Jason.



[-- Attachment #2: netpoll_cleanup_fix.patch --]
[-- Type: text/plain, Size: 2109 bytes --]

>From 102a018d601e954f658d272d4daeef139ae40e0e Mon Sep 17 00:00:00 2001
From: Jason Wessel <jason.wessel@windriver.com>
Date: Wed, 30 May 2007 08:03:52 -0500
Subject: [PATCH] The netpoll_cleanup handler can hang the kernel if there is no work in
the work queue because a call to cancel_rearming_delayed_work() with
no work goes into an infinite loop.

The typical case where this is a problem is on removing a kernel
module such as the netconsole driver or kgdboe.

To maintain 80 column code, the function had to have one level of braces
dropped.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
---
 net/core/netpoll.c |   36 +++++++++++++++++++-----------------
 1 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 4581ece..28fa50e 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -766,30 +766,32 @@ void netpoll_cleanup(struct netpoll *np)
 	struct netpoll_info *npinfo;
 	unsigned long flags;
 
-	if (np->dev) {
-		npinfo = np->dev->npinfo;
-		if (npinfo) {
-			if (npinfo->rx_np == np) {
-				spin_lock_irqsave(&npinfo->rx_lock, flags);
-				npinfo->rx_np = NULL;
-				npinfo->rx_flags &= ~NETPOLL_RX_ENABLED;
-				spin_unlock_irqrestore(&npinfo->rx_lock, flags);
-			}
+	if (!np->dev)
+		return;
+
+	npinfo = np->dev->npinfo;
+	if (npinfo) {
+		if (npinfo->rx_np == np) {
+			spin_lock_irqsave(&npinfo->rx_lock, flags);
+			npinfo->rx_np = NULL;
+			npinfo->rx_flags &= ~NETPOLL_RX_ENABLED;
+			spin_unlock_irqrestore(&npinfo->rx_lock, flags);
+		}
 
-			np->dev->npinfo = NULL;
-			if (atomic_dec_and_test(&npinfo->refcnt)) {
-				skb_queue_purge(&npinfo->arp_tx);
-				skb_queue_purge(&npinfo->txq);
+		np->dev->npinfo = NULL;
+		if (atomic_dec_and_test(&npinfo->refcnt)) {
+			skb_queue_purge(&npinfo->arp_tx);
+			skb_queue_purge(&npinfo->txq);
+			if (delayed_work_pending(&npinfo->tx_work)) {
 				cancel_rearming_delayed_work(&npinfo->tx_work);
 				flush_scheduled_work();
-
-				kfree(npinfo);
 			}
-		}
 
-		dev_put(np->dev);
+			kfree(npinfo);
+		}
 	}
 
+	dev_put(np->dev);
 	np->dev = NULL;
 }
 
-- 
1.5.0.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-30 13:28   ` [PATCH] " Jason Wessel
@ 2007-05-30 20:38     ` Folkert van Heusden
  0 siblings, 0 replies; 18+ messages in thread
From: Folkert van Heusden @ 2007-05-30 20:38 UTC (permalink / raw)
  To: Jason Wessel
  Cc: Andrew Morton, linux-kernel, Jarek Poplawski, Thomas Gleixner, stable

This patch seems to solve the problem.
I removed (and inserted) the module twice without any oopses or panics
or hangs in general.

On Wed, May 30, 2007 at 08:28:09AM -0500, Jason Wessel wrote:
>
> Attached is a patch to fix the soft lockup problem when removing the 
> netconsole module.
>
> It looks different than the original patch I posted because the context had 
> to change to maintain 80 column code.  For reference the original e-mail + 
> patch was "[BUG] 2.6.21 hang in cancel_rearming_delayed_workqueue()" sent 
> on 5/25/07.
>
> Jason.
>
>

> >From 102a018d601e954f658d272d4daeef139ae40e0e Mon Sep 17 00:00:00 2001
> From: Jason Wessel <jason.wessel@windriver.com>
> Date: Wed, 30 May 2007 08:03:52 -0500
> Subject: [PATCH] The netpoll_cleanup handler can hang the kernel if there is no work in
> the work queue because a call to cancel_rearming_delayed_work() with
> no work goes into an infinite loop.
> 
> The typical case where this is a problem is on removing a kernel
> module such as the netconsole driver or kgdboe.
> 
> To maintain 80 column code, the function had to have one level of braces
> dropped.
> 
> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> ---
>  net/core/netpoll.c |   36 +++++++++++++++++++-----------------
>  1 files changed, 19 insertions(+), 17 deletions(-)
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 4581ece..28fa50e 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -766,30 +766,32 @@ void netpoll_cleanup(struct netpoll *np)
>  	struct netpoll_info *npinfo;
>  	unsigned long flags;
>  
> -	if (np->dev) {
> -		npinfo = np->dev->npinfo;
> -		if (npinfo) {
> -			if (npinfo->rx_np == np) {
> -				spin_lock_irqsave(&npinfo->rx_lock, flags);
> -				npinfo->rx_np = NULL;
> -				npinfo->rx_flags &= ~NETPOLL_RX_ENABLED;
> -				spin_unlock_irqrestore(&npinfo->rx_lock, flags);
> -			}
> +	if (!np->dev)
> +		return;
> +
> +	npinfo = np->dev->npinfo;
> +	if (npinfo) {
> +		if (npinfo->rx_np == np) {
> +			spin_lock_irqsave(&npinfo->rx_lock, flags);
> +			npinfo->rx_np = NULL;
> +			npinfo->rx_flags &= ~NETPOLL_RX_ENABLED;
> +			spin_unlock_irqrestore(&npinfo->rx_lock, flags);
> +		}
>  
> -			np->dev->npinfo = NULL;
> -			if (atomic_dec_and_test(&npinfo->refcnt)) {
> -				skb_queue_purge(&npinfo->arp_tx);
> -				skb_queue_purge(&npinfo->txq);
> +		np->dev->npinfo = NULL;
> +		if (atomic_dec_and_test(&npinfo->refcnt)) {
> +			skb_queue_purge(&npinfo->arp_tx);
> +			skb_queue_purge(&npinfo->txq);
> +			if (delayed_work_pending(&npinfo->tx_work)) {
>  				cancel_rearming_delayed_work(&npinfo->tx_work);
>  				flush_scheduled_work();
> -
> -				kfree(npinfo);
>  			}
> -		}
>  
> -		dev_put(np->dev);
> +			kfree(npinfo);
> +		}
>  	}
>  
> +	dev_put(np->dev);
>  	np->dev = NULL;
>  }
>  
> -- 
> 1.5.0.6
> 



Folkert van Heusden

-- 

Multitail - gibkaja utilita po sledovaniju log-fajlov i vyvoda
kommand. Fil'trovanie, raskrašivanie, slijanie, vizual'noe sravnenie,
i t.d.  http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-05-29  7:56 ` Andrew Morton
  2007-05-30 13:28   ` [PATCH] " Jason Wessel
@ 2007-06-12 11:02   ` Jarek Poplawski
  2007-06-13  9:25     ` [PATCH] " Jarek Poplawski
  1 sibling, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2007-06-12 11:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Folkert van Heusden, linux-kernel, Jason Wessel, Thomas Gleixner,
	stable, netdev

On Tue, May 29, 2007 at 12:56:28AM -0700, Andrew Morton wrote:
> On Sat, 26 May 2007 17:40:12 +0200 Folkert van Heusden <folkert@vanheusden.com> wrote:
> 
> > When trying to remove the netconsole module, I got the following kernel
> > output after a while (couple of minutes iirc):
> > 
> > [525720.117293] BUG: soft lockup detected on CPU#1!
> > [525720.117353]  [<c1004d53>] show_trace_log_lvl+0x1a/0x30
> > [525720.117439]  [<c1004d7b>] show_trace+0x12/0x14
> > [525720.117526]  [<c1004e75>] dump_stack+0x16/0x18
> > [525720.117613]  [<c104dd5b>] softlockup_tick+0xa6/0xc2
> > [525720.117694]  [<c1026855>] run_local_timers+0x12/0x14
> > [525720.117738]  [<c1026669>] update_process_times+0x72/0xa1
> > [525720.117744]  [<c1038673>] tick_sched_timer+0x53/0xb6
> > [525720.117748]  [<c1033d62>] hrtimer_interrupt+0x189/0x1e3
> > [525720.117753]  [<c100e9e2>] local_apic_timer_interrupt+0x55/0x5b
> > [525720.117761]  [<c100ea12>] smp_apic_timer_interrupt+0x2a/0x39
> > [525720.117766]  [<c1004a3f>] apic_timer_interrupt+0x33/0x38
> > [525720.117770]  [<c120f4b1>] mutex_lock+0x8/0xa
> > [525720.117775]  [<c102d2f0>] flush_workqueue+0x2f/0x8f
> > [525720.117780]  [<c102d7a0>] cancel_rearming_delayed_workqueue+0x29/0x2b
> > [525720.117785]  [<c102d7b1>] cancel_rearming_delayed_work+0xf/0x11
> > [525720.117790]  [<c11be143>] netpoll_cleanup+0x75/0xa5
> > [525720.117794]  [<f893712d>] cleanup_netconsole+0x17/0x1a [netconsole]
> > [525720.117804]  [<c1041f11>] sys_delete_module+0x12f/0x14f
> > [525720.117809]  [<c1003f74>] syscall_call+0x7/0xb
> > [525720.117812]  =======================
> > 
> > Also the rmmod hangs and would not exit even with kill -9. It also
> > sucks up 100% cpu.
> 
> Jason recently posted a mystery patch without telling us what problem it
> fixed.
> 

To be fair the problem should be known:

http://marc.info/?l=linux-kernel&m=117700287817801&w=2

List:       linux-kernel
Subject:    Re: [PATCH -mm] workqueue: debug possible endless loop in cancel_rearming_delayed_work
From:       Chuck Ebbert <cebbert () redhat ! com>
Date:       2007-04-19 17:07:11
Message-ID: 4627A1BF.8080406 () redhat ! com

> Okay, an easy test for it: insmod netconsole ; rmmod netconsole
> 
> In 2.6.20.x it loops forever and cancel_rearming_delayed_work()
> is part of the trace...

I hoped the discussion about cancel_rearming_delayed_work would
reach more people (there was also a patch proposal to add a warning
to the usage comment). But it seem it was not enough...

Of course such a problem should preferably be fixed by somebody who
knows the code (alas I don't know netconsole), to be sure all needed
cancels are still done after this change. I hope Jason's patch is
right but I'm a little surprised I can't see netdev in cc (I'll try
to fix this).

Cheers,
Jarek P.

PS: I'm very sorry for such late response (holidays).

> It looks like you just found it: cancel_rearming_delayed_work() will hang
> if the work isn't actually pending.  Please test this:
> 
> 
> From: Jason Wessel <jason.wessel@windriver.com>
> 
> Do not call cancel_rearming_delayed_work() if there is no
> pending work.
> 
> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  net/core/netpoll.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff -puN net/core/netpoll.c~a net/core/netpoll.c
> --- a/net/core/netpoll.c~a
> +++ a/net/core/netpoll.c
> @@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
>  			if (atomic_dec_and_test(&npinfo->refcnt)) {
>  				skb_queue_purge(&npinfo->arp_tx);
>  				skb_queue_purge(&npinfo->txq);
> -				cancel_rearming_delayed_work(&npinfo->tx_work);
> -				flush_scheduled_work();
> +				if (delayed_work_pending(&npinfo->tx_work)) {
> +					cancel_rearming_delayed_work(&npinfo->tx_work);
> +					flush_scheduled_work();
> +				}
>  
>  				kfree(npinfo);
>  			}
> _
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-06-12 11:02   ` Jarek Poplawski
@ 2007-06-13  9:25     ` Jarek Poplawski
  2007-06-26 23:07       ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2007-06-13  9:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Folkert van Heusden, linux-kernel, Jason Wessel, Thomas Gleixner,
	stable, netdev

On Tue, Jun 12, 2007 at 01:02:33PM +0200, Jarek Poplawski wrote:
...
> Of course such a problem should preferably be fixed by somebody who
> knows the code (alas I don't know netconsole), to be sure all needed
> cancels are still done after this change. I hope Jason's patch is
> right but I'm a little surprised I can't see netdev in cc (I'll try
> to fix this).

So, I've had a look into netpoll and, unfortunately, I don't
think this patch is right... 

> > From: Jason Wessel <jason.wessel@windriver.com>
> > 
> > Do not call cancel_rearming_delayed_work() if there is no
> > pending work.
> > 
> > Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > 
> >  net/core/netpoll.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff -puN net/core/netpoll.c~a net/core/netpoll.c
> > --- a/net/core/netpoll.c~a
> > +++ a/net/core/netpoll.c
> > @@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
> >  			if (atomic_dec_and_test(&npinfo->refcnt)) {
> >  				skb_queue_purge(&npinfo->arp_tx);
> >  				skb_queue_purge(&npinfo->txq);
> > -				cancel_rearming_delayed_work(&npinfo->tx_work);
> > -				flush_scheduled_work();
> > +				if (delayed_work_pending(&npinfo->tx_work)) {
> > +					cancel_rearming_delayed_work(&npinfo->tx_work);
> > +					flush_scheduled_work();
> > +				}
> >  
> >  				kfree(npinfo);
> >  			}
> > _

There are such possibilities:

1. After positive delayed_work_pending(&npinfo->tx_work) test
some work is queued, but there is no guarantee that when running
it'll rearm again, so cancel_rearming_delayed_work can loop again;

2. After negative delayed_work_pending(&npinfo->tx_work) test
a work is just running, eg. waiting on netif_tx_lock, while
kfree(npinfo) is done here (oops?!).

I've found an additional problem here with or without this patch:
after deleting a timer in cancel_rearming_delayed_work() there could
stay a last skb queued in npinfo->txq, and after kfree(npinfo)
we have small memory leak. If I'm right here similar fix is needed
in the current netpoll code: additional npinfo->txq purging only
or maybe the whole cancel_rearming_ changed like this.

I've tried to eliminate these problems in attached below patch
proposal. I'm not sure it's all right: as I've written earlier I
don't know netconsole enough, but it's probably a little better
than above solution.

I've some doubts yet (I didn't have time to check this all):

1. I hope this other schedule_delayed_work() from netpoll_send_skb()
is not possible when netpoll_cleanup() runs - if I'm wrong additional
check of npinfo->refcnt should be done there;
2. I also hope npinfo->refcnt before scheduling should be enough here
- if not - another possibility is adding some locking eg.:
netif_tx_lock before cancel for synchronization.

Of course it would be very nice if somebody could test or verify
this patch more.

Regards,
Jarek P.


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c
--- 2.6.21-/net/core/netpoll.c	2007-04-26 15:08:32.000000000 +0200
+++ 2.6.21/net/core/netpoll.c	2007-06-12 21:05:23.000000000 +0200
@@ -73,7 +73,8 @@ static void queue_process(struct work_st
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
 
-			schedule_delayed_work(&npinfo->tx_work, HZ/10);
+			if (atomic_read(&npinfo->refcnt))
+				schedule_delayed_work(&npinfo->tx_work, HZ/10);
 			return;
 		}
 		netif_tx_unlock(dev);
@@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np)
 			if (atomic_dec_and_test(&npinfo->refcnt)) {
 				skb_queue_purge(&npinfo->arp_tx);
 				skb_queue_purge(&npinfo->txq);
-				cancel_rearming_delayed_work(&npinfo->tx_work);
+				cancel_delayed_work(&npinfo->tx_work);
 				flush_scheduled_work();
 
+				/* clean after last, unfinished work */
+				if (!skb_queue_empty(&npinfo->txq)) {
+					struct sk_buff *skb;
+					skb = __skb_dequeue(&npinfo->txq);
+					kfree_skb(skb);
+				}
 				kfree(npinfo);
 			}
 		}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-06-13  9:25     ` [PATCH] " Jarek Poplawski
@ 2007-06-26 23:07       ` Andrew Morton
  2007-06-27  0:46         ` Wessel, Jason
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2007-06-26 23:07 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Folkert van Heusden, linux-kernel, Jason Wessel, Thomas Gleixner,
	stable, netdev

On Wed, 13 Jun 2007 11:25:37 +0200
Jarek Poplawski <jarkao2@o2.pl> wrote:

> On Tue, Jun 12, 2007 at 01:02:33PM +0200, Jarek Poplawski wrote:
> ...
> > Of course such a problem should preferably be fixed by somebody who
> > knows the code (alas I don't know netconsole), to be sure all needed
> > cancels are still done after this change. I hope Jason's patch is
> > right but I'm a little surprised I can't see netdev in cc (I'll try
> > to fix this).
> 
> So, I've had a look into netpoll and, unfortunately, I don't
> think this patch is right... 
> 
> > > From: Jason Wessel <jason.wessel@windriver.com>
> > > 
> > > Do not call cancel_rearming_delayed_work() if there is no
> > > pending work.
> > > 
> > > Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > ---
> > > 
> > >  net/core/netpoll.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff -puN net/core/netpoll.c~a net/core/netpoll.c
> > > --- a/net/core/netpoll.c~a
> > > +++ a/net/core/netpoll.c
> > > @@ -784,8 +784,10 @@ void netpoll_cleanup(struct netpoll *np)
> > >  			if (atomic_dec_and_test(&npinfo->refcnt)) {
> > >  				skb_queue_purge(&npinfo->arp_tx);
> > >  				skb_queue_purge(&npinfo->txq);
> > > -				cancel_rearming_delayed_work(&npinfo->tx_work);
> > > -				flush_scheduled_work();
> > > +				if (delayed_work_pending(&npinfo->tx_work)) {
> > > +					cancel_rearming_delayed_work(&npinfo->tx_work);
> > > +					flush_scheduled_work();
> > > +				}
> > >  
> > >  				kfree(npinfo);
> > >  			}
> > > _
> 
> There are such possibilities:
> 
> 1. After positive delayed_work_pending(&npinfo->tx_work) test
> some work is queued, but there is no guarantee that when running
> it'll rearm again, so cancel_rearming_delayed_work can loop again;
> 
> 2. After negative delayed_work_pending(&npinfo->tx_work) test
> a work is just running, eg. waiting on netif_tx_lock, while
> kfree(npinfo) is done here (oops?!).
> 
> I've found an additional problem here with or without this patch:
> after deleting a timer in cancel_rearming_delayed_work() there could
> stay a last skb queued in npinfo->txq, and after kfree(npinfo)
> we have small memory leak. If I'm right here similar fix is needed
> in the current netpoll code: additional npinfo->txq purging only
> or maybe the whole cancel_rearming_ changed like this.
> 
> I've tried to eliminate these problems in attached below patch
> proposal. I'm not sure it's all right: as I've written earlier I
> don't know netconsole enough, but it's probably a little better
> than above solution.
> 
> I've some doubts yet (I didn't have time to check this all):
> 
> 1. I hope this other schedule_delayed_work() from netpoll_send_skb()
> is not possible when netpoll_cleanup() runs - if I'm wrong additional
> check of npinfo->refcnt should be done there;
> 2. I also hope npinfo->refcnt before scheduling should be enough here
> - if not - another possibility is adding some locking eg.:
> netif_tx_lock before cancel for synchronization.
> 
> Of course it would be very nice if somebody could test or verify
> this patch more.
> 
> Regards,
> Jarek P.
> 
> 
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> ---
> 
> diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c
> --- 2.6.21-/net/core/netpoll.c	2007-04-26 15:08:32.000000000 +0200
> +++ 2.6.21/net/core/netpoll.c	2007-06-12 21:05:23.000000000 +0200
> @@ -73,7 +73,8 @@ static void queue_process(struct work_st
>  			netif_tx_unlock(dev);
>  			local_irq_restore(flags);
>  
> -			schedule_delayed_work(&npinfo->tx_work, HZ/10);
> +			if (atomic_read(&npinfo->refcnt))
> +				schedule_delayed_work(&npinfo->tx_work, HZ/10);
>  			return;
>  		}
>  		netif_tx_unlock(dev);
> @@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np)
>  			if (atomic_dec_and_test(&npinfo->refcnt)) {
>  				skb_queue_purge(&npinfo->arp_tx);
>  				skb_queue_purge(&npinfo->txq);
> -				cancel_rearming_delayed_work(&npinfo->tx_work);
> +				cancel_delayed_work(&npinfo->tx_work);
>  				flush_scheduled_work();
>  
> +				/* clean after last, unfinished work */
> +				if (!skb_queue_empty(&npinfo->txq)) {
> +					struct sk_buff *skb;
> +					skb = __skb_dequeue(&npinfo->txq);
> +					kfree_skb(skb);
> +				}
>  				kfree(npinfo);
>  			}
>  		}

Everything went quiet?

If this patch has been tested and fixes the bug, can you please send a
version which is ready for merging?  (ie: add a suitable description of
what it does).


Thanks.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-06-26 23:07       ` Andrew Morton
@ 2007-06-27  0:46         ` Wessel, Jason
  2007-06-27  1:00           ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Wessel, Jason @ 2007-06-27  0:46 UTC (permalink / raw)
  To: Andrew Morton, Jarek Poplawski
  Cc: Folkert van Heusden, linux-kernel, Thomas Gleixner, stable, netdev



> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org] 
> > 
> > Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> > 
> > ---
> > 
> > diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c
> > --- 2.6.21-/net/core/netpoll.c	2007-04-26 
> 15:08:32.000000000 +0200
> > +++ 2.6.21/net/core/netpoll.c	2007-06-12 
> 21:05:23.000000000 +0200
> > @@ -73,7 +73,8 @@ static void queue_process(struct work_st
> >  			netif_tx_unlock(dev);
> >  			local_irq_restore(flags);
> >  
> > -			schedule_delayed_work(&npinfo->tx_work, HZ/10);
> > +			if (atomic_read(&npinfo->refcnt))
> > +				
> schedule_delayed_work(&npinfo->tx_work, HZ/10);
> >  			return;
> >  		}
> >  		netif_tx_unlock(dev);
> > @@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np)
> >  			if (atomic_dec_and_test(&npinfo->refcnt)) {
> >  				skb_queue_purge(&npinfo->arp_tx);
> >  				skb_queue_purge(&npinfo->txq);
> > -				
> cancel_rearming_delayed_work(&npinfo->tx_work);
> > +				cancel_delayed_work(&npinfo->tx_work);
> >  				flush_scheduled_work();
> >  
> > +				/* clean after last, unfinished work */
> > +				if (!skb_queue_empty(&npinfo->txq)) {
> > +					struct sk_buff *skb;
> > +					skb = 
> __skb_dequeue(&npinfo->txq);
> > +					kfree_skb(skb);
> > +				}
> >  				kfree(npinfo);
> >  			}
> >  		}
> 
> Everything went quiet?
> 
> If this patch has been tested and fixes the bug, can you 
> please send a version which is ready for merging?  (ie: add a 
> suitable description of what it does).
> 
> 

I mailed Jarek separately.

I had tested the patch with netconsole and kgdb and it does in fact fix
the problem that was reported.

Jason. 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-06-27  0:46         ` Wessel, Jason
@ 2007-06-27  1:00           ` Andrew Morton
  2007-06-27  7:24             ` Jarek Poplawski
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2007-06-27  1:00 UTC (permalink / raw)
  To: Wessel, Jason
  Cc: Jarek Poplawski, Folkert van Heusden, linux-kernel,
	Thomas Gleixner, stable, netdev

On Tue, 26 Jun 2007 17:46:13 -0700 "Wessel, Jason" <jason.wessel@windriver.com> wrote:

> > >  			}
> > >  		}
> > 
> > Everything went quiet?
> > 
> > If this patch has been tested and fixes the bug, can you 
> > please send a version which is ready for merging?  (ie: add a 
> > suitable description of what it does).
> > 
> > 
> 
> I mailed Jarek separately.
> 
> I had tested the patch with netconsole and kgdb and it does in fact fix
> the problem that was reported.

OK, thanks.  Please don't mail people separately!

I queued this up with a null changelog for now.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
  2007-06-27  1:00           ` Andrew Morton
@ 2007-06-27  7:24             ` Jarek Poplawski
  0 siblings, 0 replies; 18+ messages in thread
From: Jarek Poplawski @ 2007-06-27  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wessel, Jason, Folkert van Heusden, linux-kernel,
	Thomas Gleixner, stable, netdev

On Tue, Jun 26, 2007 at 06:00:00PM -0700, Andrew Morton wrote:
> On Tue, 26 Jun 2007 17:46:13 -0700 "Wessel, Jason" <jason.wessel@windriver.com> wrote:
...
> > > Everything went quiet?
> > > 
> > > If this patch has been tested and fixes the bug, can you 
> > > please send a version which is ready for merging?  (ie: add a 
> > > suitable description of what it does).
> > > 
> > > 
> > 
> > I mailed Jarek separately.
> > 
> > I had tested the patch with netconsole and kgdb and it does in fact fix
> > the problem that was reported.
> 
> OK, thanks.  Please don't mail people separately!
> 
> I queued this up with a null changelog for now.
> 

I pasted here this queued version - only the changelog is added.

Regards,
Jarek P.

------------------------------------------------------
Subject: netconsole: fix soft lockup (2.6.21) and memory leak when removing module
From: Jarek Poplawski <jarkao2@o2.pl>

#1
Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
required a work function should always (unconditionally) rearm with
delay > 0 - otherwise it would endlessly loop. This patch replaces
this function with cancel_delayed_work(). Later kernel versions don't
require this, so here it's only for uniformity.

#2
After deleting a timer in cancel_[rearming_]delayed_work() there could
stay a last skb queued in npinfo->txq causing a memory leak after
kfree(npinfo).

Initial patch & testing by: Jason Wessel <jason.wessel@windriver.com>

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/core/netpoll.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff -puN net/core/netpoll.c~netconsole-fix-soft-lockup-when-removing-module net/core/netpoll.c
--- a/net/core/netpoll.c~netconsole-fix-soft-lockup-when-removing-module
+++ a/net/core/netpoll.c
@@ -72,7 +72,8 @@ static void queue_process(struct work_st
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
 
-			schedule_delayed_work(&npinfo->tx_work, HZ/10);
+			if (atomic_read(&npinfo->refcnt))
+				schedule_delayed_work(&npinfo->tx_work, HZ/10);
 			return;
 		}
 		netif_tx_unlock(dev);
@@ -785,9 +786,15 @@ void netpoll_cleanup(struct netpoll *np)
 			if (atomic_dec_and_test(&npinfo->refcnt)) {
 				skb_queue_purge(&npinfo->arp_tx);
 				skb_queue_purge(&npinfo->txq);
-				cancel_rearming_delayed_work(&npinfo->tx_work);
+				cancel_delayed_work(&npinfo->tx_work);
 				flush_scheduled_work();
 
+				/* clean after last, unfinished work */
+				if (!skb_queue_empty(&npinfo->txq)) {
+					struct sk_buff *skb;
+					skb = __skb_dequeue(&npinfo->txq);
+					kfree_skb(skb);
+				}
 				kfree(npinfo);
 			}
 		}

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-06-27  7:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-26 15:40 [2.6.21.1] soft lockup when removing netconsole module Folkert van Heusden
2007-05-26 15:53 ` Parag Warudkar
2007-05-26 16:12   ` Thomas Gleixner
2007-05-26 16:17     ` Folkert van Heusden
2007-05-26 16:35       ` Thomas Gleixner
2007-05-26 16:49         ` Folkert van Heusden
2007-05-26 17:06           ` Thomas Gleixner
2007-05-26 17:12             ` Folkert van Heusden
2007-05-27 20:32         ` Matt Mackall
2007-05-29  7:56 ` Andrew Morton
2007-05-30 13:28   ` [PATCH] " Jason Wessel
2007-05-30 20:38     ` Folkert van Heusden
2007-06-12 11:02   ` Jarek Poplawski
2007-06-13  9:25     ` [PATCH] " Jarek Poplawski
2007-06-26 23:07       ` Andrew Morton
2007-06-27  0:46         ` Wessel, Jason
2007-06-27  1:00           ` Andrew Morton
2007-06-27  7:24             ` Jarek Poplawski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).