* qla2xxx BUG: workqueue leaked lock or atomic
@ 2007-02-26 13:31 Andre Noll
  2007-02-26 18:26 ` Andrew Vasquez
  0 siblings, 1 reply; 22+ messages in thread
From: Andre Noll @ 2007-02-26 13:31 UTC
  To: Andrew Vasquez; +Cc: linux-kernel, linux-scsi, James Bottomley

Hi

On linux-2.6.20.1, we're seeing hard lockups with 2 raid systems
connected to a qla2xxx card and used as a single volume via lvm.
The system seems to lock up only if data gets written to both raid
systems at the same time.

On a standard kernel nothing makes it to the log, the system just
freezes. So we tried a lockdep kernel which reports two BUGs during
boot, see below.

Could this be related to our problem?

Thanks
Andre


[   64.150773] Loading iSCSI transport class v2.0-724.
[   64.151096] QLogic Fibre Channel HBA Driver
[   64.151405] ACPI: PCI Interrupt 0000:05:08.0[A] -> GSI 32 (level, low) -> IRQ 32
[   64.151821] qla2xxx 0000:05:08.0: Found an ISP2422, irq 32, iobase 0xffffc20000006000
[   64.152231] qla2xxx 0000:05:08.0: Configuring PCI space...
[   64.152498] qla2xxx 0000:05:08.0: Configure NVRAM parameters...
[   64.159088] qla2xxx 0000:05:08.0: Verifying loaded RISC code...
[   74.169623] qla2xxx 0000:05:08.0: Firmware image unavailable.
[   74.169737] qla2xxx 0000:05:08.0: Firmware images can be retrieved from: ftp://ftp.qlogic.com/outgoing/linux/firmware/.
[   74.169902] qla2xxx 0000:05:08.0: Attempting to load (potentially outdated) firmware from flash.
[   74.760935] qla2xxx 0000:05:08.0: Allocated (64 KB) for EFT...
[   74.761186] qla2xxx 0000:05:08.0: Allocated (1413 KB) for firmware dump...
[   74.776988] scsi0 : qla2xxx
[   74.961451] qla2xxx 0000:05:08.0: 
[   74.961452]  QLogic Fibre Channel HBA Driver: 8.01.07-k4
[   74.961453]   QLogic HP AE369-60001 - QLA2340
[   74.961454]   ISP2422: PCI-X Mode 1 (133 MHz) @ 0000:05:08.0 hdma+, host#=0, fw=4.00.70 [IP] 
[   74.961970] ACPI: PCI Interrupt 0000:05:08.1[B] -> GSI 33 (level, low) -> IRQ 33
[   74.962296] qla2xxx 0000:05:08.1: Found an ISP2422, irq 33, iobase 0xffffc20000172000
[   74.962662] qla2xxx 0000:05:08.1: Configuring PCI space...
[   74.962914] qla2xxx 0000:05:08.1: Configure NVRAM parameters...
[   74.969494] qla2xxx 0000:05:08.1: Verifying loaded RISC code...
[   75.353426] qla2xxx 0000:05:08.0: LIP reset occured (f7f7).
[   75.385670] qla2xxx 0000:05:08.0: LIP occured (f7f7).
[   75.388282] qla2xxx 0000:05:08.0: LOOP UP detected (2 Gbps).
[   75.778656] BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
[   75.778771] 
[   75.778772] Call Trace:
[   75.778967]  <IRQ>  [<ffffffff8024b877>] trace_hardirqs_on+0xd7/0x180
[   75.779154]  [<ffffffff8052bc1b>] _spin_unlock_irq+0x2b/0x40
[   75.779271]  [<ffffffff804605d7>] qla2x00_process_completed_request+0x137/0x1d0
[   75.779424]  [<ffffffff804606f2>] qla2x00_status_entry+0x82/0xa40
[   75.779541]  [<ffffffff8024b17f>] __lock_acquire+0xcdf/0xd90
[   75.779657]  [<ffffffff8052bcb2>] _spin_unlock_irqrestore+0x42/0x60
[   75.779775]  [<ffffffff8046228e>] qla24xx_intr_handler+0x4e/0x2b0
[   75.779892]  [<ffffffff804613e1>] qla24xx_process_response_queue+0xc1/0x1c0
[   75.780012]  [<ffffffff80462414>] qla24xx_intr_handler+0x1d4/0x2b0
[   75.780131]  [<ffffffff8025e950>] handle_IRQ_event+0x20/0x60
[   75.780270]  [<ffffffff802604ad>] handle_fasteoi_irq+0xbd/0x110
[   75.780411]  [<ffffffff8020cf62>] do_IRQ+0x132/0x1a0
[   75.780545]  [<ffffffff80208430>] default_idle+0x0/0x60
[   75.780682]  [<ffffffff8020a236>] ret_from_intr+0x0/0xf
[   75.780818]  <EOI>  [<ffffffff80208467>] default_idle+0x37/0x60
[   75.781021]  [<ffffffff80208469>] default_idle+0x39/0x60
[   75.781156]  [<ffffffff80208467>] default_idle+0x37/0x60
[   75.781294]  [<ffffffff802084f1>] cpu_idle+0x61/0x90
[   75.781429]  [<ffffffff806d6f8b>] start_secondary+0x51b/0x530
[   75.781569] 
[   75.781873] scsi 0:0:0:0: Direct-Access     transtec T6100F16R1-E     342I PQ: 0 ANSI: 5
[   75.782532] BUG: workqueue leaked lock or atomic: scsi_wq_0/0x00000000/362
[   75.782678]     last function: fc_scsi_scan_rport+0x0/0x90
[   75.782878] 1 lock held by scsi_wq_0/362:
[   75.783008]  #0:  (&shost->scan_mutex){--..}, at: [<ffffffff80529fe5>] mutex_lock+0x25/0x30
[   75.783517] 
[   75.783518] Call Trace:
[   75.783754]  [<ffffffff80248319>] debug_show_held_locks+0x9/0x10
[   75.783896]  [<ffffffff8023eb49>] run_workqueue+0x149/0x1a0
[   75.784036]  [<ffffffff802427c0>] keventd_create_kthread+0x0/0x90
[   75.784180]  [<ffffffff8023edc1>] worker_thread+0x151/0x190
[   75.784322]  [<ffffffff80227e80>] default_wake_function+0x0/0x10
[   75.784463]  [<ffffffff8023ec70>] worker_thread+0x0/0x190
[   75.784600]  [<ffffffff80242a2a>] kthread+0xda/0x110
[   75.784737]  [<ffffffff8020ab08>] child_rip+0xa/0x12
[   75.784875]  [<ffffffff8052bc1b>] _spin_unlock_irq+0x2b/0x40
[   75.785014]  [<ffffffff8020a28c>] restore_args+0x0/0x30
[   75.785149]  [<ffffffff80242950>] kthread+0x0/0x110
[   75.785285]  [<ffffffff8020aafe>] child_rip+0x0/0x12
[   75.785417] 
[   84.980341] qla2xxx 0000:05:08.1: Firmware image unavailable.
[   84.980455] qla2xxx 0000:05:08.1: Firmware images can be retrieved from: ftp://ftp.qlogic.com/outgoing/linux/firmware/.
[   84.980620] qla2xxx 0000:05:08.1: Attempting to load (potentially outdated) firmware from flash.
[   85.571726] qla2xxx 0000:05:08.1: Allocated (64 KB) for EFT...
[   85.571956] qla2xxx 0000:05:08.1: Allocated (1413 KB) for firmware dump...
[   85.587766] scsi1 : qla2xxx
[   85.718476] qla2xxx 0000:05:08.1: 
[   85.718478]  QLogic Fibre Channel HBA Driver: 8.01.07-k4
[   85.718479]   QLogic HP AE369-60001 - QLA2340
[   85.718480]   ISP2422: PCI-X Mode 1 (133 MHz) @ 0000:05:08.1 hdma+, host#=1, fw=4.00.70 [IP] 
[   85.719505] sda : very big device. try to use READ CAPACITY(16).
[   85.719727] SCSI device sda: 11714863104 512-byte hdwr sectors (5998010 MB)
[   85.720114] sda: Write Protect is off
[   85.720219] sda: Mode Sense: 9b 00 00 08
[   85.720608] SCSI device sda: write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   85.721008] sda : very big device. try to use READ CAPACITY(16).
[   85.721206] SCSI device sda: 11714863104 512-byte hdwr sectors (5998010 MB)
[   85.721552] sda: Write Protect is off
[   85.721680] sda: Mode Sense: 9b 00 00 08
[   85.722088] SCSI device sda: write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   85.722298]  sda: unknown partition table
[   85.722897] sd 0:0:0:0: Attached scsi disk sda
[   85.723205] sd 0:0:0:0: Attached scsi generic sg0 type 0

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-26 13:31 qla2xxx BUG: workqueue leaked lock or atomic Andre Noll
@ 2007-02-26 18:26 ` Andrew Vasquez
  2007-02-27 10:11   ` Andre Noll
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Vasquez @ 2007-02-26 18:26 UTC
  To: Andre Noll; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

On Mon, 26 Feb 2007, Andre Noll wrote:

> On linux-2.6.20.1, we're seeing hard lockups with 2 raid systems
> connected to a qla2xxx card and used as a single volume via lvm.
> The system seems to lock up only if data gets written to both raid
> systems at the same time.
> 
> On a standard kernel nothing makes it to the log, the system just
> freezes. So we tried a lockdep kernel which reports two BUGs during
> boot, see below.
> 
> Could this be related to our problem?

Before we proceed further, could you retrieve the latest firmware
release for 24xx type HBAs:

> [   64.151096] QLogic Fibre Channel HBA Driver
> [   64.151405] ACPI: PCI Interrupt 0000:05:08.0[A] -> GSI 32 (level, low) -> IRQ 32
> [   64.151821] qla2xxx 0000:05:08.0: Found an ISP2422, irq 32, iobase 0xffffc20000006000
> [   64.152231] qla2xxx 0000:05:08.0: Configuring PCI space...
> [   64.152498] qla2xxx 0000:05:08.0: Configure NVRAM parameters...
> [   64.159088] qla2xxx 0000:05:08.0: Verifying loaded RISC code...
> [   74.169623] qla2xxx 0000:05:08.0: Firmware image unavailable.
> [   74.169737] qla2xxx 0000:05:08.0: Firmware images can be retrieved from: ftp://ftp.qlogic.com/outgoing/linux/firmware/.
> [   74.169902] qla2xxx 0000:05:08.0: Attempting to load (potentially outdated) firmware from flash.
> [   74.760935] qla2xxx 0000:05:08.0: Allocated (64 KB) for EFT...
> [   74.761186] qla2xxx 0000:05:08.0: Allocated (1413 KB) for firmware dump...
> [   74.776988] scsi0 : qla2xxx
> [   74.961451] qla2xxx 0000:05:08.0: 
> [   74.961452]  QLogic Fibre Channel HBA Driver: 8.01.07-k4
> [   74.961453]   QLogic HP AE369-60001 - QLA2340
> [   74.961454]   ISP2422: PCI-X Mode 1 (133 MHz) @ 0000:05:08.0 hdma+, host#=0, fw=4.00.70 [IP] 

You are loading some stale firmware that's left over on the card --
I'm not even sure what 4.00.70 is, as the latest release firmware is
4.00.27.  You can retrieve the image here:

	ftp://ftp.qlogic.com/outgoing/linux/firmware/ql2400_fw.bin

Let's start there... before we move on to this:

> [   75.778656] BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
> [   75.778771] 
> [   75.778772] Call Trace:
> [   75.778967]  <IRQ>  [<ffffffff8024b877>] trace_hardirqs_on+0xd7/0x180
> [   75.779154]  [<ffffffff8052bc1b>] _spin_unlock_irq+0x2b/0x40
> [   75.779271]  [<ffffffff804605d7>] qla2x00_process_completed_request+0x137/0x1d0
> [   75.779424]  [<ffffffff804606f2>] qla2x00_status_entry+0x82/0xa40
> [   75.779541]  [<ffffffff8024b17f>] __lock_acquire+0xcdf/0xd90
> [   75.779657]  [<ffffffff8052bcb2>] _spin_unlock_irqrestore+0x42/0x60
> [   75.779775]  [<ffffffff8046228e>] qla24xx_intr_handler+0x4e/0x2b0
> [   75.779892]  [<ffffffff804613e1>] qla24xx_process_response_queue+0xc1/0x1c0
> [   75.780012]  [<ffffffff80462414>] qla24xx_intr_handler+0x1d4/0x2b0
> [   75.780131]  [<ffffffff8025e950>] handle_IRQ_event+0x20/0x60

Hmm....

Regards,
Andrew Vasquez

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-26 18:26 ` Andrew Vasquez
@ 2007-02-27 10:11   ` Andre Noll
  2007-02-27 14:35     ` Andre Noll
  2007-02-27 18:51     ` Andrew Vasquez
  0 siblings, 2 replies; 22+ messages in thread
From: Andre Noll @ 2007-02-27 10:11 UTC
  To: Andrew Vasquez; +Cc: linux-kernel, linux-scsi, James Bottomley

On 10:26, Andrew Vasquez wrote:
> You are loading some stale firmware that's left over on the card --
> I'm not even sure what 4.00.70 is, as the latest release firmware is
> 4.00.27.

That's the firmware which came with the card. Anyway, I just upgraded
the firmware, but the bug remains. The backtrace differs a bit though
as now the tg3 network driver seems to be involved as well.

Thanks for your help
Andre

[   67.511167] qla2xxx 0000:05:08.0: Allocated (64 KB) for EFT...
[   67.511434] qla2xxx 0000:05:08.0: Allocated (1413 KB) for firmware dump...
[   67.531231] scsi0 : qla2xxx
[   67.854344] qla2xxx 0000:05:08.0: 
[   67.854346]  QLogic Fibre Channel HBA Driver: 8.01.07-k4
[   67.854347]   QLogic HP AE369-60001 - QLA2340
[   67.854348]   ISP2422: PCI-X Mode 1 (133 MHz) @ 0000:05:08.0 hdma+, host#=0, fw=4.00.27 [IP] 
[   67.854881] ACPI: PCI Interrupt 0000:05:08.1[B] -> GSI 33 (level, low) -> IRQ 33
[   67.855230] qla2xxx 0000:05:08.1: Found an ISP2422, irq 33, iobase 0xffffc20000012000
[   67.855645] qla2xxx 0000:05:08.1: Configuring PCI space...
[   67.855907] qla2xxx 0000:05:08.1: Configure NVRAM parameters...
[   67.862486] qla2xxx 0000:05:08.1: Verifying loaded RISC code...
[   68.106663] qla2xxx 0000:05:08.1: Allocated (64 KB) for EFT...
[   68.107058] qla2xxx 0000:05:08.1: Allocated (1413 KB) for firmware dump...
[   68.126759] scsi1 : qla2xxx
[   68.196783] Adding 6540152k swap on /dev/md2.  Priority:-1 extents:1 across:6540152k
[   68.260645] qla2xxx 0000:05:08.0: LIP reset occured (f8f7).
[   68.296027] qla2xxx 0000:05:08.0: LIP occured (f8f7).
[   68.298214] qla2xxx 0000:05:08.0: LOOP UP detected (2 Gbps).
[   68.326627] qla2xxx 0000:05:08.1: 
[   68.326628]  QLogic Fibre Channel HBA Driver: 8.01.07-k4
[   68.326630]   QLogic HP AE369-60001 - QLA2340
[   68.326631]   ISP2422: PCI-X Mode 1 (133 MHz) @ 0000:05:08.1 hdma+, host#=1, fw=4.00.27 [IP] 
[   68.504335] EXT3 FS on md1, internal journal
[   68.524627] PM: Writing back config space on device 0000:03:06.0 at offset b (was 164814e4, writing d00e11)
[   68.524644] PM: Writing back config space on device 0000:03:06.0 at offset 3 (was 804000, writing 804010)
[   68.524650] PM: Writing back config space on device 0000:03:06.0 at offset 2 (was 2000000, writing 2000010)
[   68.524657] PM: Writing back config space on device 0000:03:06.0 at offset 1 (was 2b00000, writing 2b00146)
[   68.532665] BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
[   68.532784] 
[   68.532785] Call Trace:
[   68.532979]  <IRQ>  [<ffffffff8024b877>] trace_hardirqs_on+0xd7/0x180
[   68.533168]  [<ffffffff80511f5b>] _spin_unlock_irq+0x2b/0x40
[   68.533295]  [<ffffffff88032747>] :qla2xxx:qla2x00_process_completed_request+0x137/0x1d0
[   68.533457]  [<ffffffff88032862>] :qla2xxx:qla2x00_status_entry+0x82/0xa40
[   68.533577]  [<ffffffff8024b17f>] __lock_acquire+0xcdf/0xd90
[   68.533693]  [<ffffffff80511ff2>] _spin_unlock_irqrestore+0x42/0x60
[   68.533816]  [<ffffffff880343fe>] :qla2xxx:qla24xx_intr_handler+0x4e/0x2b0
[   68.533942]  [<ffffffff88033551>] :qla2xxx:qla24xx_process_response_queue+0xc1/0x1c0
[   68.534102]  [<ffffffff88034584>] :qla2xxx:qla24xx_intr_handler+0x1d4/0x2b0
[   68.534224]  [<ffffffff8025e950>] handle_IRQ_event+0x20/0x60
[   68.534339]  [<ffffffff802604ad>] handle_fasteoi_irq+0xbd/0x110
[   68.534459]  [<ffffffff8020cf62>] do_IRQ+0x132/0x1a0
[   68.534574]  [<ffffffff8020a236>] ret_from_intr+0x0/0xf
[   68.534687]  <EOI>  [<ffffffff803ad15c>] __delay+0xc/0x20
[   68.534862]  [<ffffffff803ad1a7>] __const_udelay+0x37/0x40
[   68.534982]  [<ffffffff88006737>] :tg3:tg3_chip_reset+0x547/0x670
[   68.535103]  [<ffffffff8800df2d>] :tg3:tg3_reset_hw+0x5d/0x1790
[   68.535218]  [<ffffffff803ad1e7>] __udelay+0x37/0x40
[   68.535333]  [<ffffffff8800408d>] :tg3:_tw32_flush+0x6d/0x80
[   68.535451]  [<ffffffff88012196>] :tg3:tg3_open+0x2d6/0x610
[   68.535569]  [<ffffffff8800f6a2>] :tg3:tg3_init_hw+0x42/0x50
[   68.535687]  [<ffffffff880121a3>] :tg3:tg3_open+0x2e3/0x610
[   68.535804]  [<ffffffff804b36e3>] dev_open+0x43/0x90
[   68.535917]  [<ffffffff804b2814>] dev_change_flags+0x74/0x160
[   68.536034]  [<ffffffff804f3e66>] devinet_ioctl+0x2e6/0x730
[   68.536149]  [<ffffffff804b4bc2>] dev_ioctl+0x302/0x340
[   68.536264]  [<ffffffff803aa71b>] __up_read+0x9b/0xb0
[   68.536378]  [<ffffffff804f42fc>] inet_ioctl+0x4c/0x70
[   68.536494]  [<ffffffff804a73ec>] sock_ioctl+0x1fc/0x230
[   68.536610]  [<ffffffff8029c701>] do_ioctl+0x31/0xa0
[   68.536722]  [<ffffffff8029ca2b>] vfs_ioctl+0x2bb/0x2e0
[   68.536836]  [<ffffffff8029ca9a>] sys_ioctl+0x4a/0x80
[   68.536948]  [<ffffffff80209cee>] system_call+0x7e/0x83
[   68.537059] 
[   68.712832] scsi 0:0:0:0: Direct-Access     transtec T6100F16R1-E     342I PQ: 0 ANSI: 5
[   68.713384] sda : very big device. try to use READ CAPACITY(16).
[   68.713594] SCSI device sda: 11714863104 512-byte hdwr sectors (5998010 MB)
[   68.713976] sda: Write Protect is off
[   68.714079] sda: Mode Sense: 9b 00 00 08
[   68.714483] SCSI device sda: write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   68.714876] sda : very big device. try to use READ CAPACITY(16).
[   68.715080] SCSI device sda: 11714863104 512-byte hdwr sectors (5998010 MB)
[   68.715436] sda: Write Protect is off
[   68.715539] sda: Mode Sense: 9b 00 00 08
[   68.715944] SCSI device sda: write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   68.718244]  sda: unknown partition table
[   68.718707] sd 0:0:0:0: Attached scsi disk sda
[   68.718945] sd 0:0:0:0: Attached scsi generic sg0 type 0
[   68.719413] BUG: workqueue leaked lock or atomic: scsi_wq_0/0x00000000/2138
[   68.719556]     last function: fc_scsi_scan_rport+0x0/0x90
[   68.719754] 1 lock held by scsi_wq_0/2138:
[   68.719878]  #0:  (&shost->scan_mutex){--..}, at: [<ffffffff80510325>] mutex_lock+0x25/0x30
[   68.720380] 
[   68.720381] Call Trace:
[   68.720616]  [<ffffffff80248319>] debug_show_held_locks+0x9/0x10
[   68.720757]  [<ffffffff8023eb49>] run_workqueue+0x149/0x1a0
[   68.720891]  [<ffffffff802427c0>] keventd_create_kthread+0x0/0x90
[   68.721030]  [<ffffffff8023edc1>] worker_thread+0x151/0x190
[   68.721167]  [<ffffffff80227e80>] default_wake_function+0x0/0x10
[   68.721307]  [<ffffffff8023ec70>] worker_thread+0x0/0x190
[   68.721443]  [<ffffffff80242a2a>] kthread+0xda/0x110
[   68.721575]  [<ffffffff8020ab08>] child_rip+0xa/0x12
[   68.721709]  [<ffffffff80511f5b>] _spin_unlock_irq+0x2b/0x40
[   68.721842]  [<ffffffff8020a28c>] restore_args+0x0/0x30
[   68.721973]  [<ffffffff80242950>] kthread+0x0/0x110
[   68.722106]  [<ffffffff8020aafe>] child_rip+0x0/0x12
[   68.722240] 
[   68.762666] qla2xxx 0000:05:08.1: LIP reset occured (f7f7).
[   68.797954] qla2xxx 0000:05:08.1: LIP occured (f7f7).
[   68.800134] qla2xxx 0000:05:08.1: LOOP UP detected (2 Gbps).
[   69.127937] scsi 1:0:0:0: Direct-Access     ADVUNI   OXYGENRAID 416F  341B PQ: 0 ANSI: 3
[   69.128528] sdb : very big device. try to use READ CAPACITY(16).
[   69.128777] SCSI device sdb: 9370656768 512-byte hdwr sectors (4797776 MB)
[   69.129220] sdb: Write Protect is off
[   69.129326] sdb: Mode Sense: 8f 00 00 08
[   69.129878] SCSI device sdb: write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   69.130342] sdb : very big device. try to use READ CAPACITY(16).
[   69.130585] SCSI device sdb: 9370656768 512-byte hdwr sectors (4797776 MB)
[   69.131006] sdb: Write Protect is off
[   69.131110] sdb: Mode Sense: 8f 00 00 08
[   69.131660] SCSI device sdb: write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   69.131843]  sdb: unknown partition table
[   69.132401] sd 1:0:0:0: Attached scsi disk sdb
[   69.132624] sd 1:0:0:0: Attached scsi generic sg1 type 0

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-27 10:11   ` Andre Noll
@ 2007-02-27 14:35     ` Andre Noll
  2007-02-27 18:51     ` Andrew Vasquez
  1 sibling, 0 replies; 22+ messages in thread
From: Andre Noll @ 2007-02-27 14:35 UTC
  To: Andrew Vasquez; +Cc: linux-kernel, linux-scsi, James Bottomley

On 11:11, Andre Noll wrote:
> On 10:26, Andrew Vasquez wrote:
> > You are loading some stale firmware that's left over on the card --
> > I'm not even sure what 4.00.70 is, as the latest release firmware is
> > 4.00.27.
> 
> That's the firmware which came with the card. Anyway, I just upgraded
> the firmware, but the bug remains.

The system crashed again, btw, this time resulting in a kernel panic
instead of just locking up silently. Here's a screenshot:

	http://systemlinux.org/~maan/shots/qla2xxx-crash-huangho2.png

Regards
Andre

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-27 10:11   ` Andre Noll
  2007-02-27 14:35     ` Andre Noll
@ 2007-02-27 18:51     ` Andrew Vasquez
  2007-02-28 15:18       ` Andre Noll
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Vasquez @ 2007-02-27 18:51 UTC
  To: Andre Noll; +Cc: linux-kernel, linux-scsi, James Bottomley

On Tue, 27 Feb 2007, Andre Noll wrote:

> On 10:26, Andrew Vasquez wrote:
> > You are loading some stale firmware that's left over on the card --
> > I'm not even sure what 4.00.70 is, as the latest release firmware is
> > 4.00.27.
> 
> That's the firmware which came with the card. Anyway, I just upgraded
> the firmware, but the bug remains. The backtrace differs a bit though
> as now the tg3 network driver seems to be involved as well.
> 
> Thanks for your help
> Andre
...
> [   68.532665] BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
> [   68.532784] 
> [   68.532785] Call Trace:
> [   68.532979]  <IRQ>  [<ffffffff8024b877>] trace_hardirqs_on+0xd7/0x180
> [   68.533168]  [<ffffffff80511f5b>] _spin_unlock_irq+0x2b/0x40
> [   68.533295]  [<ffffffff88032747>] :qla2xxx:qla2x00_process_completed_request+0x137/0x1d0
> [   68.533457]  [<ffffffff88032862>] :qla2xxx:qla2x00_status_entry+0x82/0xa40
> [   68.533577]  [<ffffffff8024b17f>] __lock_acquire+0xcdf/0xd90
> [   68.533693]  [<ffffffff80511ff2>] _spin_unlock_irqrestore+0x42/0x60
> [   68.533816]  [<ffffffff880343fe>] :qla2xxx:qla24xx_intr_handler+0x4e/0x2b0
> [   68.533942]  [<ffffffff88033551>] :qla2xxx:qla24xx_process_response_queue+0xc1/0x1c0
> [   68.534102]  [<ffffffff88034584>] :qla2xxx:qla24xx_intr_handler+0x1d4/0x2b0

Ok, since 2.6.20, there has been a patch added to qla2xxx which drops the
spin_unlock_irq() call while attempting to ramp-up the queue-depth:

	commit befede3dabd204e9c546cbfbe391b29286c57da2
	Author: Seokmann Ju <seokmann.ju@qlogic.com>
	Date:   Tue Jan 9 11:37:52 2007 -0800

	    [SCSI] qla2xxx: correct locking while call starget_for_each_device()

	    Removed spin_unlock_irq()/spin_lock_irq() pairs surrounding
	    starget_for_each_device() calls.
	    As Matthew W. pointed out, starget_for_each_device() can be called under
	    a spinlock being held.
	    The change has been tested and verified on qla2xxx.ko module.
	    Thanks Matthew W. and Hisashi H. for help.

	    Signed-off-by: Andrew Vasquez <Andrew.vasquez@qlogic.com>
	    Signed-off-by: Seokmann Ju <Seokmann.ju@qlogic.com>
	    Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>

	http://marc.theaimsgroup.com/?l=linux-scsi&m=116837234904583&w=2

Could you try the latest 2.6.21-rc which contains the correction?

Regards,
Andrew Vasquez
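
For readers following the locking discussion, a minimal userspace model
(not kernel code) of the difference between the irqsave/irqrestore pair
and spin_unlock_irq() may help. The struct and the model_* and handler_*
names are hypothetical, and the sketch assumes the interrupt handler is
entered with interrupts disabled. It only tracks a single
interrupt-enable flag and does not model the lock itself.

```c
#include <stdbool.h>

/* Toy model of one CPU's interrupt-enable flag. Hypothetical, not kernel code. */
struct cpu_model {
	bool irqs_enabled;
};

/* Models spin_lock_irqsave(): remember the old state, then disable interrupts. */
static bool model_lock_irqsave(struct cpu_model *cpu)
{
	bool saved = cpu->irqs_enabled;

	cpu->irqs_enabled = false;
	return saved;
}

/* Models spin_unlock_irqrestore(): put back whatever state was saved. */
static void model_unlock_irqrestore(struct cpu_model *cpu, bool saved)
{
	cpu->irqs_enabled = saved;
}

/* Models spin_lock_irq(): disable interrupts without saving the old state. */
static void model_lock_irq(struct cpu_model *cpu)
{
	cpu->irqs_enabled = false;
}

/* Models spin_unlock_irq(): enable interrupts unconditionally. */
static void model_unlock_irq(struct cpu_model *cpu)
{
	cpu->irqs_enabled = true;
}

/*
 * Handler that drops the lock with the _irq variant around a callback.
 * Returns true if interrupts were enabled at any point inside the handler.
 */
static bool handler_with_unlock_irq(struct cpu_model *cpu)
{
	bool saved = model_lock_irqsave(cpu);
	bool saw_enabled;

	model_unlock_irq(cpu);            /* drop the lock before the callback */
	saw_enabled = cpu->irqs_enabled;  /* the callback would run here */
	model_lock_irq(cpu);              /* retake the lock afterwards */

	model_unlock_irqrestore(cpu, saved);
	return saw_enabled;
}

/* Handler that keeps the lock held across the callback. */
static bool handler_keeping_lock(struct cpu_model *cpu)
{
	bool saved = model_lock_irqsave(cpu);
	bool saw_enabled = cpu->irqs_enabled;  /* the callback would run here */

	model_unlock_irqrestore(cpu, saved);
	return saw_enabled;
}
```

In this model the first handler opens a window with interrupts enabled
even though it was entered with them disabled, which is presumably the
kind of transition trace_hardirqs_on() flagged in the traces above. The
second handler never changes the flag, which matches the description of
the commit that removed the unlock/lock pairs.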

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-27 18:51     ` Andrew Vasquez
@ 2007-02-28 15:18       ` Andre Noll
  2007-02-28 15:37         ` Andre Noll
  0 siblings, 1 reply; 22+ messages in thread
From: Andre Noll @ 2007-02-28 15:18 UTC
  To: Andrew Vasquez; +Cc: linux-kernel, linux-scsi, James Bottomley

On 10:51, Andrew Vasquez wrote:
> On Tue, 27 Feb 2007, Andre Noll wrote:
> > [   68.532665] BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
> 
> Ok, since 2.6.20, there has been a patch added to qla2xxx which drops the
> spin_unlock_irq() call while attempting to ramp-up the queue-depth:
> 
> Could you try the latest 2.6.21-rc which contains the correction?

With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
writing to both raid systems at the same time via lvm still locks up
the system within minutes.

As lockdep revealed another dm-related lock problem on this kernel,
I guess I'll have to bother the lvm people on this.

Thanks
Andre

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-28 15:18       ` Andre Noll
@ 2007-02-28 15:37         ` Andre Noll
  2007-03-07  4:39           ` Andrew Morton
  2007-03-07 18:46           ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Andre Noll @ 2007-02-28 15:37 UTC
  To: Andrew Vasquez; +Cc: linux-kernel, linux-scsi, James Bottomley

On 16:18, Andre Noll wrote:

> With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> writing to both raid systems at the same time via lvm still locks up
> the system within minutes.

Screenshot of the resulting kernel panic:

	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png

Andre

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-28 15:37         ` Andre Noll
@ 2007-03-07  4:39           ` Andrew Morton
  2007-03-07 17:09             ` Andre Noll
  2007-03-07 18:46           ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2007-03-07  4:39 UTC
  To: Andre Noll
  Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley,
	Jens Axboe, Alasdair G Kergon, Adrian Bunk

On Wed, 28 Feb 2007 16:37:22 +0100 Andre Noll <maan@systemlinux.org> wrote:

> On 16:18, Andre Noll wrote:
> 
> > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > writing to both raid systems at the same time via lvm still locks up
> > the system within minutes.
> 
> Screenshot of the resulting kernel panic:
> 
> 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> 

It died in CFQ.  Please try a different IO scheduler.  Use something
like

	echo deadline > /sys/block/sda/queue/scheduler

This could still be the old qla2xxx bug, or it could be a new qla2xxx bug,
or it could be a block bug, or it could be an LVM bug.

Adrian, can we please track this as a regression?
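
To confirm that the scheduler switch took effect, the same sysfs file
can be read back: it lists the available schedulers and marks the active
one with square brackets, e.g. "noop anticipatory deadline [cfq]". A
minimal sketch of extracting the active name follows. The function name
active_scheduler() is hypothetical, and reading the file itself is left
out so the parsing can be checked on plain strings.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed scheduler name from a line such as
 * "noop anticipatory deadline [cfq]" into out.
 * Returns 0 on success, -1 if no bracketed name is found or it
 * does not fit into out (including the terminating NUL).
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
	const char *lbr = strchr(line, '[');
	const char *rbr = lbr ? strchr(lbr, ']') : NULL;
	size_t len;

	if (!lbr || !rbr)
		return -1;

	len = (size_t)(rbr - lbr - 1);
	if (len == 0 || len >= outlen)
		return -1;

	memcpy(out, lbr + 1, len);
	out[len] = '\0';
	return 0;
}
```

After the echo above, the line read from /sys/block/sda/queue/scheduler
should yield "deadline" rather than "cfq".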

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-07  4:39           ` Andrew Morton
@ 2007-03-07 17:09             ` Andre Noll
  2007-03-07 19:45               ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Andre Noll @ 2007-03-07 17:09 UTC
  To: Andrew Morton
  Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley,
	Jens Axboe, Alasdair G Kergon, Adrian Bunk

On 20:39, Andrew Morton wrote:
> On Wed, 28 Feb 2007 16:37:22 +0100 Andre Noll <maan@systemlinux.org> wrote:
> 
> > On 16:18, Andre Noll wrote:
> > 
> > > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > > writing to both raid systems at the same time via lvm still locks up
> > > the system within minutes.
> > 
> > Screenshot of the resulting kernel panic:
> > 
> > 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> > 
> 
> It died in CFQ.  Please try a different IO scheduler.  Use something
> like
> 
> 	echo deadline > /sys/block/sda/queue/scheduler
> 
> This could still be the old qla2xxx bug, or it could be a new qla2xxx bug,
> or it could be a block bug, or it could be an LVM bug.

OK. I'm running with deadline right now. But I guess this kernel
panic was caused by an LVM bug because lockdep reported problems with
LVM. Nobody responded to my bug report on the LVM mailing list (see
http://www.redhat.com/archives/linux-lvm/2007-February/msg00102.html).

Non-working snapshots and no help from the mailing list convinced me
to ditch the lvm setup [1] in favour of linear software raid. This
means I can't do lvm-related tests any more.

BTW: Are ext3 filesystem sizes greater than 8T now officially
supported?

Thanks
Andre

[1] vg of two hardware raids, 10T together, a single lv and some snapshots
-- 

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-02-28 15:37         ` Andre Noll
  2007-03-07  4:39           ` Andrew Morton
@ 2007-03-07 18:46           ` Jens Axboe
  2007-03-08  8:52             ` Andre Noll
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2007-03-07 18:46 UTC
  To: Andre Noll; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

On Wed, Feb 28 2007, Andre Noll wrote:
> On 16:18, Andre Noll wrote:
> 
> > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > writing to both raid systems at the same time via lvm still locks up
> > the system within minutes.
> 
> Screenshot of the resulting kernel panic:
> 
> 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png

Do you have the full oops as well?

-- 
Jens Axboe


* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-07 17:09             ` Andre Noll
@ 2007-03-07 19:45               ` Andrew Morton
  2007-03-07 20:05                 ` Mingming Cao
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2007-03-07 19:45 UTC
  To: Andre Noll
  Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley,
	Jens Axboe, Alasdair G Kergon, Adrian Bunk, linux-ext4

On Wed, 7 Mar 2007 18:09:55 +0100 Andre Noll <maan@systemlinux.org> wrote:

> On 20:39, Andrew Morton wrote:
> > On Wed, 28 Feb 2007 16:37:22 +0100 Andre Noll <maan@systemlinux.org> wrote:
> > 
> > > On 16:18, Andre Noll wrote:
> > > 
> > > > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > > > writing to both raid systems at the same time via lvm still locks up
> > > > the system within minutes.
> > > 
> > > Screenshot of the resulting kernel panic:
> > > 
> > > 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> > > 
> > 
> > It died in CFQ.  Please try a different IO scheduler.  Use something
> > like
> > 
> > 	echo deadline > /sys/block/sda/queue/scheduler
> > 
> > This could still be the old qla2xxx bug, or it could be a new qla2xxx bug,
> > or it could be a block bug, or it could be an LVM bug.
> 
> OK. I'm running with deadline right now. But I guess this kernel
> panic was caused by an LVM bug because lockdep reported problems with
> LVM. Nobody responded to my bug report on the LVM mailing list (see
> http://www.redhat.com/archives/linux-lvm/2007-February/msg00102.html).
> 
> Non-working snapshots and no help from the mailing list convinced me
> to ditch the lvm setup [1] in favour of linear software raid. This
> means I can't do lvm-related tests any more.

Sigh.

> BTW: Are ext3 filesystem sizes greater than 8T now officially
> supported?

I think so, but I don't know how much 16TB testing developers and
distros are doing - perhaps the linux-ext4 denizens can tell us?

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-07 19:45               ` Andrew Morton
@ 2007-03-07 20:05                 ` Mingming Cao
  2007-03-09  9:36                   ` Andre Noll
  2007-03-12 15:22                   ` Valerie Clement
  0 siblings, 2 replies; 22+ messages in thread
From: Mingming Cao @ 2007-03-07 20:05 UTC
  To: Andrew Morton
  Cc: Andre Noll, Andrew Vasquez, linux-kernel, linux-scsi,
	James Bottomley, Jens Axboe, Alasdair G Kergon, Adrian Bunk,
	linux-ext4

On Wed, 2007-03-07 at 11:45 -0800, Andrew Morton wrote:
> On Wed, 7 Mar 2007 18:09:55 +0100 Andre Noll <maan@systemlinux.org> wrote:
> 
> > On 20:39, Andrew Morton wrote:
> > > On Wed, 28 Feb 2007 16:37:22 +0100 Andre Noll <maan@systemlinux.org> wrote:
> > > 
> > > > On 16:18, Andre Noll wrote:
> > > > 
> > > > > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > > > > writing to both raid systems at the same time via lvm still locks up
> > > > > the system within minutes.
> > > > 
> > > > Screenshot of the resulting kernel panic:
> > > > 
> > > > 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> > > > 
> > > 
> > > It died in CFQ.  Please try a different IO scheduler.  Use something
> > > like
> > > 
> > > 	echo deadline > /sys/block/sda/queue/scheduler
> > > 
> > > This could still be the old qla2xxx bug, or it could be a new qla2xxx bug,
> > > or it could be a block bug, or it could be an LVM bug.
> > 
> > OK. I'm running with deadline right now. But I guess this kernel
> > panic was caused by an LVM bug because lockdep reported problems with
> > LVM. Nobody responded to my bug report on the LVM mailing list (see
> > http://www.redhat.com/archives/linux-lvm/2007-February/msg00102.html).
> > 
> > Non-working snapshots and no help from the mailing list convinced me
> > to ditch the lvm setup [1] in favour of linear software raid. This
> > means I can't do lvm-related tests any more.
> 
> Sigh.
> 
> > BTW: Are ext3 filesystem sizes greater than 8T now officially
> > supported?
> 
> I think so, but I don't know how much 16TB testing developers and
> distros are doing - perhaps the linux-ext4 denizens can tell us?
> -

IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc.)
on 10TB ext3; I think RedHat and BULL have done similar tests on >8TB
ext3 too.

Mingming


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-07 18:46           ` Jens Axboe
@ 2007-03-08  8:52             ` Andre Noll
  2007-03-08  9:02               ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Andre Noll @ 2007-03-08  8:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

[-- Attachment #1: Type: text/plain, Size: 742 bytes --]

On 19:46, Jens Axboe wrote:
> On Wed, Feb 28 2007, Andre Noll wrote:
> > On 16:18, Andre Noll wrote:
> > 
> > > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > > writing to both raid systems at the same time via lvm still locks up
> > > the system within minutes.
> > 
> > Screenshot of the resulting kernel panic:
> > 
> > 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> 
> Do you have the full oops as well?

Unfortunately not, as there's no way to scroll up after a kernel panic
(the screenshot was taken by using a KVM switch which just sends the
video output over ethernet).

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-08  8:52             ` Andre Noll
@ 2007-03-08  9:02               ` Jens Axboe
  2007-03-08  9:33                 ` Andre Noll
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2007-03-08  9:02 UTC (permalink / raw)
  To: Andre Noll; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

On Thu, Mar 08 2007, Andre Noll wrote:
> On 19:46, Jens Axboe wrote:
> > On Wed, Feb 28 2007, Andre Noll wrote:
> > > On 16:18, Andre Noll wrote:
> > > 
> > > > With 2.6.21-rc2 I am unable to reproduce this BUG message. However,
> > > > writing to both raid systems at the same time via lvm still locks up
> > > > the system within minutes.
> > > 
> > > Screenshot of the resulting kernel panic:
> > > 
> > > 	http://systemlinux.org/~maan/shots/kernel-panic-21-rc2-huangho2.png
> > 
> > Do you have the full oops as well?
> 
> Unfortunately not, as there's no way to scroll up after a kernel panic
> (the screenshot was taken by using a KVM switch which just sends the
> video output over ethernet).

Do you still have the vmlinux? It'd be interesting to see what

$ gdb vmlinux
(gdb) l *cfq_dispatch_insert+0x28

says; here that'd be the cfqq dereference. And that must be valid: it's set
at allocation time and only cleared after free. So unless lvm issues
private requests that aren't properly allocated, this whole thing looks
very bizarre.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-08  9:02               ` Jens Axboe
@ 2007-03-08  9:33                 ` Andre Noll
  2007-03-08  9:36                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Andre Noll @ 2007-03-08  9:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

On 10:02, Jens Axboe wrote:
> Do you still have the vmlinux? It'd be interesting to see what
> 
> $ gdb vmlinux
> (gdb) l *cfq_dispatch_insert+0x28
> 
> says, 

The vmlinux in the kernel dir is dated March 5 and my bug report
was Feb 28. So I'm afraid it's gone. I tried the gdb command anyway
but it only gave me

	No symbol table is loaded.  Use the "file" command.

Sorry
Andre

-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-08  9:33                 ` Andre Noll
@ 2007-03-08  9:36                   ` Jens Axboe
  2007-03-08 10:29                     ` Andre Noll
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2007-03-08  9:36 UTC (permalink / raw)
  To: Andre Noll; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

On Thu, Mar 08 2007, Andre Noll wrote:
> On 10:02, Jens Axboe wrote:
> > Do you still have the vmlinux? It'd be interesting to see what
> > 
> > $ gdb vmlinux
> > (gdb) l *cfq_dispatch_insert+0x28
> > 
> > says, 
> 
> The vmlinux in the kernel dir is dated March 5 and my bug report
> was Feb 28. So I'm afraid it's gone. I tried the gdb command anyway
> but it only gave me
> 
> 	No symbol table is loaded.  Use the "file" command.

Yeah, you'd need CONFIG_DEBUG_INFO enabled as well. I don't think there
were any CFQ changes between Feb 28 and March 5, so you could probably
still try it out. A quicker way:

- Edit .config and set CONFIG_DEBUG_INFO=y (near the bottom)
- make oldconfig
- rm block/cfq-iosched.o
- make block/cfq-iosched.o
- gdb block/cfq-iosched.o

(gdb) l *cfq_dispatch_insert+0x28

and see what that says. Should not take you more than a minute or so,
would appreciate it!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-08  9:36                   ` Jens Axboe
@ 2007-03-08 10:29                     ` Andre Noll
  2007-03-08 10:35                       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Andre Noll @ 2007-03-08 10:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

[-- Attachment #1: Type: text/plain, Size: 1421 bytes --]

On 10:36, Jens Axboe wrote:
> - Edit .config and set CONFIG_DEBUG_INFO=y (near the bottom)
> - make oldconfig
> - rm block/cfq-iosched.o
> - make block/cfq-iosched.o
> - gdb block/cfq-iosched.o
> 
> (gdb) l *cfq_dispatch_insert+0x28
> 
> and see what that says. Should not take you more than a minute or so,
> would appreciate it!

No problem, here we go:

# gdb block/cfq-iosched.o
GNU gdb 6.4-debian
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) l *cfq_dispatch_insert+0x28
0xcf8 is in cfq_dispatch_insert (block/cfq-iosched.c:865).
860     }
861
862     static void cfq_dispatch_insert(request_queue_t *q, struct request *rq)
863     {
864             struct cfq_data *cfqd = q->elevator->elevator_data;
865             struct cfq_queue *cfqq = RQ_CFQQ(rq);
866
867             cfq_remove_request(rq);
868             cfqq->on_dispatch[rq_is_sync(rq)]++;
869             elv_dispatch_sort(q, rq);

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-08 10:29                     ` Andre Noll
@ 2007-03-08 10:35                       ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2007-03-08 10:35 UTC (permalink / raw)
  To: Andre Noll; +Cc: Andrew Vasquez, linux-kernel, linux-scsi, James Bottomley

On Thu, Mar 08 2007, Andre Noll wrote:
> On 10:36, Jens Axboe wrote:
> > - Edit .config and set CONFIG_DEBUG_INFO=y (near the bottom)
> > - make oldconfig
> > - rm block/cfq-iosched.o
> > - make block/cfq-iosched.o
> > - gdb block/cfq-iosched.o
> > 
> > (gdb) l *cfq_dispatch_insert+0x28
> > 
> > and see what that says. Should not take you more than a minute or so,
> > would appreciate it!
> 
> No problem, here we go:
> 
> # gdb block/cfq-iosched.o
> GNU gdb 6.4-debian
> Copyright 2005 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".
> 
> (gdb) l *cfq_dispatch_insert+0x28
> 0xcf8 is in cfq_dispatch_insert (block/cfq-iosched.c:865).
> 860     }
> 861
> 862     static void cfq_dispatch_insert(request_queue_t *q, struct request *rq)
> 863     {
> 864             struct cfq_data *cfqd = q->elevator->elevator_data;
> 865             struct cfq_queue *cfqq = RQ_CFQQ(rq);
> 866
> 867             cfq_remove_request(rq);
> 868             cfqq->on_dispatch[rq_is_sync(rq)]++;
> 869             elv_dispatch_sort(q, rq);

Ok, so it's ->next_rq being NULL or invalid. Similar to the report from
Dan last week, that's a bit worrisome. I'll have to look further into
that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-07 20:05                 ` Mingming Cao
@ 2007-03-09  9:36                   ` Andre Noll
  2007-03-12 15:22                   ` Valerie Clement
  1 sibling, 0 replies; 22+ messages in thread
From: Andre Noll @ 2007-03-09  9:36 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Andrew Morton, Andrew Vasquez, linux-kernel, linux-scsi,
	James Bottomley, Jens Axboe, Alasdair G Kergon, Adrian Bunk,
	linux-ext4

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

On 12:05, Mingming Cao wrote:
> > > BTW: Are ext3 filesystem sizes greater than 8T now officially
> > > supported?
> > 
> > I think so, but I don't know how much 16TB testing developers and
> > distros are doing - perhaps the linux-ext4 denizens can tell us?
> > -
> 
> IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc)
> on 10TB ext3, I think RedHat and BULL have done similar test on >8TB
> ext3 too.

Thanks. I'm asking because some days ago I tried to create a 10T ext3
filesystem on a linear software raid over two hardware raids, and it
failed horribly. mke2fs from e2fsprogs-1.39 refused to create such a
large filesystem but did it with -F, and I could mount it afterwards.
But writing data immediately produced zillions of errors and only
power-cycling the box helped.

We're now using a 7.9T filesystem on the same hardware. That seems
to work fine on 2.6.21-rc2, so I think this is an ext3 problem. I
cannot completely rule out other reasons though as the underlying
qla2xxx driver also had some problems on earlier kernels.

We'd much rather have a 10T filesystem if possible. So if you have
time to look into the issue I would be willing to recreate the 10T
filesystem and send details.

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-07 20:05                 ` Mingming Cao
  2007-03-09  9:36                   ` Andre Noll
@ 2007-03-12 15:22                   ` Valerie Clement
  2007-03-13  7:01                     ` Andreas Dilger
  1 sibling, 1 reply; 22+ messages in thread
From: Valerie Clement @ 2007-03-12 15:22 UTC (permalink / raw)
  To: cmm; +Cc: Andrew Morton, Andre Noll, Theodore Tso, linux-ext4

Mingming Cao wrote:
> On Wed, 2007-03-07 at 11:45 -0800, Andrew Morton wrote:
>> On Wed, 7 Mar 2007 18:09:55 +0100 Andre Noll <maan@systemlinux.org> wrote:
>>
>>> On 20:39, Andrew Morton wrote:
>>>> On Wed, 28 Feb 2007 16:37:22 +0100 Andre Noll <maan@systemlinux.org> wrote:
>>>>
>>> BTW: Are ext3 filesystem sizes greater than 8T now officially
>>> supported?
>> I think so, but I don't know how much 16TB testing developers and
>> distros are doing - perhaps the linux-ext4 denizens can tell us?
>> -
> 
> IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc)
> on 10TB ext3, I think RedHat and BULL have done similar test on >8TB
> ext3 too.
> 
> Mingming

Isn't there a backward-compatibility problem with old kernels?
Don't we need to handle a new INCOMPAT flag in e2fsprogs and the kernel
before allowing ext3 filesystems greater than 8T?

    Valérie

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-12 15:22                   ` Valerie Clement
@ 2007-03-13  7:01                     ` Andreas Dilger
  2007-03-13  8:23                       ` Valerie Clement
  0 siblings, 1 reply; 22+ messages in thread
From: Andreas Dilger @ 2007-03-13  7:01 UTC (permalink / raw)
  To: Valerie Clement; +Cc: cmm, Andrew Morton, Andre Noll, Theodore Tso, linux-ext4

On Mar 12, 2007  16:22 +0100, Valerie Clement wrote:
> Mingming Cao wrote:
> >IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc)
> >on 10TB ext3, I think RedHat and BULL have done similar test on >8TB
> >ext3 too.
> 
> Is there not a problem of backward-compatibility with old kernels?
> Doesn't we need to handle a new INCOMPAT flag in e2fsprogs and kernel
> before allowing ext3 filesystems greater than 8T?

No, it really depends on the kernel.  There were some bugs that caused
problems with > 8TB because of signed 32-bit int problems, so it isn't
really recommended to use > 8TB unless you know this is fixed in your
kernel (and any older kernel you might have to downgrade to).
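
The arithmetic behind the 8TB figure can be sketched as follows. This is
only an illustration, assuming 4 KiB blocks (the thread does not state the
block size); the helper names are hypothetical and are not taken from the
kernel or e2fsprogs. It checks whether the highest block number of a
filesystem of a given size still fits in a signed or an unsigned 32-bit
integer.

```c
#include <assert.h>
#include <stdint.h>

#define TIB (1ULL << 40)   /* one tebibyte in bytes */

/* Number of whole blocks in a device of `bytes` bytes (hypothetical helper). */
static uint64_t block_count(uint64_t bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return bytes / block_size;
}

/* 1 if the highest block number (count - 1) fits in a signed 32-bit int. */
static int fits_signed32(uint64_t count)
{
    return count > 0 && count - 1 <= (uint64_t)INT32_MAX;
}

/* 1 if the highest block number (count - 1) fits in an unsigned 32-bit int. */
static int fits_unsigned32(uint64_t count)
{
    return count > 0 && count - 1 <= (uint64_t)UINT32_MAX;
}
```

Under the 4 KiB assumption, 8 TiB is exactly 2^31 blocks, so its highest
block number is the largest value a signed 32-bit int can hold. A 10 TiB
filesystem needs block numbers beyond that, although they still fit in an
unsigned 32-bit value, which holds up to 16 TiB.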

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: qla2xxx BUG: workqueue leaked lock or atomic
  2007-03-13  7:01                     ` Andreas Dilger
@ 2007-03-13  8:23                       ` Valerie Clement
  0 siblings, 0 replies; 22+ messages in thread
From: Valerie Clement @ 2007-03-13  8:23 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Andre Noll, Theodore Tso, linux-ext4

Andreas Dilger wrote:
> On Mar 12, 2007  16:22 +0100, Valerie Clement wrote:
>> Mingming Cao wrote:
>>> IBM has done some testing (dbench, fsstress, fsx, tiobench, iozone etc)
>>> on 10TB ext3, I think RedHat and BULL have done similar test on >8TB
>>> ext3 too.
>> Is there not a problem of backward-compatibility with old kernels?
>> Doesn't we need to handle a new INCOMPAT flag in e2fsprogs and kernel
>> before allowing ext3 filesystems greater than 8T?
> 
> No, it really depends on the kernel.  There were some bugs that caused
> problems with > 8TB because of signed 32-bit int problems, so it isn't
> really recommended to use > 8TB unless you know this is fixed in your
> kernel (and any older kernel you might have to downgrade to).
> 

OK. Thanks.
As Andre mentioned, it seems that the "-F" option to mkfs is
necessary to create an ext3 FS > 8T.
(I see the same behavior, but I haven't applied the latest patches
to my current version of e2fsprogs, so I can't check whether that has
changed since.)
Is that the right way?

     Valérie

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2007-03-13  8:23 UTC | newest]

Thread overview: 22+ messages
2007-02-26 13:31 qla2xxx BUG: workqueue leaked lock or atomic Andre Noll
2007-02-26 18:26 ` Andrew Vasquez
2007-02-27 10:11   ` Andre Noll
2007-02-27 14:35     ` Andre Noll
2007-02-27 18:51     ` Andrew Vasquez
2007-02-28 15:18       ` Andre Noll
2007-02-28 15:37         ` Andre Noll
2007-03-07  4:39           ` Andrew Morton
2007-03-07 17:09             ` Andre Noll
2007-03-07 19:45               ` Andrew Morton
2007-03-07 20:05                 ` Mingming Cao
2007-03-09  9:36                   ` Andre Noll
2007-03-12 15:22                   ` Valerie Clement
2007-03-13  7:01                     ` Andreas Dilger
2007-03-13  8:23                       ` Valerie Clement
2007-03-07 18:46           ` Jens Axboe
2007-03-08  8:52             ` Andre Noll
2007-03-08  9:02               ` Jens Axboe
2007-03-08  9:33                 ` Andre Noll
2007-03-08  9:36                   ` Jens Axboe
2007-03-08 10:29                     ` Andre Noll
2007-03-08 10:35                       ` Jens Axboe
