* aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
@ 2007-11-30  9:22 Krzysztof Błaszkowski
  2007-11-30 21:33 ` Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2007-11-30  9:22 UTC (permalink / raw)
  To: linux-scsi; +Cc: Vladislav Bolkhovitin

Hello all,

I found this crash in syslog. Furthermore, if the aic94xx is connected to a
single SATA drive only, there is no crash, but the device is not recognized
either (with a mysterious "ERROR: Unidentified device type 5").

A crash recorded in syslog:

aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
ACPI: PCI Interrupt 0000:04:02.0[A] -> GSI 16 (level, low) -> IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device 0000:04:02.0
scsi6 : aic94xx
PM: Adding info for No Bus:host6
PM: Adding info for No Bus:0000:04:02.0
PM: Removing info for No Bus:0000:04:02.0
aic94xx: Found sequencer Firmware version 1.1 (V30)
aic94xx: device 0000:04:02.0: SAS addr 500304800004ce20, PCBA SN ORG, 8 phys, 8 enabled phys, flash present, BIOS build 1822
PM: Adding info for No Bus:phy-6:0

<snip>

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000074
 printing eip:
f8e2daf9
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: aic94xx firmware_class libsas scsi_transport_sas nfsd exportfs nvram speedstep_lib freq_table thermal processor fan button battery edd ac ipv6 evdev joydev sr_mod ide_cd cdrom e1000 ehci_hcd i2c_i801 uhci_hcd rng_core dm_mod usbcore
CPU:    0
EIP:    0060:[<f8e2daf9>]    Not tainted VLI
EFLAGS: 00010286   (2.6.22.8 #6)
EIP is at sas_rphy_add+0x9/0x100 [scsi_transport_sas]
eax: 00000000   ebx: 00000000   ecx: 00000004   edx: 00000282
esi: f2d8c080   edi: 00000000   ebp: f2d8c080   esp: f33bbe84
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process scsi_wq_6 (pid: 8265, ti=f33ba000 task=f5712a90 task.ti=f33ba000)
Stack: f2d8c080 00000000 00000000 f2d8c080 f2d8c0d7 f2d8c080 f8e8bdc2 f46b49e0 
       f2d8c114 f8e8d040 f7c4446c f704e724 f704e6c0 00000000 00000001 00000000 
       f46b49fc f8e8d741 f7c44438 00000000 f33bbedc 402e9267 f7c44380 ffffffed 
Call Trace:
 [<f8e8bdc2>] sas_discover_sata+0x42/0x80 [libsas]
 [<f8e8d040>] sas_ex_discover_end_dev+0x120/0x2d0 [libsas]
 [<f8e8d741>] sas_ex_discover_dev+0x2d1/0x470 [libsas]
 [<402e9267>] attribute_container_device_trigger+0xa7/0xb0
 [<f8e8daa3>] sas_ex_discover_devices+0x83/0xb0 [libsas]
 [<f8e8e6d3>] sas_ex_level_discovery+0x43/0x70 [libsas]
 [<f8e8e71b>] sas_ex_bfs_disc+0x1b/0x30 [libsas]
 [<f8e8e76e>] sas_discover_root_expander+0x3e/0x80 [libsas]
 [<f8e8bf40>] sas_discover_domain+0x0/0xc0 [libsas]
 [<f8e8bfea>] sas_discover_domain+0xaa/0xc0 [libsas]
 [<40131541>] run_workqueue+0x71/0x100
 [<4013167c>] worker_thread+0xac/0x110
 [<401352a0>] autoremove_wake_function+0x0/0x50
 [<401352a0>] autoremove_wake_function+0x0/0x50
 [<401315d0>] worker_thread+0x0/0x110
 [<40134d24>] kthread+0x64/0xa0
 [<40134cc0>] kthread+0x0/0xa0
 [<401048b7>] kernel_thread_helper+0x7/0x10
 =======================
Code: f0 83 c4 1c 5b 5e 5f 5d c3 0f 0b 8d b4 26 00 00 00 00 eb fe 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 55 57 89 c7 56 53 83 ec 08 <8b> 70 74 8b 5e 74 eb 0b 8b 43 74 31 d2 85 c0 74 13 89 c3 89 d8 
EIP: [<f8e2daf9>] sas_rphy_add+0x9/0x100 [scsi_transport_sas] SS:ESP 0068:f33bbe84


Let me know if you need any more information. I used the latest firmware
available from Adaptec's site.

Best regards,
Krzysztof Blaszkowski

Systemy mikroprocesorowe
Storrady 1
PL71602 Szczecin, Poland
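
A note on reading the oops above: the faulting address 00000074, together with
eax: 00000000 and the marked opcode bytes <8b> 70 74 (a load from 0x74(%eax)
into %esi), suggests that sas_rphy_add() was handed a NULL pointer and faulted
while reading a member 0x74 bytes into the structure. What follows is a minimal
userspace sketch of that arithmetic, not kernel code. struct fake_rphy, its
field names, and both helper functions are invented for illustration; only the
0x74 offset and the register values come from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the structure sas_rphy_add() received.
 * Only the offset matters here; the field names are invented.
 */
struct fake_rphy {
	unsigned char earlier_fields[0x74];
	uint32_t field_at_0x74;	/* 32-bit slot, as on the i386 box in the report */
};

/* Sanity check that the toy layout really places the member at 0x74. */
static_assert(offsetof(struct fake_rphy, field_at_0x74) == 0x74,
	      "toy layout must match the offset seen in the oops");

/* Address the CPU touches when it loads a member through 'base'. */
static uint32_t member_address(uint32_t base, size_t member_offset)
{
	return base + (uint32_t)member_offset;
}

/* Address touched by "mov 0x74(%eax),%esi" for a given value of eax. */
static uint32_t oops_fault_address(uint32_t eax)
{
	return member_address(eax, offsetof(struct fake_rphy, field_at_0x74));
}
```

With eax = 0 the sketch yields 0x74, the address in the BUG line. A valid
pointer such as the f2d8c080 held in esi would have produced f2d8c0f4 instead.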

* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-11-30  9:22 aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives Krzysztof Błaszkowski
@ 2007-11-30 21:33 ` Darrick J. Wong
  2007-12-03 15:11   ` Krzysztof Błaszkowski
  2007-12-03 16:09   ` Krzysztof Błaszkowski
  0 siblings, 2 replies; 15+ messages in thread
From: Darrick J. Wong @ 2007-11-30 21:33 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: linux-scsi, Vladislav Bolkhovitin, Alexis Bruemmer

On Fri, Nov 30, 2007 at 10:22:07AM +0100, Krzysztof Błaszkowski wrote:
> Hello all,
> 
> I noticed this according to syslog. furthermore if aic94xx is connected to 
> single sata drive only then there is no crash but device is not recognized 
> too. (mysterious: "ERROR: Unidentified device type 5").

There's been a substantial amount of bugfixes (as well as SATA support)
that went into the aic94xx/libsas code between .22 and .23; could you
please give that a try?

Also, what kind of devices are attached when the system crashes?  From
that stack trace it looks like the software thought there was a SATA
disk attached to an expander...?

--D


* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-11-30 21:33 ` Darrick J. Wong
@ 2007-12-03 15:11   ` Krzysztof Błaszkowski
  2007-12-03 16:09   ` Krzysztof Błaszkowski
  1 sibling, 0 replies; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2007-12-03 15:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-scsi, vst, Alexis Bruemmer

On Friday 30 November 2007 22:33, Darrick J. Wong wrote:
> On Fri, Nov 30, 2007 at 10:22:07AM +0100, Krzysztof Błaszkowski wrote:
> > Hello all,
> >
> > I noticed this according to syslog. furthermore if aic94xx is connected
> > to single sata drive only then there is no crash but device is not
> > recognized too. (mysterious: "ERROR: Unidentified device type 5").
>
> There's been a substantial amount of bugfixes (as well as SATA support)
> that went into the aic94xx/libsas code between .22 and .23; could you
> please give that a try?

Thank you. I've tried 2.6.23.9 and it seems to work okay; indeed, many changes
were made, some of them by you.

>
> Also, what kind of devices are attached when the system crashes?  From
> that stack trace it looks like the software thought there was a SATA
> disk attached to an expander...?

Yes, I connected the AIC to the expander (LSISASX28), which was loaded with 16
drives.

Best regards,
Krzysztof
>
> --D


* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-11-30 21:33 ` Darrick J. Wong
  2007-12-03 15:11   ` Krzysztof Błaszkowski
@ 2007-12-03 16:09   ` Krzysztof Błaszkowski
  2007-12-03 19:36     ` Darrick J. Wong
  1 sibling, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2007-12-03 16:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-scsi, vst, Alexis Bruemmer

[-- Attachment #1: Type: text/plain, Size: 354 bytes --]


I also noticed another failure when I removed a drive. Nothing reported the
removal (i.e. the block device and the corresponding sg device were still
registered), so I ran dd on this truly "virtual" drive.

dd went into D state (as did scsi_wq). I think this shouldn't happen, no
matter whether it was an AIC failure or an LSI expander failure.

>
> --D

Regards,
Krzysztof

[-- Attachment #2: hdd-removal-failure.log --]
[-- Type: text/x-log, Size: 4629 bytes --]

ata26.00: ATA-6: ST3120026AS, 3.18, max UDMA/133
ata26.00: 234441648 sectors, multi 0: LBA48 
ata26.00: ata_hpa_resize 1: hpa sectors (1) is smaller than sectors (234441648)
ata26.00: configured for UDMA/133
scsi 6:0:20:0: Direct-Access     ATA      ST3120026AS      3.18 PQ: 0 ANSI: 5
sd 6:0:20:0: [sdb] 234441648 512-byte hardware sectors (120034 MB)
sd 6:0:20:0: [sdb] Write Protect is off
sd 6:0:20:0: [sdb] Mode Sense: 00 3a 00 00
sd 6:0:20:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:20:0: [sdb] 234441648 512-byte hardware sectors (120034 MB)
sd 6:0:20:0: [sdb] Write Protect is off
sd 6:0:20:0: [sdb] Mode Sense: 00 3a 00 00
sd 6:0:20:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdb: unknown partition table
sd 6:0:20:0: [sdb] Attached SCSI disk
sd 6:0:20:0: Attached scsi generic sg1 type 0
sd 6:0:20:0: [sdb] Synchronizing SCSI cache
ata26: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata26: status=0x01 { Error }
ata26: error=0x04 { DriveStatusError }
sd 6:0:20:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
sd 6:0:19:0: [sda] Synchronizing SCSI cache
SysRq : Show Blocked State
  task                PC stack   pid father
scsi_wq_6     D 40246817     0  3727      2
       f3d7dc64 00000046 f72d5550 40246817 00000006 40128e47 f618b468 f73deac0 
       42748e00 f72d5698 f72d5550 f3d7dd30 f3d7dd34 f3d7dc80 f3d7dcb8 40402896 
       00000000 f72d5550 4011b8a0 00000000 00000000 4010fa7d f618b3fc f76d8070 
Call Trace:
 [<40246817>] elv_next_request+0xb7/0x210
 [<40128e47>] lock_timer_base+0x27/0x60
 [<40402896>] wait_for_completion+0x86/0xc0
 [<4011b8a0>] default_wake_function+0x0/0x10
 [<4010fa7d>] native_smp_send_reschedule+0x1d/0x30
 [<4011b8a0>] default_wake_function+0x0/0x10
 [<4024a511>] blk_execute_rq+0xa1/0xe0
 [<4024a770>] blk_end_sync_rq+0x0/0x30
 [<4013426b>] autoremove_wake_function+0x1b/0x50
 [<4011b8e7>] __wake_up_common+0x37/0x70
 [<403067a3>] scsi_execute+0xe3/0x110
 [<40306845>] scsi_execute_req+0x75/0xb0
 [<4031a860>] sd_sync_cache+0x70/0xb0
 [<40258ccf>] kobject_get+0xf/0x20
 [<4031ce34>] sd_shutdown+0x64/0x140
 [<4031cbe2>] sd_remove+0x32/0x70
 [<402e15c4>] __device_release_driver+0x94/0xb0
 [<402e15fe>] device_release_driver+0x1e/0x40
 [<402e0869>] bus_remove_device+0x59/0x80
 [<402dee33>] device_del+0x53/0x2c0
 [<4030bed1>] __scsi_remove_device+0x51/0x90
 [<4030bf2f>] scsi_remove_device+0x1f/0x30
 [<4030bfcf>] __scsi_remove_target+0x8f/0xc0
 [<4030c000>] __remove_child+0x0/0x20
 [<4030c018>] __remove_child+0x18/0x20
 [<402df0f2>] device_for_each_child+0x22/0x40
 [<4030c05e>] scsi_remove_target+0x3e/0x50
 [<f8d82f88>] sas_rphy_remove+0x58/0x80 [scsi_transport_sas]
 [<f8d82f28>] sas_rphy_delete+0x8/0x10 [scsi_transport_sas]
 [<f8dbb75e>] sas_unregister_dev+0x8e/0xa0 [libsas]
 [<f8dbe62f>] sas_unregister_devs_sas_addr+0x11f/0x130 [libsas]
 [<f8dbe916>] sas_rediscover_dev+0x116/0x150 [libsas]
 [<f8dbea02>] sas_rediscover+0xb2/0xe0 [libsas]
 [<f8dbb880>] sas_revalidate_domain+0x0/0x50 [libsas]
 [<f8dbea61>] sas_ex_revalidate_domain+0x31/0x70 [libsas]
 [<40130511>] run_workqueue+0x71/0x100
 [<4013061f>] worker_thread+0x7f/0xd0
 [<40134250>] autoremove_wake_function+0x0/0x50
 [<4040254a>] schedule+0x21a/0x4e0
 [<40134250>] autoremove_wake_function+0x0/0x50
 [<401305a0>] worker_thread+0x0/0xd0
 [<40133ca4>] kthread+0x64/0xa0
 [<40133c40>] kthread+0x0/0xa0
 [<40104887>] kernel_thread_helper+0x7/0x10
 =======================
dd            D 40249148     0 18935  16194
       f1fc7d88 00000086 f7d41aa0 40249148 00000000 00000000 4237b300 f3c52900 
       4273fe00 f7d41be8 f7d41aa0 4273fe00 f1fc7de4 42708a64 f1fc7d94 40402eed 
       f1fc7ddc 00000000 401517c5 4040318f 40151780 401342a0 f1fc7ddc f1fc7dd8 
Call Trace:
 [<40249148>] blk_backing_dev_unplug+0x48/0xa0
 [<40402eed>] io_schedule+0x1d/0x30
 [<401517c5>] sync_page+0x45/0x50
 [<4040318f>] __wait_on_bit_lock+0x3f/0x70
 [<40151780>] sync_page+0x0/0x50
 [<401342a0>] wake_bit_function+0x0/0x60
 [<401520ca>] __lock_page+0x9a/0xb0
 [<401342a0>] wake_bit_function+0x0/0x60
 [<401342a0>] wake_bit_function+0x0/0x60
 [<4015279e>] do_generic_mapping_read+0x22e/0x4b0
 [<40152da0>] generic_file_aio_read+0x1c0/0x1f0
 [<40152a20>] file_read_actor+0x0/0x110
 [<40172d6d>] do_sync_read+0xbd/0x110
 [<40134250>] autoremove_wake_function+0x0/0x50
 [<40116455>] do_page_fault+0x1b5/0x630
 [<4012cd5f>] sys_rt_sigaction+0x5f/0xb0
 [<40172e83>] vfs_read+0xc3/0x150
 [<401731c1>] sys_read+0x41/0x70
 [<40103c36>] sysenter_past_esp+0x5f/0x85
 [<40400000>] clip_setup+0x20/0x50
 =======================


* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-12-03 16:09   ` Krzysztof Błaszkowski
@ 2007-12-03 19:36     ` Darrick J. Wong
  2007-12-03 19:43       ` Jeff Garzik
  2007-12-03 20:06       ` Krzysztof Błaszkowski
  0 siblings, 2 replies; 15+ messages in thread
From: Darrick J. Wong @ 2007-12-03 19:36 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: linux-scsi, vst, Alexis Bruemmer

On Mon, Dec 03, 2007 at 05:09:54PM +0100, Krzysztof Błaszkowski wrote:
> 
> I noticed also another failure when i removed a drive. The event was not 
> notified by anything (ie the block device and corresponding sg were 
> registered) so i run dd on this truly "virtual" drive.
> 
> dd reached D state (as well as scsi_wq) . i think it shouldn't happen no 
> matter it was AIC failure or LSI expander failure.

"It's wireless!" ;)

Seriously, though, it's a good idea to tell the kernel that you're
about to unplug a disk before actually doing it:

echo 1 > /sys/block/sdX/device/delete

This way, the kernel can tell the disk to flush its caches long before
power actually gets removed.  Otherwise, the device removal code can
get hung up just like you observed, and whatever's in the write cache
may or may not actually get written to the media.

--D
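
For tools that need to make this request from a program rather than a shell,
the same write can be sketched in C. This is a minimal sketch, assuming the
sysfs path given in the message above; request_device_delete() is a
hypothetical helper name, and on a real system the write needs root and an
actual sdX device name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write "1\n" to a sysfs-style attribute file, as
 * "echo 1 > /sys/block/sdX/device/delete" does.
 * Returns 0 on success and -1 on any error.
 */
static int request_device_delete(const char *delete_attr_path)
{
	ssize_t written;
	int fd;

	assert(delete_attr_path != NULL);

	/* No O_CREAT: the attribute must already exist. */
	fd = open(delete_attr_path, O_WRONLY);
	if (fd < 0)
		return -1;

	written = write(fd, "1\n", 2);

	/* Report failure if either the write or the close went wrong. */
	if (close(fd) != 0 || written != 2)
		return -1;
	return 0;
}
```

A caller would pass something like "/sys/block/sdb/device/delete" and pull the
drive only after the call returns 0.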


* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-12-03 19:36     ` Darrick J. Wong
@ 2007-12-03 19:43       ` Jeff Garzik
  2007-12-03 21:31         ` Darrick J. Wong
  2007-12-03 20:06       ` Krzysztof Błaszkowski
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2007-12-03 19:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Krzysztof Błaszkowski, linux-scsi, vst, Alexis Bruemmer

Darrick J. Wong wrote:
> On Mon, Dec 03, 2007 at 05:09:54PM +0100, Krzysztof Błaszkowski wrote:
>> I noticed also another failure when i removed a drive. The event was not 
>> notified by anything (ie the block device and corresponding sg were 
>> registered) so i run dd on this truly "virtual" drive.
>>
>> dd reached D state (as well as scsi_wq) . i think it shouldn't happen no 
>> matter it was AIC failure or LSI expander failure.
> 
> "It's wireless!" ;)
> 
> Seriously, though, it's a good idea to tell the kernel that you're
> about to unplug a disk before actually doing it:
> 
> echo 1 > /sys/block/sdX/device/delete
> 
> This way, the kernel can tell the disk to flush its caches long before
> power actually gets removed.  Otherwise, the device removal code can
> get hung up just like you observed, and whatever's in the write cache
> may or may not actually get written to the media.


What you say is quite true about write cache -- you can clearly lose 
some data by hot-unplugging a device.  And there's nothing we can do 
about that.

But what do you mean by "device removal code can get hung up"?  That 
sounds like a bug we should fix.

	Jeff




* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-12-03 19:36     ` Darrick J. Wong
  2007-12-03 19:43       ` Jeff Garzik
@ 2007-12-03 20:06       ` Krzysztof Błaszkowski
  2007-12-04 22:35         ` [PATCH] libsas: Don't issue commands to devices that have been hot-removed Darrick J. Wong
  1 sibling, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2007-12-03 20:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-scsi, vst, Alexis Bruemmer

Hi Darrick,

On Monday 03 December 2007 20:36, Darrick J. Wong wrote:
> On Mon, Dec 03, 2007 at 05:09:54PM +0100, Krzysztof Błaszkowski wrote:
> > I noticed also another failure when i removed a drive. The event was not
> > notified by anything (ie the block device and corresponding sg were
> > registered) so i run dd on this truly "virtual" drive.
> >
> > dd reached D state (as well as scsi_wq) . i think it shouldn't happen no
> > matter it was AIC failure or LSI expander failure.
>
> "It's wireless!" ;)

yep :) and energy from positive thinking spins disk's plates ;) 

>
> Seriously, though, it's a good idea to tell the kernel that you're
> about to unplug a disk before actually doing it:
>
> echo 1 > /sys/block/sdX/device/delete
>
> This way, the kernel can tell the disk to flush its caches long before
> power actually gets removed.  Otherwise, the device removal code can
> get hung up just like you observed, and whatever's in the write cache
> may or may not actually get written to the media.
>

Just imagine a rainy Monday and someone who puts a hand on the drive and then
has to reboot the whole box.

Thanks,
Krzysztof
> --D


* Re: aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives
  2007-12-03 19:43       ` Jeff Garzik
@ 2007-12-03 21:31         ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2007-12-03 21:31 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Krzysztof Błaszkowski, linux-scsi, vst, Alexis Bruemmer

On Mon, Dec 03, 2007 at 02:43:09PM -0500, Jeff Garzik wrote:

> But what do you mean by "device removal code can get hung up"?  That sounds 
> like a bug we should fix.

At the moment, libsas' sas_rphy_remove function doesn't distinguish between
removing a device before or after the disk has been disconnected.
Hence, sd_shutdown tries to tell the disk to flush the write cache, even
in the case that the disk is already gone.  Maybe the solution is to
modify aic94xx to remove the device's DDB registration prior to sending
the "device gone" event to libsas so that all subsequent commands bounce
with "no such device" instead of going out to lunch.

(I'll look into this later, as I myself am going out to lunch right now.)

--D


* [PATCH] libsas: Don't issue commands to devices that have been hot-removed.
  2007-12-03 20:06       ` Krzysztof Błaszkowski
@ 2007-12-04 22:35         ` Darrick J. Wong
  2007-12-04 22:48           ` Jeff Garzik
  0 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2007-12-04 22:35 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: linux-scsi, vst, Alexis Bruemmer

Hrm... does this patch help?  You'll get a bunch of ATA/SAS disk errors
printed to the screen if you yank the disk, but at least libsas won't
get stuck waiting for the cache-flush commands to time out.
---
sd will get hung up issuing commands to flush write cache if a SAS device
is unplugged without warning.  Change libsas to reject commands to domain
devices that have already gone away.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---

 drivers/scsi/libsas/sas_ata.c       |    4 ++++
 drivers/scsi/libsas/sas_expander.c  |    3 +++
 drivers/scsi/libsas/sas_port.c      |    2 ++
 drivers/scsi/libsas/sas_scsi_host.c |    7 +++++++
 include/scsi/libsas.h               |    1 +
 5 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 0829b55..f5e5213 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -161,6 +161,10 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc)
 	unsigned int num = 0;
 	unsigned int xfer = 0;
 
+	/* If the device fell off, no sense in issuing commands */
+	if (dev->gone)
+		return AC_ERR_SYSTEM;
+
 	task = sas_alloc_task(GFP_ATOMIC);
 	if (!task)
 		return AC_ERR_SYSTEM;
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index 27674fe..4ba4d2a 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1680,6 +1680,7 @@ static void sas_unregister_ex_tree(struct domain_device *dev)
 	struct domain_device *child, *n;
 
 	list_for_each_entry_safe(child, n, &ex->children, siblings) {
+		child->gone = 1;
 		if (child->dev_type == EDGE_DEV ||
 		    child->dev_type == FANOUT_DEV)
 			sas_unregister_ex_tree(child);
@@ -1699,6 +1700,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 	list_for_each_entry_safe(child, n, &ex_dev->children, siblings) {
 		if (SAS_ADDR(child->sas_addr) ==
 		    SAS_ADDR(phy->attached_sas_addr)) {
+			child->gone = 1;
 			if (child->dev_type == EDGE_DEV ||
 			    child->dev_type == FANOUT_DEV)
 				sas_unregister_ex_tree(child);
@@ -1707,6 +1709,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 			break;
 		}
 	}
+	parent->gone = 1;
 	sas_disable_routing(parent, phy->attached_sas_addr);
 	memset(phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
 	sas_port_delete_phy(phy->port, phy->phy);
diff --git a/drivers/scsi/libsas/sas_port.c b/drivers/scsi/libsas/sas_port.c
index b6f0243..2e82097 100644
--- a/drivers/scsi/libsas/sas_port.c
+++ b/drivers/scsi/libsas/sas_port.c
@@ -144,6 +144,8 @@ void sas_deform_port(struct asd_sas_phy *phy)
 		port->port_dev->pathways--;
 
 	if (port->num_phys == 1) {
+		if (port->port_dev)
+			port->port_dev->gone = 1;
 		sas_unregister_domain_devices(port);
 		sas_port_delete(port->port);
 		port->port = NULL;
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index c29ba47..61d2679 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -228,6 +228,13 @@ int sas_queuecommand(struct scsi_cmnd *cmd,
 			goto out;
 		}
 
+		/* If the device fell off, no sense in issuing commands */
+		if (dev->gone) {
+			cmd->result = DID_BAD_TARGET << 16;
+			scsi_done(cmd);
+			goto out;
+		}
+
 		res = -ENOMEM;
 		task = sas_create_task(cmd, dev, GFP_ATOMIC);
 		if (!task)
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index 8ad7465..73c5b15 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -207,6 +207,7 @@ struct domain_device {
         };
 
         void *lldd_dev;
+	int gone;
 };
 
 struct sas_discovery_event {
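
The effect of the new gone flag can be modelled in a few lines of userspace C.
This is only a sketch under stated assumptions: the toy_ structures and
functions are invented stand-ins rather than the libsas types, and
DID_BAD_TARGET is redefined locally with the value used by the kernel's SCSI
headers (0x04) so that the example builds outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the kernel's host byte code, for userspace builds only. */
#define DID_BAD_TARGET 0x04

/* Toy stand-in for struct domain_device; not the libsas structure. */
struct toy_domain_device {
	int gone;		/* set once discovery notices the device left */
	int commands_sent;	/* commands that actually reached the "hardware" */
};

/* Toy stand-in for struct scsi_cmnd. */
struct toy_cmd {
	int result;
	bool done;
};

/* Mark a device as departed, as the patch does before unregistering it. */
static void toy_mark_gone(struct toy_domain_device *dev)
{
	assert(dev != NULL);
	dev->gone = 1;
}

/*
 * Mirrors the early exit added to sas_queuecommand(): a command for a
 * departed device is completed at once with DID_BAD_TARGET instead of
 * being sent and left to time out.
 */
static int toy_queuecommand(struct toy_domain_device *dev, struct toy_cmd *cmd)
{
	assert(dev != NULL && cmd != NULL);

	if (dev->gone) {
		cmd->result = DID_BAD_TARGET << 16;
		cmd->done = true;
		return 0;
	}

	dev->commands_sent++;	/* pretend the command went to the device */
	return 0;
}
```

In this model a cache-flush issued after toy_mark_gone() completes immediately
with an error, which corresponds to the hang reported earlier in the thread
being replaced by error messages.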


* Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.
  2007-12-04 22:35         ` [PATCH] libsas: Don't issue commands to devices that have been hot-removed Darrick J. Wong
@ 2007-12-04 22:48           ` Jeff Garzik
  2007-12-04 23:17             ` Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2007-12-04 22:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Krzysztof Błaszkowski, linux-scsi, vst, Alexis Bruemmer

Darrick J. Wong wrote:
> Hrm... does this patch help?  You'll get a bunch of ATA/SAS disk errors
> printed to the screen if you yank the disk, but at least libsas won't
> get stuck waiting for the cache-flush commands to time out.
> ---
> sd will get hung up issuing commands to flush write cache if a SAS device
> is unplugged without warning.  Change libsas to reject commands to domain
> devices that have already gone away.
> 
> Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
> ---
> 
>  drivers/scsi/libsas/sas_ata.c       |    4 ++++
>  drivers/scsi/libsas/sas_expander.c  |    3 +++
>  drivers/scsi/libsas/sas_port.c      |    2 ++
>  drivers/scsi/libsas/sas_scsi_host.c |    7 +++++++
>  include/scsi/libsas.h               |    1 +
>  5 files changed, 17 insertions(+), 0 deletions(-)

Seems sane...


> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
> index 0829b55..f5e5213 100644
> --- a/drivers/scsi/libsas/sas_ata.c
> +++ b/drivers/scsi/libsas/sas_ata.c
> @@ -161,6 +161,10 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc)
>  	unsigned int num = 0;
>  	unsigned int xfer = 0;
>  
> +	/* If the device fell off, no sense in issuing commands */
> +	if (dev->gone)
> +		return AC_ERR_SYSTEM;
> +
>  	task = sas_alloc_task(GFP_ATOMIC);
>  	if (!task)
>  		return AC_ERR_SYSTEM;

As an aside, issues like this really really imply a need to move libsas 
away from the old libata EH stuff (like brking did with ipr, in patches).

	Jeff





* Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.
  2007-12-04 22:48           ` Jeff Garzik
@ 2007-12-04 23:17             ` Darrick J. Wong
  2007-12-04 23:40               ` Jeff Garzik
                                 ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Darrick J. Wong @ 2007-12-04 23:17 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Krzysztof Błaszkowski, linux-scsi, vst, Alexis Bruemmer

On Tue, Dec 04, 2007 at 05:48:33PM -0500, Jeff Garzik wrote:

> As an aside, issues like this really really imply a need to move libsas 
> away from the old libata EH stuff (like brking did with ipr, in patches).

Hm... does the new libata EH handle the case of "device was
unplugged, don't bother trying to send any more commands"?

In general, I agree that sas-ata should adopt the new EH.
Unfortunately, I believe the old way of sas-ata configuring ATA ports is
somehow not compatible with the new EH stuff and causes a crash during
the device probe with my patch to move sas-ata to the new EH.  If I
apply the patch that migrates sas-ata to use brking's latest ata-sas
configuration mechanism (the one that creates real ata_hosts), I see
(a) lots and lots of ATA hosts getting created (one per ATA port;
possibly undesirable if you've a SAS topology with a lot of SATA disks)
and (b) NCQ disks don't seem to work if you unplug the disk and plug
it back in (unless NCQ is disabled entirely).  Jeff, by any chance have
you tried plugging SATA devices into your SAS controllers?

James Bottomley wondered if it would be easier to have sas-ata call only
into the parts of libata that convert SCSI commands to ATA taskfiles,
though I'm unsure how many wormy cans that would open.

--D


* Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.
  2007-12-04 23:17             ` Darrick J. Wong
@ 2007-12-04 23:40               ` Jeff Garzik
  2007-12-06 16:55               ` Brian King
  2008-02-25 23:39               ` Jeff Garzik
  2 siblings, 0 replies; 15+ messages in thread
From: Jeff Garzik @ 2007-12-04 23:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Krzysztof Błaszkowski, linux-scsi, vst, Alexis Bruemmer

Darrick J. Wong wrote:
> On Tue, Dec 04, 2007 at 05:48:33PM -0500, Jeff Garzik wrote:
> 
>> As an aside, issues like this really really imply a need to move libsas 
>> away from the old libata EH stuff (like brking did with ipr, in patches).
> 
> Hm... does the new libata EH handle the case of "device was
> unplugged, don't bother trying to send any more commands"?
> 
> In general, I agree that sas-ata should adopt the new EH.
> Unfortunately, I believe the old way of sas-ata configuring ATA ports is
> somehow not compatible with the new EH stuff and causes a crash during
> the device probe with my patch to move sas-ata to the new EH.  If I
> apply the patch that migrates sas-ata to use brking's latest ata-sas
> configuration mechanism (the one that creates real ata_hosts), I see
> (a) lots and lots of ATA hosts getting created (one per ATA port;
> possibly undesirable if you've a SAS topology with a lot of SATA disks)
> and (b) NCQ disks don't seem to work if you unplug the disk and plug
> it back in (unless NCQ is disabled entirely).  Jeff, by any chance have
> you tried plugging SATA devices into your SAS controllers?

aic94xx yes, bcm and mv no.

Will take a look though...


> James Bottomley wondered if it would be easier to have sas-ata call only
> into the parts of libata that convert SCSI commands to ATA taskfiles,
> though I'm unsure how many wormy cans that would open.

You want more than that.

You want to make sure libata is the place for knowledge about weird ATA 
devices, SATA quirks, ATA device error handling (to be distinguished 
from ATA /link/ error handling), and other areas.

That stuff shouldn't be duplicated, and you /really/ do not want to 
re-learn all those lessons all over again ;-)

	Jeff





* Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.
  2007-12-04 23:17             ` Darrick J. Wong
  2007-12-04 23:40               ` Jeff Garzik
@ 2007-12-06 16:55               ` Brian King
  2008-02-25 23:39               ` Jeff Garzik
  2 siblings, 0 replies; 15+ messages in thread
From: Brian King @ 2007-12-06 16:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Garzik, Krzysztof Błaszkowski, linux-scsi, vst, Alexis Bruemmer

Darrick J. Wong wrote:
> In general, I agree that sas-ata should adopt the new EH.
> Unfortunately, I believe the old way of sas-ata configuring ATA ports is
> somehow not compatible with the new EH stuff and causes a crash during
> the device probe with my patch to move sas-ata to the new EH.  If I
> apply the patch that migrates sas-ata to use brking's latest ata-sas
> configuration mechanism (the one that creates real ata_hosts), I see
> (a) lots and lots of ATA hosts getting created (one per ATA port;
> possibly undesirable if you've a SAS topology with a lot of SATA disks)

The new libata EH ends up spending more time in the error handling thread
than the old code did. One of the reasons having multiple ATA/SCSI hosts
is a good thing is that the host is the granularity of error handling, so
it prevents stalling all the other devices under that SAS HBA while we are
hitting errors on an ATAPI SATA device, for example.

Arguably, SATA users of libata already have one SCSI host per ATA port,
so my SAS patches really just bring SAS in line with that design...

-Brian

-- 
Brian King
Linux on Power Virtualization
IBM Linux Technology Center


* Re: [PATCH] libsas: Don't issue commands to devices that have been hot-removed.
  2007-12-04 23:17             ` Darrick J. Wong
  2007-12-04 23:40               ` Jeff Garzik
  2007-12-06 16:55               ` Brian King
@ 2008-02-25 23:39               ` Jeff Garzik
  2 siblings, 0 replies; 15+ messages in thread
From: Jeff Garzik @ 2008-02-25 23:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Garzik, Krzysztof Błaszkowski, linux-scsi, vst,
	Alexis Bruemmer


(digging through old email)

Darrick J. Wong wrote:
> On Tue, Dec 04, 2007 at 05:48:33PM -0500, Jeff Garzik wrote:
> 
>> As an aside, issues like this really really imply a need to move libsas 
>> away from the old libata EH stuff (like brking did with ipr, in patches).
> 
> Hm... does the new libata EH handle the case of "device was
> unplugged, don't bother trying to send any more commands"?

Yes, most certainly :)  We wouldn't have hotplug support without that...


> In general, I agree that sas-ata should adopt the new EH.
> Unfortunately, I believe the old way of sas-ata configuring ATA ports is
> somehow not compatible with the new EH stuff and causes a crash during
> the device probe with my patch to move sas-ata to the new EH.  If I
> apply the patch that migrates sas-ata to use brking's latest ata-sas
> configuration mechanism (the one that creates real ata_hosts), I see
> (a) lots and lots of ATA hosts getting created (one per ATA port;
> possibly undesirable if you've a SAS topology with a lot of SATA disks)
> and (b) NCQ disks don't seem to work if you unplug the disk and plug
> it back in (unless NCQ is disabled entirely).  Jeff, by any chance have
> you tried plugging SATA devices into your SAS controllers?

Just tested mvsas here...


> James Bottomley wondered if it would be easier to have sas-ata call only
> into the parts of libata that convert SCSI commands to ATA taskfiles,
> though I'm unsure how many wormy cans that would open.

As Brian K noted, libata-EH is heavily involved in "anything not 
hotpath read/write", including but not limited to:  PMP, hotplug, device 
probing, device revalidation, and explicit sequencing of ATA commands during 
initialization (critical for getting many ATA devices working).

You don't want to reinvent or duplicate all those ATA device 
initialization/revalidation quirks.

	Jeff





* [PATCH] libsas: Don't issue commands to devices that have been hot-removed
@ 2010-10-01 20:55 Dan Williams
  0 siblings, 0 replies; 15+ messages in thread
From: Dan Williams @ 2010-10-01 20:55 UTC (permalink / raw)
  To: james.bottomley
  Cc: Haipao Fan, linux-scsi, Jeff Garzik, Maciej Trela,
	Patrick Thomson, Jeff Skirvin, Brian King, Darrick J. Wong

From: Darrick J. Wong <djwong@us.ibm.com>

sd will get hung up issuing commands to flush write cache if a SAS
device behind the expander is unplugged without warning.  Change libsas
to reject commands to domain devices that have already gone away.

[maciej.trela@intel.com: removed setting ->gone in sas_deform_port() to
 permit sync cache commands at module removal]

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
linux-scsi-reference: <20071204223516.GA6767@tree.beaverton.ibm.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Patrick Thomson <patrick.s.thomson@intel.com>
Cc: Jeff Skirvin <jeffrey.d.skirvin@intel.com>
Tested-by: Haipao Fan <haipao.fan@intel.com>
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/scsi/libsas/sas_ata.c       |    4 ++++
 drivers/scsi/libsas/sas_expander.c  |    3 +++
 drivers/scsi/libsas/sas_scsi_host.c |    7 +++++++
 include/scsi/libsas.h               |    1 +
 4 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 042153c..da2e740 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -162,6 +162,10 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc)
 	unsigned int xfer = 0;
 	unsigned int si;
 
+	/* If the device fell off, no sense in issuing commands */
+	if (dev->gone)
+		return AC_ERR_SYSTEM;
+
 	task = sas_alloc_task(GFP_ATOMIC);
 	if (!task)
 		return AC_ERR_SYSTEM;
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index 83dd507..61d81f8 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1724,6 +1724,7 @@ static void sas_unregister_ex_tree(struct domain_device *dev)
 	struct domain_device *child, *n;
 
 	list_for_each_entry_safe(child, n, &ex->children, siblings) {
+		child->gone = 1;
 		if (child->dev_type == EDGE_DEV ||
 		    child->dev_type == FANOUT_DEV)
 			sas_unregister_ex_tree(child);
@@ -1744,6 +1745,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 			&ex_dev->children, siblings) {
 			if (SAS_ADDR(child->sas_addr) ==
 			    SAS_ADDR(phy->attached_sas_addr)) {
+				child->gone = 1;
 				if (child->dev_type == EDGE_DEV ||
 				    child->dev_type == FANOUT_DEV)
 					sas_unregister_ex_tree(child);
@@ -1752,6 +1754,7 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
 				break;
 			}
 		}
+		parent->gone = 1;
 		sas_disable_routing(parent, phy->attached_sas_addr);
 	}
 	memset(phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f0cfba9..1787bd2 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -228,6 +228,13 @@ int sas_queuecommand(struct scsi_cmnd *cmd,
 			goto out;
 		}
 
+		/* If the device fell off, no sense in issuing commands */
+		if (dev->gone) {
+			cmd->result = DID_BAD_TARGET << 16;
+			scsi_done(cmd);
+			goto out;
+		}
+
 		res = -ENOMEM;
 		task = sas_create_task(cmd, dev, GFP_ATOMIC);
 		if (!task)
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index d06e13b..3dec194 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -205,6 +205,7 @@ struct domain_device {
         };
 
         void *lldd_dev;
+	int gone;
 };
 
 struct sas_discovery_event {
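
For readers skimming the diff, the pattern it applies can be sketched outside the kernel as below. This is only an illustration, not the libsas code: every `sketch_*` and `SKETCH_*` name is hypothetical, the two structs stand in for `struct domain_device` and `struct scsi_cmnd`, and the host-byte values are assumed to mirror the kernel's `DID_OK` and `DID_BAD_TARGET`. The unregister path marks the device as gone, and the queue path checks that flag before doing any work, completing the command with an error instead of issuing it.

```c
/* Hypothetical host-byte codes, assumed to mirror DID_OK / DID_BAD_TARGET. */
enum { SKETCH_DID_OK = 0x00, SKETCH_DID_BAD_TARGET = 0x04 };

/* Stand-in for struct domain_device: only the fields the sketch needs. */
struct sketch_device {
	int gone;	/* set once the device has been hot-removed */
	int issued;	/* counts commands actually sent to the device */
};

/* Stand-in for struct scsi_cmnd: only the result word. */
struct sketch_cmd {
	int result;
};

/* Called from the (hypothetical) unregister path when the device vanishes. */
static void sketch_unregister(struct sketch_device *dev)
{
	dev->gone = 1;
}

/*
 * Queue a command. If the device is gone, complete the command immediately
 * with a bad-target result instead of building and issuing a task.
 */
static int sketch_queuecommand(struct sketch_device *dev, struct sketch_cmd *cmd)
{
	if (dev->gone) {
		cmd->result = SKETCH_DID_BAD_TARGET << 16;
		return 0;
	}

	dev->issued++;
	cmd->result = SKETCH_DID_OK << 16;
	return 0;
}
```

Checking the flag at the top of the queue path keeps the rejection cheap and means the upper layer sees a completed command with an error rather than waiting on a device that will never answer.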



end of thread, other threads:[~2010-10-01 20:54 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2007-11-30  9:22 aic94xx or libsas crash on X7DB3 supermicro with enclosure and sata drives Krzysztof Błaszkowski
2007-11-30 21:33 ` Darrick J. Wong
2007-12-03 15:11   ` Krzysztof Błaszkowski
2007-12-03 16:09   ` Krzysztof Błaszkowski
2007-12-03 19:36     ` Darrick J. Wong
2007-12-03 19:43       ` Jeff Garzik
2007-12-03 21:31         ` Darrick J. Wong
2007-12-03 20:06       ` Krzysztof Błaszkowski
2007-12-04 22:35         ` [PATCH] libsas: Don't issue commands to devices that have been hot-removed Darrick J. Wong
2007-12-04 22:48           ` Jeff Garzik
2007-12-04 23:17             ` Darrick J. Wong
2007-12-04 23:40               ` Jeff Garzik
2007-12-06 16:55               ` Brian King
2008-02-25 23:39               ` Jeff Garzik
2010-10-01 20:55 Dan Williams
