Investigating potential flaw in scsi error handling

* Investigating potential flaw in scsi error handling
@ 2008-02-09 21:59 Elias Oltmanns
  2008-02-09 23:30 ` James Bottomley
  0 siblings, 1 reply; 8+ messages in thread
From: Elias Oltmanns @ 2008-02-09 21:59 UTC (permalink / raw)
  To: linux-scsi; +Cc: Tejun Heo

Hi there,

I'm experiencing system lockups with 2.6.24 which I believe to be
related to scsi error handling. Actually, I have patched the mainline
kernel with a disk shock protection patch [1] and in my case it is indeed
the shock protection mechanism that triggers the lockups. However, some
rather lengthy investigations have lead me to the conclusion that this
additional patch is just the means to reproduce the error condition
fairly reliably rather than the origin of the problem.

The problem has only become apparent since Tejun's commit
31cc23b34913bc173680bdc87af79e551bf8cc0d. More precisely, libata now
sets max_host_blocked and max_device_blocked to 1 for all ATA devices.
Various tests I've conducted so far have lead me to the conclusion that
a non zero return code from scsi_dispatch_command is sufficient to
trigger the problem I'm seeing provided that max_host_blocked and
max_device_blocked are set to 1.

Unfortunately, I'm a bit at a loss as to how I should proceed to find
the culprit. I can reliably reproduce the problem using the disk shock
protection patch in order to cause non zero return values from
scsi_dispatch_command. How can I find out where in the error handling of
this condition things might go wrong?

Most likely you will need further information to help me solving this
issue but perhaps you can already come up with some suggestions and tell
me what else you'd like to know.

Thanks in advance,

Elias

[1] http://article.gmane.org/gmane.linux.drivers.hdaps.devel/1094

PS: Since the disk shock protection patch is mainly concerned with an
ATA specific feature, I'm currently working on it to implement it in
libata rather than in the scsi midlayer. This doesn't change anything
with regard to the problem I've described above but has confirmed my
suspicion that it must be the return code from scsi_dispatch_command
that triggers system freeze.

^ permalink raw reply	[flat|nested] 8+ messages in thread