All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9] firewire-sbp2: misc hotplug related patches
@ 2008-02-03 22:00 Stefan Richter
  2008-02-03 22:03 ` [PATCH 1/9] firewire: log GUID of new devices Stefan Richter
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:00 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

Here is various stuff to hopefully improve fw-sbp2's behavior during bus
resets.  The main piece is patch 9/9 which considerably raises the
chance that ongoing I/O survives plugging and unplugging of other
devices on the same bus as the device which services the I/O.

The other patches are basically side products of patch 9/9 but contain
quite useful fixes as well.

I got quite good results with several OxSemi based SBP-2 devices, also
with an Initio based dual-LU device and LSI based devices.  A Prolific
PL-3505 based device with known buggy firmware didn't work too well; I
don't know if (but doubt that) the ieee1394 stack would do better.  My
two TI StorageLynx based devices, both with somewhat quirky firmware as
well, still almost never survive bus resets if there is also I/O.  These
devices don't show such problems with the old stack.

I simply tested HDDs with dd (read only), and CD/DVD-ROM/R/W with
cdparanoia.  The latter sometimes produced audible gaps in the ripped
audio file due to sometimes long blocked I/O when the bus was recovering
from addition or removal of devices.  I think the ability to recover
from such gaps also depends on the optical drive attached to the SBP-2
bridge.  However, dd apparently never had issues with arbitrarily long
I/O pauses.

It has to be noted that I still encountered various weird things while
abusing the hardware, i.e. all of this needs time for testing.  Also, I
don't have a device with multiple logical units beneath the same target
for testing.  The mentioned dual-LU device exposes the LUs in separate
unit directories, i.e. as separate targets.

Combined diffstat and shortlog:

 drivers/firewire/fw-device.c |   28 ++--
 drivers/firewire/fw-sbp2.c   |  224 ++++++++++++++++++++++++++---------
 drivers/ieee1394/sbp2.c      |   12 +
 drivers/ieee1394/sbp2.h      |    2
 4 files changed, 202 insertions(+), 64 deletions(-)

1/9 firewire: log GUID of new devices
2/9 firewire: fw-sbp2: add INQUIRY delay workaround
3/9 ieee1394: sbp2: add INQUIRY delay workaround
4/9 firewire: fw-sbp2: wait for completion of fetch agent reset
5/9 firewire: fw-sbp2: log bus_id at management request failures
6/9 firewire: fw-sbp2: don't add scsi_device twice
7/9 firewire: fw-sbp2: logout and login after failed reconnect
8/9 firewire: fw-sbp2: sort includes
9/9 firewire: fw-sbp2: fix I/O errors during reconnect
-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/9] firewire: log GUID of new devices
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
@ 2008-02-03 22:03 ` Stefan Richter
  2008-02-04  8:14   ` Stefan Richter
  2008-02-11 16:53   ` Jarod Wilson
  2008-02-03 22:04 ` [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround Stefan Richter
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:03 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

This should help to interpret user reports.  E.g. one can look up the
vendor OUI (first three bytes of the GUID) and thus tell what is what.

Also simplifies the math in the GUID sysfs attribute.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-device.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

Index: linux/drivers/firewire/fw-device.c
===================================================================
--- linux.orig/drivers/firewire/fw-device.c
+++ linux/drivers/firewire/fw-device.c
@@ -358,12 +358,9 @@ static ssize_t
 guid_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct fw_device *device = fw_device(dev);
-	u64 guid;
 
-	guid = ((u64)device->config_rom[3] << 32) | device->config_rom[4];
-
-	return snprintf(buf, PAGE_SIZE, "0x%016llx\n",
-			(unsigned long long)guid);
+	return snprintf(buf, PAGE_SIZE, "0x%08x%08x\n",
+			device->config_rom[3], device->config_rom[4]);
 }
 
 static struct device_attribute fw_device_attributes[] = {
@@ -723,13 +720,22 @@ static void fw_device_init(struct work_s
 	 */
 	if (atomic_cmpxchg(&device->state,
 		    FW_DEVICE_INITIALIZING,
-		    FW_DEVICE_RUNNING) == FW_DEVICE_SHUTDOWN)
+		    FW_DEVICE_RUNNING) == FW_DEVICE_SHUTDOWN) {
 		fw_device_shutdown(&device->work.work);
-	else
-		fw_notify("created new fw device %s "
-			  "(%d config rom retries, S%d00)\n",
-			  device->device.bus_id, device->config_rom_retries,
-			  1 << device->max_speed);
+	} else {
+		if (device->config_rom_retries)
+			fw_notify("created device %s: GUID %08x%08x, S%d00, "
+				  "%d config ROM retries\n",
+				  device->device.bus_id,
+				  device->config_rom[3], device->config_rom[4],
+				  1 << device->max_speed,
+				  device->config_rom_retries);
+		else
+			fw_notify("created device %s: GUID %08x%08x, S%d00\n",
+				  device->device.bus_id,
+				  device->config_rom[3], device->config_rom[4],
+				  1 << device->max_speed);
+	}
 
 	/*
 	 * Reschedule the IRM work if we just finished reading the

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
  2008-02-03 22:03 ` [PATCH 1/9] firewire: log GUID of new devices Stefan Richter
@ 2008-02-03 22:04 ` Stefan Richter
  2008-02-11 17:01   ` Jarod Wilson
  2008-02-03 22:07 ` [PATCH 3/9] ieee1394: sbp2: " Stefan Richter
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:04 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

Several different SBP-2 bridges accept a login early while the IDE
device is still powering up.  They are therefore unable to respond to
SCSI INQUIRY immediately, and the SCSI core has to retry the INQUIRY.
One of these retries is typically successful, and all is well.

But in case of Momobay FX-3A, the INQUIRY retries tend to fail entirely.
This can usually be avoided by waiting a little while after login before
letting the SCSI core send the INQUIRY.  The old sbp2 driver handles
this more gracefully for as yet unknown reasons (perhaps because it
waits for fetch agent resets to complete, unlike fw-sbp2 which quickly
proceeds after requesting the agent reset).  Therefore the workaround is
not as much necessary for sbp2.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -32,6 +32,7 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/mod_devicetable.h>
+#include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/scatterlist.h>
 #include <linux/dma-mapping.h>
@@ -82,6 +83,9 @@ MODULE_PARM_DESC(exclusive_login, "Exclu
  *   Avoids access beyond actual disk limits on devices with an off-by-one bug.
  *   Don't use this with devices which don't have this bug.
  *
+ * - delay inquiry
+ *   Wait extra SBP2_INQUIRY_DELAY seconds after login before SCSI inquiry.
+ *
  * - override internal blacklist
  *   Instead of adding to the built-in blacklist, use only the workarounds
  *   specified in the module load parameter.
@@ -91,6 +95,8 @@ MODULE_PARM_DESC(exclusive_login, "Exclu
 #define SBP2_WORKAROUND_INQUIRY_36	0x2
 #define SBP2_WORKAROUND_MODE_SENSE_8	0x4
 #define SBP2_WORKAROUND_FIX_CAPACITY	0x8
+#define SBP2_WORKAROUND_DELAY_INQUIRY	0x10
+#define SBP2_INQUIRY_DELAY		12
 #define SBP2_WORKAROUND_OVERRIDE	0x100
 
 static int sbp2_param_workarounds;
@@ -100,6 +106,7 @@ MODULE_PARM_DESC(workarounds, "Work arou
 	", 36 byte inquiry = "    __stringify(SBP2_WORKAROUND_INQUIRY_36)
 	", skip mode page 8 = "   __stringify(SBP2_WORKAROUND_MODE_SENSE_8)
 	", fix capacity = "       __stringify(SBP2_WORKAROUND_FIX_CAPACITY)
+	", delay inquiry = "      __stringify(SBP2_WORKAROUND_DELAY_INQUIRY)
 	", override internal blacklist = " __stringify(SBP2_WORKAROUND_OVERRIDE)
 	", or a combination)");
 
@@ -303,6 +310,11 @@ static const struct {
 		.workarounds		= SBP2_WORKAROUND_INQUIRY_36 |
 					  SBP2_WORKAROUND_MODE_SENSE_8,
 	},
+	/* DViCO Momobay FX-3A with TSB42AA9A bridge */ {
+		.firmware_revision	= 0x002800,
+		.model			= 0x000000,
+		.workarounds		= SBP2_WORKAROUND_DELAY_INQUIRY,
+	},
 	/* Initio bridges, actually only needed for some older ones */ {
 		.firmware_revision	= 0x000200,
 		.model			= ~0,
@@ -712,6 +724,9 @@ static void sbp2_login(struct work_struc
 	PREPARE_DELAYED_WORK(&lu->work, sbp2_reconnect);
 	sbp2_agent_reset(lu);
 
+	if (lu->tgt->workarounds & SBP2_WORKAROUND_DELAY_INQUIRY)
+		ssleep(SBP2_INQUIRY_DELAY);
+
 	memset(&eight_bytes_lun, 0, sizeof(eight_bytes_lun));
 	eight_bytes_lun.scsi_lun[0] = (lu->lun >> 8) & 0xff;
 	eight_bytes_lun.scsi_lun[1] = lu->lun & 0xff;

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 3/9] ieee1394: sbp2: add INQUIRY delay workaround
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
  2008-02-03 22:03 ` [PATCH 1/9] firewire: log GUID of new devices Stefan Richter
  2008-02-03 22:04 ` [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround Stefan Richter
@ 2008-02-03 22:07 ` Stefan Richter
  2008-02-11 17:03   ` Jarod Wilson
  2008-02-03 22:08 ` [PATCH 4/9] firewire: fw-sbp2: wait for completion of fetch agent reset Stefan Richter
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:07 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

Add the same workaround as found in fw-sbp2 for feature parity and
compatibility of the workarounds module parameter.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/ieee1394/sbp2.c |   12 ++++++++++++
 drivers/ieee1394/sbp2.h |    2 ++
 2 files changed, 14 insertions(+)

Index: linux/drivers/ieee1394/sbp2.c
===================================================================
--- linux.orig/drivers/ieee1394/sbp2.c
+++ linux/drivers/ieee1394/sbp2.c
@@ -183,6 +183,9 @@ MODULE_PARM_DESC(exclusive_login, "Exclu
  *   Avoids access beyond actual disk limits on devices with an off-by-one bug.
  *   Don't use this with devices which don't have this bug.
  *
+ * - delay inquiry
+ *   Wait extra SBP2_INQUIRY_DELAY seconds after login before SCSI inquiry.
+ *
  * - override internal blacklist
  *   Instead of adding to the built-in blacklist, use only the workarounds
  *   specified in the module load parameter.
@@ -195,6 +198,7 @@ MODULE_PARM_DESC(workarounds, "Work arou
 	", 36 byte inquiry = "    __stringify(SBP2_WORKAROUND_INQUIRY_36)
 	", skip mode page 8 = "   __stringify(SBP2_WORKAROUND_MODE_SENSE_8)
 	", fix capacity = "       __stringify(SBP2_WORKAROUND_FIX_CAPACITY)
+	", delay inquiry = "      __stringify(SBP2_WORKAROUND_DELAY_INQUIRY)
 	", override internal blacklist = " __stringify(SBP2_WORKAROUND_OVERRIDE)
 	", or a combination)");
 
@@ -357,6 +361,11 @@ static const struct {
 		.workarounds		= SBP2_WORKAROUND_INQUIRY_36 |
 					  SBP2_WORKAROUND_MODE_SENSE_8,
 	},
+	/* DViCO Momobay FX-3A with TSB42AA9A bridge */ {
+		.firmware_revision	= 0x002800,
+		.model_id		= 0x000000,
+		.workarounds		= SBP2_WORKAROUND_DELAY_INQUIRY,
+	},
 	/* Initio bridges, actually only needed for some older ones */ {
 		.firmware_revision	= 0x000200,
 		.model_id		= SBP2_ROM_VALUE_WILDCARD,
@@ -914,6 +923,9 @@ static int sbp2_start_device(struct sbp2
 	sbp2_agent_reset(lu, 1);
 	sbp2_max_speed_and_size(lu);
 
+	if (lu->workarounds & SBP2_WORKAROUND_DELAY_INQUIRY)
+		ssleep(SBP2_INQUIRY_DELAY);
+
 	error = scsi_add_device(lu->shost, 0, lu->ud->id, 0);
 	if (error) {
 		SBP2_ERR("scsi_add_device failed");
Index: linux/drivers/ieee1394/sbp2.h
===================================================================
--- linux.orig/drivers/ieee1394/sbp2.h
+++ linux/drivers/ieee1394/sbp2.h
@@ -343,6 +343,8 @@ enum sbp2lu_state_types {
 #define SBP2_WORKAROUND_INQUIRY_36	0x2
 #define SBP2_WORKAROUND_MODE_SENSE_8	0x4
 #define SBP2_WORKAROUND_FIX_CAPACITY	0x8
+#define SBP2_WORKAROUND_DELAY_INQUIRY	0x10
+#define SBP2_INQUIRY_DELAY		12
 #define SBP2_WORKAROUND_OVERRIDE	0x100
 
 #endif /* SBP2_H */

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 4/9] firewire: fw-sbp2: wait for completion of fetch agent reset
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (2 preceding siblings ...)
  2008-02-03 22:07 ` [PATCH 3/9] ieee1394: sbp2: " Stefan Richter
@ 2008-02-03 22:08 ` Stefan Richter
  2008-02-04  8:11   ` Stefan Richter
  2008-02-03 22:09 ` [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures Stefan Richter
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:08 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

Like the old sbp2 driver, wait for the write transaction to the
AGENT_RESET to complete before proceeding (after login, after reconnect,
or in SCSI error handling).

There is one occasion where AGENT_RESET is written to from atomic
context when getting DEAD status for a command ORB.  There we still
continue without waiting for the transaction to complete because this
is more difficult to fix...

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |   39 ++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -603,29 +603,46 @@ sbp2_send_management_orb(struct sbp2_log
 
 static void
 complete_agent_reset_write(struct fw_card *card, int rcode,
-			   void *payload, size_t length, void *data)
+			   void *payload, size_t length, void *done)
 {
-	struct fw_transaction *t = data;
+	complete(done);
+}
+
+static void sbp2_agent_reset(struct sbp2_logical_unit *lu)
+{
+	struct fw_device *device = fw_device(lu->tgt->unit->device.parent);
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct fw_transaction t;
+	static u32 z;
 
-	kfree(t);
+	fw_send_request(device->card, &t, TCODE_WRITE_QUADLET_REQUEST,
+			lu->tgt->node_id, lu->generation, device->max_speed,
+			lu->command_block_agent_address + SBP2_AGENT_RESET,
+			&z, sizeof(z), complete_agent_reset_write, &done);
+	wait_for_completion(&done);
 }
 
-static int sbp2_agent_reset(struct sbp2_logical_unit *lu)
+static void
+complete_agent_reset_write_no_wait(struct fw_card *card, int rcode,
+				   void *payload, size_t length, void *data)
+{
+	kfree(data);
+}
+
+static void sbp2_agent_reset_no_wait(struct sbp2_logical_unit *lu)
 {
 	struct fw_device *device = fw_device(lu->tgt->unit->device.parent);
 	struct fw_transaction *t;
-	static u32 zero;
+	static u32 z;
 
-	t = kzalloc(sizeof(*t), GFP_ATOMIC);
+	t = kmalloc(sizeof(*t), GFP_ATOMIC);
 	if (t == NULL)
-		return -ENOMEM;
+		return;
 
 	fw_send_request(device->card, t, TCODE_WRITE_QUADLET_REQUEST,
 			lu->tgt->node_id, lu->generation, device->max_speed,
 			lu->command_block_agent_address + SBP2_AGENT_RESET,
-			&zero, sizeof(zero), complete_agent_reset_write, t);
-
-	return 0;
+			&z, sizeof(z), complete_agent_reset_write_no_wait, t);
 }
 
 static void sbp2_release_target(struct kref *kref)
@@ -1110,7 +1127,7 @@ complete_command_orb(struct sbp2_orb *ba
 
 	if (status != NULL) {
 		if (STATUS_GET_DEAD(*status))
-			sbp2_agent_reset(orb->lu);
+			sbp2_agent_reset_no_wait(orb->lu);
 
 		switch (STATUS_GET_RESPONSE(*status)) {
 		case SBP2_STATUS_REQUEST_COMPLETE:

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (3 preceding siblings ...)
  2008-02-03 22:08 ` [PATCH 4/9] firewire: fw-sbp2: wait for completion of fetch agent reset Stefan Richter
@ 2008-02-03 22:09 ` Stefan Richter
  2008-02-11 17:16   ` Jarod Wilson
  2008-02-03 22:10 ` [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice Stefan Richter
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:09 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

for easier readable logs if more than one SBP-2 device is present.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |   66 ++++++++++++++++++-------------------
 1 file changed, 33 insertions(+), 33 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -149,6 +149,7 @@ struct sbp2_target {
 	struct kref kref;
 	struct fw_unit *unit;
 	struct list_head lu_list;
+	const char *bus_id;
 
 	u64 management_agent_address;
 	int directory_id;
@@ -566,20 +567,20 @@ sbp2_send_management_orb(struct sbp2_log
 
 	retval = -EIO;
 	if (sbp2_cancel_orbs(lu) == 0) {
-		fw_error("orb reply timed out, rcode=0x%02x\n",
-			 orb->base.rcode);
+		fw_error("%s: orb reply timed out, rcode=0x%02x\n",
+			 lu->tgt->bus_id, orb->base.rcode);
 		goto out;
 	}
 
 	if (orb->base.rcode != RCODE_COMPLETE) {
-		fw_error("management write failed, rcode 0x%02x\n",
-			 orb->base.rcode);
+		fw_error("%s: management write failed, rcode 0x%02x\n",
+			 lu->tgt->bus_id, orb->base.rcode);
 		goto out;
 	}
 
 	if (STATUS_GET_RESPONSE(orb->status) != 0 ||
 	    STATUS_GET_SBP_STATUS(orb->status) != 0) {
-		fw_error("error status: %d:%d\n",
+		fw_error("%s: error status: %d:%d\n", lu->tgt->bus_id,
 			 STATUS_GET_RESPONSE(orb->status),
 			 STATUS_GET_SBP_STATUS(orb->status));
 		goto out;
@@ -664,7 +665,7 @@ static void sbp2_release_target(struct k
 		kfree(lu);
 	}
 	scsi_remove_host(shost);
-	fw_notify("released %s\n", tgt->unit->device.bus_id);
+	fw_notify("released %s\n", tgt->bus_id);
 
 	put_device(&tgt->unit->device);
 	scsi_host_put(shost);
@@ -693,12 +694,11 @@ static void sbp2_login(struct work_struc
 {
 	struct sbp2_logical_unit *lu =
 		container_of(work, struct sbp2_logical_unit, work.work);
-	struct Scsi_Host *shost =
-		container_of((void *)lu->tgt, struct Scsi_Host, hostdata[0]);
+	struct sbp2_target *tgt = lu->tgt;
+	struct fw_device *device = fw_device(tgt->unit->device.parent);
+	struct Scsi_Host *shost;
 	struct scsi_device *sdev;
 	struct scsi_lun eight_bytes_lun;
-	struct fw_unit *unit = lu->tgt->unit;
-	struct fw_device *device = fw_device(unit->device.parent);
 	struct sbp2_login_response response;
 	int generation, node_id, local_node_id;
 
@@ -715,14 +715,14 @@ static void sbp2_login(struct work_struc
 		if (lu->retries++ < 5)
 			sbp2_queue_work(lu, DIV_ROUND_UP(HZ, 5));
 		else
-			fw_error("failed to login to %s LUN %04x\n",
-				 unit->device.bus_id, lu->lun);
+			fw_error("%s: failed to login to LUN %04x\n",
+				 tgt->bus_id, lu->lun);
 		goto out;
 	}
 
-	lu->generation        = generation;
-	lu->tgt->node_id      = node_id;
-	lu->tgt->address_high = local_node_id << 16;
+	lu->generation    = generation;
+	tgt->node_id	  = node_id;
+	tgt->address_high = local_node_id << 16;
 
 	/* Get command block agent offset and login id. */
 	lu->command_block_agent_address =
@@ -730,8 +730,8 @@ static void sbp2_login(struct work_struc
 		response.command_block_agent.low;
 	lu->login_id = LOGIN_RESPONSE_GET_LOGIN_ID(response);
 
-	fw_notify("logged in to %s LUN %04x (%d retries)\n",
-		  unit->device.bus_id, lu->lun, lu->retries);
+	fw_notify("%s: logged in to LUN %04x (%d retries)\n",
+		  tgt->bus_id, lu->lun, lu->retries);
 
 #if 0
 	/* FIXME: The linux1394 sbp2 does this last step. */
@@ -747,6 +747,7 @@ static void sbp2_login(struct work_struc
 	memset(&eight_bytes_lun, 0, sizeof(eight_bytes_lun));
 	eight_bytes_lun.scsi_lun[0] = (lu->lun >> 8) & 0xff;
 	eight_bytes_lun.scsi_lun[1] = lu->lun & 0xff;
+	shost = container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
 
 	sdev = __scsi_add_device(shost, 0, 0,
 				 scsilun_to_int(&eight_bytes_lun), lu);
@@ -791,7 +792,7 @@ static void sbp2_login(struct work_struc
 	 */
 	PREPARE_DELAYED_WORK(&lu->work, sbp2_login);
  out:
-	sbp2_target_put(lu->tgt);
+	sbp2_target_put(tgt);
 }
 
 static int sbp2_add_logical_unit(struct sbp2_target *tgt, int lun_entry)
@@ -874,7 +875,7 @@ static int sbp2_scan_unit_dir(struct sbp
 			if (timeout > tgt->mgt_orb_timeout)
 				fw_notify("%s: config rom contains %ds "
 					  "management ORB timeout, limiting "
-					  "to %ds\n", tgt->unit->device.bus_id,
+					  "to %ds\n", tgt->bus_id,
 					  timeout / 1000,
 					  tgt->mgt_orb_timeout / 1000);
 			break;
@@ -902,7 +903,7 @@ static void sbp2_init_workarounds(struct
 	if (w)
 		fw_notify("Please notify linux1394-devel@lists.sourceforge.net "
 			  "if you need the workarounds parameter for %s\n",
-			  tgt->unit->device.bus_id);
+			  tgt->bus_id);
 
 	if (w & SBP2_WORKAROUND_OVERRIDE)
 		goto out;
@@ -924,8 +925,7 @@ static void sbp2_init_workarounds(struct
 	if (w)
 		fw_notify("Workarounds for %s: 0x%x "
 			  "(firmware_revision 0x%06x, model_id 0x%06x)\n",
-			  tgt->unit->device.bus_id,
-			  w, firmware_revision, model);
+			  tgt->bus_id, w, firmware_revision, model);
 	tgt->workarounds = w;
 }
 
@@ -949,6 +949,7 @@ static int sbp2_probe(struct device *dev
 	tgt->unit = unit;
 	kref_init(&tgt->kref);
 	INIT_LIST_HEAD(&tgt->lu_list);
+	tgt->bus_id = unit->device.bus_id;
 
 	if (fw_device_enable_phys_dma(device) < 0)
 		goto fail_shost_put;
@@ -999,8 +1000,8 @@ static void sbp2_reconnect(struct work_s
 {
 	struct sbp2_logical_unit *lu =
 		container_of(work, struct sbp2_logical_unit, work.work);
-	struct fw_unit *unit = lu->tgt->unit;
-	struct fw_device *device = fw_device(unit->device.parent);
+	struct sbp2_target *tgt = lu->tgt;
+	struct fw_device *device = fw_device(tgt->unit->device.parent);
 	int generation, node_id, local_node_id;
 
 	if (fw_device_is_shutdown(device))
@@ -1015,8 +1016,7 @@ static void sbp2_reconnect(struct work_s
 				     SBP2_RECONNECT_REQUEST,
 				     lu->login_id, NULL) < 0) {
 		if (lu->retries++ >= 5) {
-			fw_error("failed to reconnect to %s\n",
-				 unit->device.bus_id);
+			fw_error("%s: failed to reconnect\n", tgt->bus_id);
 			/* Fall back and try to log in again. */
 			lu->retries = 0;
 			PREPARE_DELAYED_WORK(&lu->work, sbp2_login);
@@ -1025,17 +1025,17 @@ static void sbp2_reconnect(struct work_s
 		goto out;
 	}
 
-	lu->generation        = generation;
-	lu->tgt->node_id      = node_id;
-	lu->tgt->address_high = local_node_id << 16;
+	lu->generation    = generation;
+	tgt->node_id      = node_id;
+	tgt->address_high = local_node_id << 16;
 
-	fw_notify("reconnected to %s LUN %04x (%d retries)\n",
-		  unit->device.bus_id, lu->lun, lu->retries);
+	fw_notify("%s: reconnected to LUN %04x (%d retries)\n",
+		  tgt->bus_id, lu->lun, lu->retries);
 
 	sbp2_agent_reset(lu);
 	sbp2_cancel_orbs(lu);
  out:
-	sbp2_target_put(lu->tgt);
+	sbp2_target_put(tgt);
 }
 
 static void sbp2_update(struct fw_unit *unit)
@@ -1377,7 +1377,7 @@ static int sbp2_scsi_abort(struct scsi_c
 {
 	struct sbp2_logical_unit *lu = cmd->device->hostdata;
 
-	fw_notify("sbp2_scsi_abort\n");
+	fw_notify("%s: sbp2_scsi_abort\n", lu->tgt->bus_id);
 	sbp2_agent_reset(lu);
 	sbp2_cancel_orbs(lu);
 

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (4 preceding siblings ...)
  2008-02-03 22:09 ` [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures Stefan Richter
@ 2008-02-03 22:10 ` Stefan Richter
  2008-02-11 17:19   ` Jarod Wilson
  2008-02-03 22:11 ` [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect Stefan Richter
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:10 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

When a reconnect failed but re-login succeeded, __scsi_add_device was
called again.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -741,6 +741,12 @@ static void sbp2_login(struct work_struc
 	PREPARE_DELAYED_WORK(&lu->work, sbp2_reconnect);
 	sbp2_agent_reset(lu);
 
+	/* This was a re-login. */
+	if (lu->sdev) {
+		sbp2_cancel_orbs(lu);
+		goto out;
+	}
+
 	if (lu->tgt->workarounds & SBP2_WORKAROUND_DELAY_INQUIRY)
 		ssleep(SBP2_INQUIRY_DELAY);
 

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (5 preceding siblings ...)
  2008-02-03 22:10 ` [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice Stefan Richter
@ 2008-02-03 22:11 ` Stefan Richter
  2008-02-11 17:32   ` Jarod Wilson
  2008-02-03 22:12 ` [PATCH 8/9] firewire: fw-sbp2: sort includes Stefan Richter
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:11 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

If fw-sbp2 was too late with requesting the reconnect, the target would
reject this.  In this case, log out before attempting the reconnect.
Else several firmwares will deny the re-login because they somehow
didn't invalidate the old login.

Also, don't retry reconnects in this situation.  The retries won't
succeed either.

These changes improve chances for successful re-login and shorten the
period during which the logical unit is inaccessible.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -710,6 +710,11 @@ static void sbp2_login(struct work_struc
 	node_id       = device->node_id;
 	local_node_id = device->card->node_id;
 
+	/* If this is a re-login attempt, log out, or we might be rejected. */
+	if (lu->sdev)
+		sbp2_send_management_orb(lu, device->node_id, generation,
+				SBP2_LOGOUT_REQUEST, lu->login_id, NULL);
+
 	if (sbp2_send_management_orb(lu, node_id, generation,
 				SBP2_LOGIN_REQUEST, lu->lun, &response) < 0) {
 		if (lu->retries++ < 5)
@@ -1021,9 +1026,17 @@ static void sbp2_reconnect(struct work_s
 	if (sbp2_send_management_orb(lu, node_id, generation,
 				     SBP2_RECONNECT_REQUEST,
 				     lu->login_id, NULL) < 0) {
-		if (lu->retries++ >= 5) {
+		/*
+		 * If reconnect was impossible even though we are in the
+		 * current generation, fall back and try to log in again.
+		 *
+		 * We could check for "Function rejected" status, but
+		 * looking at the bus generation as simpler and more general.
+		 */
+		smp_rmb(); /* get current card generation */
+		if (generation == device->card->generation ||
+		    lu->retries++ >= 5) {
 			fw_error("%s: failed to reconnect\n", tgt->bus_id);
-			/* Fall back and try to log in again. */
 			lu->retries = 0;
 			PREPARE_DELAYED_WORK(&lu->work, sbp2_login);
 		}

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 8/9] firewire: fw-sbp2: sort includes
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (6 preceding siblings ...)
  2008-02-03 22:11 ` [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect Stefan Richter
@ 2008-02-03 22:12 ` Stefan Richter
  2008-02-03 22:13 ` [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect Stefan Richter
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:12 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -28,15 +28,15 @@
  * and many others.
  */
 
+#include <linux/blkdev.h>
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/dma-mapping.h>
 #include <linux/kernel.h>
+#include <linux/mod_devicetable.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-#include <linux/mod_devicetable.h>
-#include <linux/delay.h>
-#include <linux/device.h>
 #include <linux/scatterlist.h>
-#include <linux/dma-mapping.h>
-#include <linux/blkdev.h>
 #include <linux/string.h>
 #include <linux/stringify.h>
 #include <linux/timer.h>
@@ -48,9 +48,9 @@
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_host.h>
 
-#include "fw-transaction.h"
-#include "fw-topology.h"
 #include "fw-device.h"
+#include "fw-topology.h"
+#include "fw-transaction.h"
 
 /*
  * So far only bridges from Oxford Semiconductor are known to support

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (7 preceding siblings ...)
  2008-02-03 22:12 ` [PATCH 8/9] firewire: fw-sbp2: sort includes Stefan Richter
@ 2008-02-03 22:13 ` Stefan Richter
  2008-02-11 18:09   ` Jarod Wilson
  2008-02-04 15:54 ` [PATCH 0/9] firewire-sbp2: misc hotplug related patches John Stoffel
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-03 22:13 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

While fw-sbp2 takes the necessary time to reconnect to a logical unit
after bus reset, the SCSI core keeps sending new commands.  They are all
immediately completed with host busy status, and application clients or
filesystems will break quickly.  The SCSI device might even be taken
offline:  http://bugzilla.kernel.org/show_bug.cgi?id=9734

The only remedy seems to be to block the SCSI device until reconnect.
Alas the SCSI core has no useful API to block only one logical unit i.e.
the scsi_device, therefore we block the entire Scsi_Host.  This
currently corresponds to an SBP-2 target.  In case of targets with
multiple logical units, we need to satisfy the dependencies between
logical units by carefully tracking the blocking state of the target and
its units.  We block all logical units of a target as soon as one of
them needs to be blocked, and keep them blocked until all of them are
ready to be unblocked.

Furthermore, as the history of the old sbp2 driver has shown, the
scsi_block_requests() API is a minefield with high potential of
deadlocks.  We therefore take extra measures to keep logical units
unblocked during __scsi_add_device() and during shutdown.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |   71 +++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 2 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -41,6 +41,7 @@
 #include <linux/stringify.h>
 #include <linux/timer.h>
 #include <linux/workqueue.h>
+#include <asm/atomic.h>
 #include <asm/system.h>
 
 #include <scsi/scsi.h>
@@ -139,6 +140,7 @@ struct sbp2_logical_unit {
 	int generation;
 	int retries;
 	struct delayed_work work;
+	atomic_t blocked;
 };
 
 /*
@@ -157,6 +159,9 @@ struct sbp2_target {
 	int address_high;
 	unsigned int workarounds;
 	unsigned int mgt_orb_timeout;
+
+	atomic_t dont_block;
+	atomic_t blocked;
 };
 
 /*
@@ -646,6 +651,53 @@ static void sbp2_agent_reset_no_wait(str
 			&z, sizeof(z), complete_agent_reset_write_no_wait, t);
 }
 
+/*
+ * Blocks lu->tgt if all of the following conditions are met:
+ *   - Login, INQUIRY, and high-level SCSI setup of all logical units of the
+ *     target have been successfully finished (indicated by dont_block == 0).
+ *   - The lu->generation is stale.  sbp2_reconnect will unblock lu later.
+ */
+static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
+{
+	struct fw_card *card = fw_device(lu->tgt->unit->device.parent)->card;
+
+	if (!atomic_read(&lu->tgt->dont_block) &&
+	    lu->generation != card->generation &&
+	    atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {
+
+		/* raise the block count of the target */
+		if (atomic_inc_return(&lu->tgt->blocked) == 1) {
+			scsi_block_requests(lu->sdev->host);
+			fw_notify("blocked %s\n", lu->tgt->bus_id);
+		}
+	}
+}
+
+/* Unblocks lu->tgt as soon as all its logical units can be unblocked. */
+static void sbp2_conditionally_unblock(struct sbp2_logical_unit *lu)
+{
+	if (atomic_cmpxchg(&lu->blocked, 1, 0) == 1) {
+
+		/* lower the block count of the target */
+		if (atomic_dec_and_test(&lu->tgt->blocked)) {
+			scsi_unblock_requests(lu->sdev->host);
+			fw_notify("unblocked %s\n", lu->tgt->bus_id);
+		}
+	}
+}
+
+
+/* Prevents future blocking of tgt and then unblocks it. */
+static void sbp2_unblock(struct sbp2_target *tgt)
+{
+	struct Scsi_Host *shost =
+		container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
+
+	atomic_inc(&tgt->dont_block);
+	smp_wmb();
+	scsi_unblock_requests(shost);
+}
+
 static void sbp2_release_target(struct kref *kref)
 {
 	struct sbp2_target *tgt = container_of(kref, struct sbp2_target, kref);
@@ -653,6 +705,12 @@ static void sbp2_release_target(struct k
 	struct Scsi_Host *shost =
 		container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
 
+	/*
+	 * Make sure that the target is unblocked and won't be blocked anymore
+	 * before scsi_remove_device() is called.  Else it will deadlock.
+	 */
+	sbp2_unblock(tgt);
+
 	list_for_each_entry_safe(lu, next, &tgt->lu_list, link) {
 		if (lu->sdev)
 			scsi_remove_device(lu->sdev);
@@ -717,11 +775,14 @@ static void sbp2_login(struct work_struc
 
 	if (sbp2_send_management_orb(lu, node_id, generation,
 				SBP2_LOGIN_REQUEST, lu->lun, &response) < 0) {
-		if (lu->retries++ < 5)
+		if (lu->retries++ < 5) {
 			sbp2_queue_work(lu, DIV_ROUND_UP(HZ, 5));
-		else
+		} else {
 			fw_error("%s: failed to login to LUN %04x\n",
 				 tgt->bus_id, lu->lun);
+			/* Let any waiting I/O fail from now on. */
+			sbp2_unblock(lu->tgt);
+		}
 		goto out;
 	}
 
@@ -749,6 +810,7 @@ static void sbp2_login(struct work_struc
 	/* This was a re-login. */
 	if (lu->sdev) {
 		sbp2_cancel_orbs(lu);
+		sbp2_conditionally_unblock(lu);
 		goto out;
 	}
 
@@ -786,6 +848,8 @@ static void sbp2_login(struct work_struc
 		 * Can you believe it?  Everything went well.
 		 */
 		lu->sdev = sdev;
+		smp_wmb();  /* We need lu->sdev when we want to block lu. */
+		atomic_dec(&lu->tgt->dont_block);
 		scsi_device_put(sdev);
 		goto out;
 	}
@@ -828,6 +892,7 @@ static int sbp2_add_logical_unit(struct 
 	lu->sdev = NULL;
 	lu->lun  = lun_entry & 0xffff;
 	lu->retries = 0;
+	atomic_inc(&tgt->dont_block);
 	INIT_LIST_HEAD(&lu->orb_list);
 	INIT_DELAYED_WORK(&lu->work, sbp2_login);
 
@@ -1053,6 +1118,7 @@ static void sbp2_reconnect(struct work_s
 
 	sbp2_agent_reset(lu);
 	sbp2_cancel_orbs(lu);
+	sbp2_conditionally_unblock(lu);
  out:
 	sbp2_target_put(tgt);
 }
@@ -1172,6 +1238,7 @@ complete_command_orb(struct sbp2_orb *ba
 		 * or when sending the write (less likely).
 		 */
 		result = DID_BUS_BUSY << 16;
+		sbp2_conditionally_block(orb->lu);
 	}
 
 	dma_unmap_single(device->card->device, orb->base.request_bus,

-- 
Stefan Richter
-=====-==--- --=- ---==
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] firewire: fw-sbp2: wait for completion of fetch agent reset
  2008-02-03 22:08 ` [PATCH 4/9] firewire: fw-sbp2: wait for completion of fetch agent reset Stefan Richter
@ 2008-02-04  8:11   ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-04  8:11 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

I wrote:
> Like the old sbp2 driver, wait for the write transaction to the
> AGENT_RESET to complete before proceeding (after login, after reconnect,
> or in SCSI error handling).
> 
> There is one occasion where AGENT_RESET is written to from atomic
> context when getting DEAD status for a command ORB.  There we still
> continue without waiting for the transaction to complete because this
> is more difficult to fix...

In addition, we need to serialize agent resets by the SCSI error handler
and the command ORB completion against ongoing reconnects/ relogins.
-- 
Stefan Richter
-=====-==--- --=- --=--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/9] firewire: log GUID of new devices
  2008-02-03 22:03 ` [PATCH 1/9] firewire: log GUID of new devices Stefan Richter
@ 2008-02-04  8:14   ` Stefan Richter
  2008-02-11 16:53   ` Jarod Wilson
  1 sibling, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-04  8:14 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

I wrote:
> -	else
> -		fw_notify("created new fw device %s "
> -			  "(%d config rom retries, S%d00)\n",
> -			  device->device.bus_id, device->config_rom_retries,
> -			  1 << device->max_speed);
> +	} else {
> +		if (device->config_rom_retries)
> +			fw_notify("created device %s: GUID %08x%08x, S%d00, "
> +				  "%d config ROM retries\n",
> +				  device->device.bus_id,
> +				  device->config_rom[3], device->config_rom[4],
> +				  1 << device->max_speed,
> +				  device->config_rom_retries);

Still to do:  If the ROM reading failed, log how it failed and what was
read up until the failure.
-- 
Stefan Richter
-=====-==--- --=- --=--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/9] firewire-sbp2: misc hotplug related patches
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (8 preceding siblings ...)
  2008-02-03 22:13 ` [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect Stefan Richter
@ 2008-02-04 15:54 ` John Stoffel
  2008-02-04 17:48   ` Stefan Richter
  2008-02-06  5:17 ` Jarod Wilson
  2008-02-06 21:07 ` [PATCH 10/9] firewire: fw-sbp2: preemptively block sdev Stefan Richter
  11 siblings, 1 reply; 41+ messages in thread
From: John Stoffel @ 2008-02-04 15:54 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

>>>>> "Stefan" == Stefan Richter <stefanr@s5r6.in-berlin.de> writes:

Stefan> I got quite good results with several OxSemi based SBP-2
Stefan> devices, also with an Initio based dual-LU device and LSI
Stefan> based devices.  A Prolific PL-3505 based device with known
Stefan> buggy firmware didn't work too well; I don't know if (but
Stefan> doubt that) the ieee1394 stack would do better. 

I've completely given up on my Firewire/USB external enclosure with a
PL-3xxx chipset.  Do you want it for further testing purposes?  I'd be
happy to ship it to you if you like.  

John

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/9] firewire-sbp2: misc hotplug related patches
  2008-02-04 15:54 ` [PATCH 0/9] firewire-sbp2: misc hotplug related patches John Stoffel
@ 2008-02-04 17:48   ` Stefan Richter
  2008-02-04 18:51     ` John Stoffel
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-04 17:48 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux1394-devel, linux-kernel

John Stoffel wrote:
> I've completely given up on my Firewire/USB external enclosure with a
> PL-3xxx chipset.

Is it a variant whose firmware cannot be updated?

> Do you want it for further testing purposes?  I'd be
> happy to ship it to you if you like.  

The offer is appreciated, but as mentioned in PM, the shipping costs
would be out of proportion.
-- 
Stefan Richter
-=====-==--- --=- --=--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/9] firewire-sbp2: misc hotplug related patches
  2008-02-04 17:48   ` Stefan Richter
@ 2008-02-04 18:51     ` John Stoffel
  0 siblings, 0 replies; 41+ messages in thread
From: John Stoffel @ 2008-02-04 18:51 UTC (permalink / raw)
  To: Stefan Richter; +Cc: John Stoffel, linux1394-devel, linux-kernel


Stefan> John Stoffel wrote:
>> I've completely given up on my Firewire/USB external enclosure with a
>> PL-3xxx chipset.

Stefan> Is it a variant whose firmware cannot be updated?

No, it can be updated, and I have done that once before.  It became a
little more stable, but not much.  I forget the details, but either
under USB or FireWire it would lockup or corrupt transfers.  

I seem to remember using the serialize_io=1 parameter too, but it
didn't help either.

>> Do you want it for further testing purposes?  I'd be
>> happy to ship it to you if you like.  

Stefan> The offer is appreciated, but as mentioned in PM, the shipping
Stefan> costs would be out of proportion.

Sure, I'd be happy to ship it to the developer you think could make
the most use of this thing.    I'll have to dig out the specs when I
get home.  Maybe look back in the archives, because I know I wrote
about this before, probably back in 2005 sometime.

John

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/9] firewire-sbp2: misc hotplug related patches
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (9 preceding siblings ...)
  2008-02-04 15:54 ` [PATCH 0/9] firewire-sbp2: misc hotplug related patches John Stoffel
@ 2008-02-06  5:17 ` Jarod Wilson
  2008-02-06 18:27   ` Stefan Richter
  2008-02-06 21:07 ` [PATCH 10/9] firewire: fw-sbp2: preemptively block sdev Stefan Richter
  11 siblings, 1 reply; 41+ messages in thread
From: Jarod Wilson @ 2008-02-06  5:17 UTC (permalink / raw)
  To: linux1394-devel; +Cc: Stefan Richter, linux-kernel

On Sunday 03 February 2008 05:00:54 pm Stefan Richter wrote:
> Here is various stuff to hopefully improve fw-sbp2's behavior during bus
> resets.  The main piece is patch 9/9 which considerably raises the
> chance that ongoing I/O survives plugging and unplugging of other
> devices on the same bus as the device which services the I/O.
>
> The other patches are basically side products of patch 9/9 but contain
> quite useful fixes as well.
>
> I got quite good results with several OxSemi based SBP-2 devices

I've got one setup on which this doesn't seem to help much... Two firewire 
drives (both ox911 bridge, v4.0 firmware) hooked to a system, both of which 
are recognized, logged into, etc., on startup. However, pretty much without 
fail, at least one of them has to perform a reconnection. That claims to 
succeed, but the device isn't actually usable when this happens -- 
fdisk /dev/sdx fails with 'unable to read /dev/sdx'.

Example dmesg output when one of the two drives has to be reconnected:

firewire_core: created device fw0: GUID 00023c0031037366, S400
scsi6 : SBP-2 IEEE-1394
firewire_core: created device fw1: GUID 0050c501e001c394, S400
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
scsi 6:0:0:0: Direct-Access-RBC ST312002 6A               8.01 PQ: 0 ANSI: 4
firewire_core: created device fw2: GUID 000108000000f605, S800
sd 6:0:0:0: [sdc] 234441648 512-byte hardware sectors (120034 MB)
sd 6:0:0:0: [sdc] Write Protect is off
sd 6:0:0:0: [sdc] Mode Sense: 00 00 00 00
sd 6:0:0:0: [sdc] Asking for cache data failed
firewire_core: created device fw3: GUID 0000d10080a575eb, S400
sd 6:0:0:0: [sdc] Assuming drive cache: write through
sd 6:0:0:0: [sdc] READ CAPACITY failed
sd 6:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY 
driverbyte=DRIVER_OK,SUGGEST_OK
sd 6:0:0:0: [sdc] Sense not available.
sd 6:0:0:0: [sdc] Write Protect is off
sd 6:0:0:0: [sdc] Mode Sense: 00 00 00 00
sd 6:0:0:0: [sdc] Asking for cache data failed
sd 6:0:0:0: [sdc] Assuming drive cache: write through
sd 6:0:0:0: [sdc] Attached SCSI disk
scsi7 : SBP-2 IEEE-1394
firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
firewire_core: created device fw4: GUID 0050c501e00b23e9, S400
firewire_core: phy config: card 2, new root=ffc1, gap_count=5
firewire_sbp2: fw4.0: logged in to LUN 0000 (0 retries)
scsi 7:0:0:0: Direct-Access-RBC ST312002 2A               8.01 PQ: 0 ANSI: 4
sd 7:0:0:0: [sdd] 234441648 512-byte hardware sectors (120034 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
sd 7:0:0:0: [sdd] 234441648 512-byte hardware sectors (120034 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
 sdd: sdd1
sd 7:0:0:0: [sdd] Attached SCSI disk

fw device decoder ring:
fw0 = fw400 card
fw1 = /dev/sdc (120G HD in ox911 case, hooked to fw0)
fw2 = fw800 card
fw3 = fw400 card
fw4 = /dev/sdd (120G HD in ox911 case, hooked to fw3)


# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0] 
md4 : active raid1 sdd1[1]
      117218176 blocks [2/1] [_U]

# fdisk /dev/sdc

Unable to read /dev/sdc


Given the READ CAPACITY failed and DID_BUS_BUSY messages for sdc (and lack of 
notice about its partitions), it sort of looks like we never set the disk up 
correctly in the first place, and we're subsequently just reconnecting to 
that failed setup... So the reconnect code may be doing the right thing, and 
the real problem I'm looking at is us screwing up the setup of the device in 
the first place, for some reason...


-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/9] firewire-sbp2: misc hotplug related patches
  2008-02-06  5:17 ` Jarod Wilson
@ 2008-02-06 18:27   ` Stefan Richter
  2008-02-06 21:09     ` [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed Stefan Richter
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-06 18:27 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel

Jarod Wilson wrote:
> firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
> firewire_core: phy config: card 0, new root=ffc1, gap_count=5
> scsi 6:0:0:0: Direct-Access-RBC ST312002 6A               8.01 PQ: 0 ANSI: 4
> firewire_core: created device fw2: GUID 000108000000f605, S800
> sd 6:0:0:0: [sdc] 234441648 512-byte hardware sectors (120034 MB)
> sd 6:0:0:0: [sdc] Write Protect is off
> sd 6:0:0:0: [sdc] Mode Sense: 00 00 00 00
> sd 6:0:0:0: [sdc] Asking for cache data failed
> firewire_core: created device fw3: GUID 0000d10080a575eb, S400
> sd 6:0:0:0: [sdc] Assuming drive cache: write through
> sd 6:0:0:0: [sdc] READ CAPACITY failed
> sd 6:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY 
> driverbyte=DRIVER_OK,SUGGEST_OK
> sd 6:0:0:0: [sdc] Sense not available.
> sd 6:0:0:0: [sdc] Write Protect is off
> sd 6:0:0:0: [sdc] Mode Sense: 00 00 00 00
> sd 6:0:0:0: [sdc] Asking for cache data failed
> sd 6:0:0:0: [sdc] Assuming drive cache: write through
> sd 6:0:0:0: [sdc] Attached SCSI disk
> scsi7 : SBP-2 IEEE-1394
> firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
> firewire_core: created device fw4: GUID 0050c501e00b23e9, S400
[...]
> # fdisk /dev/sdc
> 
> Unable to read /dev/sdc
> 
> 
> Given the READ CAPACITY failed and DID_BUS_BUSY messages for sdc (and lack of 
> notice about its partitions), it sort of looks like we never set the disk up 
> correctly in the first place, and we're subsequently just reconnecting to 
> that failed setup... So the reconnect code may be doing the right thing, and 
> the real problem I'm looking at is us screwing up the setup of the device in 
> the first place, for some reason...

What happens is that the 2nd disk resets the bus while fw-sbp2 executes
__scsi_add_device in sbp2_login.  sbp2_reconnect would be necessary
immediately, but it can't run in parallel with sbp2_login... yet.

Kristian already had a solution for the case that __scsi_add_device
fails due to a bus reset or something else.  Just recently I added a
failover for the case that __scsi_add_device returns with alleged
success but leaves the scsi_device in offline state.  In your case,
__scsi_add_device returns with "succeeds" as well but the results of the
sd driver's probing aren't quite useful, even though the device was not
put offline in the process.

As a minimally invasive fix for setups like yours, we have to make
__scsi_add_device fail if the bus generation changed.  I will try
something later today.
-- 
Stefan Richter
-=====-==--- --=- --==-
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 10/9] firewire: fw-sbp2: preemptively block sdev
  2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
                   ` (10 preceding siblings ...)
  2008-02-06  5:17 ` Jarod Wilson
@ 2008-02-06 21:07 ` Stefan Richter
  11 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-06 21:07 UTC (permalink / raw)
  To: linux1394-devel; +Cc: linux-kernel

Since "fw-sbp2: fix I/O errors during reconnect", a Scsi_Host will be
blocked as soon as a command failed due to bus generation change.  Now
we also block it when fw-core signalled a bus reset via sbp2_update.
This will avoid some command failures and retries (but not all because
commands are injected from tasklet context while sbp2_update runs from
workqueue thread context).

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
 drivers/firewire/fw-sbp2.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -1135,6 +1135,7 @@ static void sbp2_update(struct fw_unit *
 	 * Iteration over tgt->lu_list is therefore safe here.
 	 */
 	list_for_each_entry(lu, &tgt->lu_list, link) {
+		sbp2_conditionally_block(lu);
 		lu->retries = 0;
 		sbp2_queue_work(lu, 0);
 	}

-- 
Stefan Richter
-=====-==--- --=- --==-
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
  2008-02-06 18:27   ` Stefan Richter
@ 2008-02-06 21:09     ` Stefan Richter
  2008-02-08 18:54       ` Jarod Wilson
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-06 21:09 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel

fw-sbp2 is unable to reconnect while performing __scsi_add_device
because there is only a single workqueue thread context available for
both at the moment.  This should be fixed eventually.

Until then, take care that __scsi_add_device does not return success
even though the SCSI high-level driver probing failed (sd READ_CAPACITY
and friends) due to bus reset.  The trick to do so is to use a different
error indicator in the command completion as long as __scsi_add_device
did not return.

An actual failure of __scsi_add_device is easy to handle, but an
incomplete execution of __scsi_add_device with an sdev returned would
remain undetected and leave the SBP-2 target unusable.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---

Jarod, does this work like I assume and fixes your setup of two OXFW911
based disks?

 drivers/firewire/fw-sbp2.c |   29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -825,28 +825,17 @@ static void sbp2_login(struct work_struc
 	sdev = __scsi_add_device(shost, 0, 0,
 				 scsilun_to_int(&eight_bytes_lun), lu);
 	if (IS_ERR(sdev)) {
-		/*
-		 * The most frequent cause for __scsi_add_device() to fail
-		 * is a bus reset while sending the SCSI INQUIRY.  Try again.
-		 */
+		/* Probably a bus reset happened.  Try again. */
 		goto out_logout_login;
 
 	} else if (sdev->sdev_state == SDEV_OFFLINE) {
-		/*
-		 * FIXME:  We are unable to perform reconnects while in
-		 * sbp2_login().  Therefore __scsi_add_device() will get
-		 * into trouble if a bus reset happens in parallel.
-		 * It will either fail (that's OK, see above) or take sdev
-		 * offline.  Here is a crude workaround for the latter.
-		 */
+		/* Something else happened which left the device unusable. */
 		scsi_device_put(sdev);
 		scsi_remove_device(sdev);
 		goto out_logout_login;
 
 	} else {
-		/*
-		 * Can you believe it?  Everything went well.
-		 */
+		/* Can you believe it?  Everything went well. */
 		lu->sdev = sdev;
 		smp_wmb();  /* We need lu->sdev when we want to block lu. */
 		atomic_dec(&lu->tgt->dont_block);
@@ -1237,8 +1226,18 @@ complete_command_orb(struct sbp2_orb *ba
 		 * If the orb completes with status == NULL, something
 		 * went wrong, typically a bus reset happened mid-orb
 		 * or when sending the write (less likely).
+		 *
+		 * Normally, tell the SCSI core we are busy so that the
+		 * command is retried.  But if the sdev is still being
+		 * set up, pretend that the device went away so that
+		 * __scsi_add_device will fail. sbp2_login will retry it.
+		 *
+		 * FIXME:  There is potentially a small window after
+		 * __scsi_add_device returned but before lu->sdev is
+		 * set during which we rather want to say DID_BUS_BUSY.
 		 */
-		result = DID_BUS_BUSY << 16;
+		result = orb->lu->sdev != NULL ? DID_BUS_BUSY << 16 :
+						 DID_NO_CONNECT << 16;
 		sbp2_conditionally_block(orb->lu);
 	}
 

-- 
Stefan Richter
-=====-==--- --=- --==-
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
  2008-02-06 21:09     ` [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed Stefan Richter
@ 2008-02-08 18:54       ` Jarod Wilson
  2008-02-08 19:58         ` Stefan Richter
  0 siblings, 1 reply; 41+ messages in thread
From: Jarod Wilson @ 2008-02-08 18:54 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

On Wednesday 06 February 2008 04:09:47 pm Stefan Richter wrote:
> fw-sbp2 is unable to reconnect while performing __scsi_add_device
> because there is only a single workqueue thread context available for
> both at the moment.  This should be fixed eventually.
>
> Until then, take care that __scsi_add_device does not return success
> even though the SCSI high-level driver probing failed (sd READ_CAPACITY
> and friends) due to bus reset.  The trick to do so is to use a different
> error indicator in the command completion as long as __scsi_add_device
> did not return.
>
> An actual failure of __scsi_add_device is easy to handle, but an
> incomplete execution of __scsi_add_device with an sdev returned would
> remain undetected and leave the SBP-2 target unusable.
>
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
> ---
>
> Jarod, does this work like I assume and fixes your setup of two OXFW911
> based disks?

Well, it results in the dmesg spew saying "sd 6:0:0:0: [sdc] Result: 
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK" -- i.e., 
DID_NO_CONNECT instead of DID_BUS_BUSY, but other than that, no change in 
behavior, sdc remains unusable just as before.


-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
  2008-02-08 18:54       ` Jarod Wilson
@ 2008-02-08 19:58         ` Stefan Richter
  2008-02-08 21:33             ` Stefan Richter
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-08 19:58 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel, linux-scsi

(Adding Cc: LSML)

Jarod Wilson wrote:
> On Wednesday 06 February 2008 04:09:47 pm Stefan Richter wrote:
>> take care that __scsi_add_device does not return success
>> even though the SCSI high-level driver probing failed (sd READ_CAPACITY
>> and friends) due to bus reset.  The trick to do so is to use a different
>> error indicator in the command completion as long as __scsi_add_device
>> did not return.

Or so did I guess, and...

>> An actual failure of __scsi_add_device is easy to handle, but an
>> incomplete execution of __scsi_add_device with an sdev returned would
>> remain undetected and leave the SBP-2 target unusable.
>>
>> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
>> ---
>>
>> Jarod, does this work like I assume and fixes your setup of two OXFW911
>> based disks?
> 
> Well, it results in the dmesg spew saying "sd 6:0:0:0: [sdc] Result: 
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK" -- i.e., 
> DID_NO_CONNECT instead of DID_BUS_BUSY, but other than that, no change in 
> behavior, sdc remains unusable just as before.

...my guess was wrong then.

Either I misunderstood the semantics of the various hostbyte codes in
the command completion return (and then these semantics are
insufficient) --- or SCSI mid layer or high-level implements them wrong.

But before we dive into the SCSI stack or implement parellelism of SBP-2
reconnect and SCSI probing in fw-sbp2, there is another simple and in
hindsight obvious trick we can try.  Stay tuned.

Background for LSML:

In case of unrecoverable transport failures during the execution of
__scsi_add_device, I would like to send appropriate error indicators
from the LLD up to SCSI midlayer so that __scsi_add_device ends in
failure (i.e. returns an error pointer rather than a scsi_device pointer).

Sometimes SCSI core decides to let __scsi_add_device fail, sometimes it
takes the scsi_device offline, sometimes it doesn't do either but
pretends to the LLD that __scsi_add_device was an utter success.  Except
that userspace can't do anything with the scsi_device because e.g. READ
CAPACITY couldn't be executed.
-- 
Stefan Richter
-=====-==--- --=- -=---
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
  2008-02-08 19:58         ` Stefan Richter
@ 2008-02-08 21:33             ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-08 21:33 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel, linux-scsi

Date: 
From: Stefan Richter <stefanr@s5r6.in-berlin.de>
Subject: firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed

fw-sbp2 is unable to reconnect while performing __scsi_add_device
because there is only a single workqueue thread context available for
both at the moment.  This should be fixed eventually.

An actual failure of __scsi_add_device is easy to handle, but an
incomplete execution of __scsi_add_device with an sdev returned would
remain undetected and leave the SBP-2 target unusable.

Therefore we use a workaround:  If there was a bus reset during
__scsi_add_device (i.e. during the SCSI probe), we remove the new sdev
immediately, log out, and attempt login and SCSI probe again.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---

The previous patch "firewire: fw-sbp2: retry login if scsi_device was
offlined early" should be folded into this one before upstream
submission.

 drivers/firewire/fw-sbp2.c |   43 ++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 24 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -824,36 +824,31 @@ static void sbp2_login(struct work_struc
 
 	sdev = __scsi_add_device(shost, 0, 0,
 				 scsilun_to_int(&eight_bytes_lun), lu);
-	if (IS_ERR(sdev)) {
-		/*
-		 * The most frequent cause for __scsi_add_device() to fail
-		 * is a bus reset while sending the SCSI INQUIRY.  Try again.
-		 */
+	/*
+	 * FIXME:  We are unable to perform reconnects while in sbp2_login().
+	 * Therefore __scsi_add_device() will get into trouble if a bus reset
+	 * happens in parallel.  It will either fail or leave us with an
+	 * unusable sdev.  As a workaround we check for this and retry the
+	 * whole login and SCSI probing.
+	 */
+
+	if (IS_ERR(sdev))
 		goto out_logout_login;
 
-	} else if (sdev->sdev_state == SDEV_OFFLINE) {
-		/*
-		 * FIXME:  We are unable to perform reconnects while in
-		 * sbp2_login().  Therefore __scsi_add_device() will get
-		 * into trouble if a bus reset happens in parallel.
-		 * It will either fail (that's OK, see above) or take sdev
-		 * offline.  Here is a crude workaround for the latter.
-		 */
-		scsi_device_put(sdev);
+	scsi_device_put(sdev);
+
+	smp_rmb(); /* get current card generation */
+	if (generation != device->card->generation) {
 		scsi_remove_device(sdev);
 		goto out_logout_login;
-
-	} else {
-		/*
-		 * Can you believe it?  Everything went well.
-		 */
-		lu->sdev = sdev;
-		smp_wmb();  /* We need lu->sdev when we want to block lu. */
-		atomic_dec(&lu->tgt->dont_block);
-		scsi_device_put(sdev);
-		goto out;
 	}
 
+	/* Everything went well. */
+	lu->sdev = sdev;
+	smp_wmb();  /* We need lu->sdev when we want to block lu. */
+	atomic_dec(&lu->tgt->dont_block);
+	goto out;
+
  out_logout_login:
 	smp_rmb(); /* generation may have changed */
 	generation = device->generation;

-- 
Stefan Richter
-=====-==--- --=- -=---
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
@ 2008-02-08 21:33             ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-08 21:33 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel, linux-scsi

Date: 
From: Stefan Richter <stefanr@s5r6.in-berlin.de>
Subject: firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed

fw-sbp2 is unable to reconnect while performing __scsi_add_device
because there is only a single workqueue thread context available for
both at the moment.  This should be fixed eventually.

An actual failure of __scsi_add_device is easy to handle, but an
incomplete execution of __scsi_add_device with an sdev returned would
remain undetected and leave the SBP-2 target unusable.

Therefore we use a workaround:  If there was a bus reset during
__scsi_add_device (i.e. during the SCSI probe), we remove the new sdev
immediately, log out, and attempt login and SCSI probe again.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---

The previous patch "firewire: fw-sbp2: retry login if scsi_device was
offlined early" should be folded into this one before upstream
submission.

 drivers/firewire/fw-sbp2.c |   43 ++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 24 deletions(-)

Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -824,36 +824,31 @@ static void sbp2_login(struct work_struc
 
 	sdev = __scsi_add_device(shost, 0, 0,
 				 scsilun_to_int(&eight_bytes_lun), lu);
-	if (IS_ERR(sdev)) {
-		/*
-		 * The most frequent cause for __scsi_add_device() to fail
-		 * is a bus reset while sending the SCSI INQUIRY.  Try again.
-		 */
+	/*
+	 * FIXME:  We are unable to perform reconnects while in sbp2_login().
+	 * Therefore __scsi_add_device() will get into trouble if a bus reset
+	 * happens in parallel.  It will either fail or leave us with an
+	 * unusable sdev.  As a workaround we check for this and retry the
+	 * whole login and SCSI probing.
+	 */
+
+	if (IS_ERR(sdev))
 		goto out_logout_login;
 
-	} else if (sdev->sdev_state == SDEV_OFFLINE) {
-		/*
-		 * FIXME:  We are unable to perform reconnects while in
-		 * sbp2_login().  Therefore __scsi_add_device() will get
-		 * into trouble if a bus reset happens in parallel.
-		 * It will either fail (that's OK, see above) or take sdev
-		 * offline.  Here is a crude workaround for the latter.
-		 */
-		scsi_device_put(sdev);
+	scsi_device_put(sdev);
+
+	smp_rmb(); /* get current card generation */
+	if (generation != device->card->generation) {
 		scsi_remove_device(sdev);
 		goto out_logout_login;
-
-	} else {
-		/*
-		 * Can you believe it?  Everything went well.
-		 */
-		lu->sdev = sdev;
-		smp_wmb();  /* We need lu->sdev when we want to block lu. */
-		atomic_dec(&lu->tgt->dont_block);
-		scsi_device_put(sdev);
-		goto out;
 	}
 
+	/* Everything went well. */
+	lu->sdev = sdev;
+	smp_wmb();  /* We need lu->sdev when we want to block lu. */
+	atomic_dec(&lu->tgt->dont_block);
+	goto out;
+
  out_logout_login:
 	smp_rmb(); /* generation may have changed */
 	generation = device->generation;

-- 
Stefan Richter
-=====-==--- --=- -=---
http://arcgraph.de/sr/


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
  2008-02-08 21:33             ` Stefan Richter
@ 2008-02-10 18:36               ` Jarod Wilson
  -1 siblings, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-10 18:36 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel, linux-scsi

On Friday 08 February 2008 04:33:29 pm Stefan Richter wrote:
> fw-sbp2 is unable to reconnect while performing __scsi_add_device
> because there is only a single workqueue thread context available for
> both at the moment.  This should be fixed eventually.
>
> An actual failure of __scsi_add_device is easy to handle, but an
> incomplete execution of __scsi_add_device with an sdev returned would
> remain undetected and leave the SBP-2 target unusable.
>
> Therefore we use a workaround:  If there was a bus reset during
> __scsi_add_device (i.e. during the SCSI probe), we remove the new sdev
> immediately, log out, and attempt login and SCSI probe again.
>
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Now *this* does the trick. I get the 'READ CAPACITY failed' as before, 
then 'firewire_sbp2: fw1.0: error status: 0:4', followed by a new login and 
SCSI probe, both of which are successful this time, disk is available for use 
and all that good stuff.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>


-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
@ 2008-02-10 18:36               ` Jarod Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-10 18:36 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel, linux-scsi

On Friday 08 February 2008 04:33:29 pm Stefan Richter wrote:
> fw-sbp2 is unable to reconnect while performing __scsi_add_device
> because there is only a single workqueue thread context available for
> both at the moment.  This should be fixed eventually.
>
> An actual failure of __scsi_add_device is easy to handle, but an
> incomplete execution of __scsi_add_device with an sdev returned would
> remain undetected and leave the SBP-2 target unusable.
>
> Therefore we use a workaround:  If there was a bus reset during
> __scsi_add_device (i.e. during the SCSI probe), we remove the new sdev
> immediately, log out, and attempt login and SCSI probe again.
>
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Now *this* does the trick. I get the 'READ CAPACITY failed' as before, 
then 'firewire_sbp2: fw1.0: error status: 0:4', followed by a new login and 
SCSI probe, both of which are successful this time, disk is available for use 
and all that good stuff.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>


-- 
Jarod Wilson
jwilson@redhat.com

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/9] firewire: log GUID of new devices
  2008-02-03 22:03 ` [PATCH 1/9] firewire: log GUID of new devices Stefan Richter
  2008-02-04  8:14   ` Stefan Richter
@ 2008-02-11 16:53   ` Jarod Wilson
  1 sibling, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 16:53 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> This should help to interpret user reports.  E.g. one can look up the
> vendor OUI (first three bytes of the GUID) and thus tell what is what.
> 
> Also simplifies the math in the GUID sysfs attribute.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Rather liking this one, makes it much easier to figure out what's what 
without having to poke around in config roms.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround
  2008-02-03 22:04 ` [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround Stefan Richter
@ 2008-02-11 17:01   ` Jarod Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 17:01 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> Several different SBP-2 bridges accept a login early while the IDE
> device is still powering up.  They are therefore unable to respond to
> SCSI INQUIRY immediately, and the SCSI core has to retry the INQUIRY.
> One of these retries is typically successful, and all is well.
> 
> But in case of Momobay FX-3A, the INQUIRY retries tend to fail entirely.
> This can usually be avoided by waiting a little while after login before
> letting the SCSI core send the INQUIRY.  The old sbp2 driver handles
> this more gracefully for as yet unknown reasons (perhaps because it
> waits for fetch agent resets to complete, unlike fw-sbp2 which quickly
> proceeds after requesting the agent reset).  Therefore the workaround is
> not as much necessary for sbp2.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Looks reasonable to me. I'm guessing there are some other devices out 
there that may well benefit from this (once added to the workarounds 
table, of course). Interesting that this is an encore appearance for a 
DViCO Momobay device...

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

-- 
Jarod Wilson
jwilson@redhat.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/9] ieee1394: sbp2: add INQUIRY delay workaround
  2008-02-03 22:07 ` [PATCH 3/9] ieee1394: sbp2: " Stefan Richter
@ 2008-02-11 17:03   ` Jarod Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 17:03 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> Add the same workaround as found in fw-sbp2 for feature parity and
> compatibility of the workarounds module parameter.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Same deal as 2/9.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures
  2008-02-03 22:09 ` [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures Stefan Richter
@ 2008-02-11 17:16   ` Jarod Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 17:16 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> for easier readable logs if more than one SBP-2 device is present.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Agree this makes for easier to follow logs, esp. with lots of devices 
(not just spb2).

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice
  2008-02-03 22:10 ` [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice Stefan Richter
@ 2008-02-11 17:19   ` Jarod Wilson
  2008-02-11 19:42     ` Stefan Richter
  0 siblings, 1 reply; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 17:19 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> When a reconnect failed but re-login succeeded, __scsi_add_device was
> called again.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Was the spurious __scsi_add_device simply failing, or was it causing 
other problems as well? I can't remember if I've hit that or not. In 
either case, not calling it when we know we don't need to makes sense.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>


-- 
Jarod Wilson
jwilson@redhat.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect
  2008-02-03 22:11 ` [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect Stefan Richter
@ 2008-02-11 17:32   ` Jarod Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 17:32 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> If fw-sbp2 was too late with requesting the reconnect, the target would
> reject this.  In this case, log out before attempting the reconnect.
> Else several firmwares will deny the re-login because they somehow
> didn't invalidate the old login.
> 
> Also, don't retry reconnects in this situation.  The retries won't
> succeed either.
> 
> These changes improve chances for successful re-login and shorten the
> period during which the logical unit is inaccessible.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Good stuff, easy to see how these will help our chances.

Signed-off-by: Jarod Wilson <jwilson@redhat.com>

-- 
Jarod Wilson
jwilson@redhat.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-03 22:13 ` [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect Stefan Richter
@ 2008-02-11 18:09   ` Jarod Wilson
  2008-02-11 20:21     ` Stefan Richter
  0 siblings, 1 reply; 41+ messages in thread
From: Jarod Wilson @ 2008-02-11 18:09 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> While fw-sbp2 takes the necessary time to reconnect to a logical unit
> after bus reset, the SCSI core keeps sending new commands.  They are all
> immediately completed with host busy status, and application clients or
> filesystems will break quickly.  The SCSI device might even be taken
> offline:  http://bugzilla.kernel.org/show_bug.cgi?id=9734
> 
> The only remedy seems to be to block the SCSI device until reconnect.
> Alas the SCSI core has no useful API to block only one logical unit i.e.
> the scsi_device, therefore we block the entire Scsi_Host.  This
> currently corresponds to an SBP-2 target.  In case of targets with
> multiple logical units, we need to satisfy the dependencies between
> logical units by carefully tracking the blocking state of the target and
> its units.  We block all logical units of a target as soon as one of
> them needs to be blocked, and keep them blocked until all of them are
> ready to be unblocked.
> 
> Furthermore, as the history of the old sbp2 driver has shown, the
> scsi_block_requests() API is a minefield with high potential of
> deadlocks.  We therefore take extra measures to keep logical units
> unblocked during __scsi_add_device() and during shutdown.
> 
> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

> +/*
> + * Blocks lu->tgt if all of the following conditions are met:
> + *   - Login, INQUIRY, and high-level SCSI setup of all logical units of the
> + *     target have been successfully finished (indicated by dont_block == 0).
> + *   - The lu->generation is stale.  sbp2_reconnect will unblock lu later.
> + */
> +static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
> +{
> +	struct fw_card *card = fw_device(lu->tgt->unit->device.parent)->card;
> +
> +	if (!atomic_read(&lu->tgt->dont_block) &&
> +	    lu->generation != card->generation &&
> +	    atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {

Just to be absolutely sure, we don't need any barriers here to ensure we 
get the right generations, do we?

Also, this isn't expected to let I/O survive a disk being unplugged 
briefly, then plugged back in, is it? (I recall that being discussed, 
but I think it was as a 'would be nice to do in the future' thing).

-- 
Jarod Wilson
jwilson@redhat.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice
  2008-02-11 17:19   ` Jarod Wilson
@ 2008-02-11 19:42     ` Stefan Richter
  2008-02-12  8:55       ` Stefan Richter
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-11 19:42 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel

Jarod Wilson wrote:
> Stefan Richter wrote:
>> When a reconnect failed but re-login succeeded, __scsi_add_device was
>> called again.
>>
>> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
> 
> Was the spurious __scsi_add_device simply failing, or was it causing
> other problems as well? I can't remember if I've hit that or not. In
> either case, not calling it when we know we don't need to makes sense.

SCSI core looks up whether the scsi_target and the scsi_device already
exist.  If so, __scsi_add_device succeeds (there should also be a log
message "scsi scan: device exists on %s\n") and returns the pointer to
the existing scsi_device.

The code in fw-sbp2 before this patch series would then continue
appropriately ... except that it misses to call sbp2_cancel_orbs.  Right
now I am not quite sure which consequences that has; probably no serious
ones right now.  SCSI core would call fw-sbp2's eh_abort_handler
eventually which would reset the fetch agent once more and cancel the
timed out ORBs.  So, besides the needless lookups and temporary
allocations in SCSI core, we avoid the additional stall and timeout
until eh_abort_handler hits.

I shall add this to the changelog.

BTW, since the addition of the blocking and unblocking between bus reset
and reconnect/re-login we also need the unblocking call in the re-login
case but not (necessarily) in the first login case.
-- 
Stefan Richter
-=====-==--- --=- -=-==
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-11 18:09   ` Jarod Wilson
@ 2008-02-11 20:21     ` Stefan Richter
  2008-02-12  5:07       ` Jarod Wilson
  2008-02-16 15:37       ` Stefan Richter
  0 siblings, 2 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-11 20:21 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel

Jarod Wilson wrote:
> Stefan Richter wrote:
>> +static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
>> +{
>> +    struct fw_card *card =
>> fw_device(lu->tgt->unit->device.parent)->card;
>> +
>> +    if (!atomic_read(&lu->tgt->dont_block) &&
>> +        lu->generation != card->generation &&
>> +        atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {
> 
> Just to be absolutely sure, we don't need any barriers here to ensure we
> get the right generations, do we?

I didn't think so.  But I will carefully look at it again later this
week.  The function definitely must not block the device when the
generation is current.  We look at two data fields here which makes this
even more problematic.  Could be that we need locks after all.

> Also, this isn't expected to let I/O survive a disk being unplugged
> briefly, then plugged back in, is it?

No, this only tells the SCSI core to not bother fw-sbp2's
.queuecommand() with new commands before reconnect.  This will
mysteriously convince the SCSI core to not put the device offline too
quickly and will stabilize application client behavior thanks to
considerably fewer command retries.

To survive real or perceived temporary unplugs ("perceived" unplugs can
happen if a third node is slowly plugged in or out), we need to do
something in fw-device.c.  We have to keep the fw_device around after
node removal event until a timeout, to check newly added devices whether
they are in fact one of the undead devices, and to revive that one
rather than creating a new one.

> (I recall that being discussed, but I think it was as a 'would be
> nice to do in the future' thing).

I realized now that it is a 'need it sooner than later' thing because of
these "perceived" unplugs.  We need this feature at least with a minimal
timeout, otherwise people will sometimes lose connection to their
devices (the scsi_device will be destroyed and a new one created) when
they plug a 3rd or 4th or nth node.  As mentioned in another post, this
is an actual regression for those who migrated from ieee1394 to fw-core.
But fear not, it looks like I will have a prolonged weekend.  :-)
-- 
Stefan Richter
-=====-==--- --=- -=-==
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-11 20:21     ` Stefan Richter
@ 2008-02-12  5:07       ` Jarod Wilson
  2008-02-12  8:01         ` Stefan Richter
  2008-02-16 15:37       ` Stefan Richter
  1 sibling, 1 reply; 41+ messages in thread
From: Jarod Wilson @ 2008-02-12  5:07 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> Jarod Wilson wrote:
>> Stefan Richter wrote:
>>> +static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
>>> +{
>>> +    struct fw_card *card =
>>> fw_device(lu->tgt->unit->device.parent)->card;
>>> +
>>> +    if (!atomic_read(&lu->tgt->dont_block) &&
>>> +        lu->generation != card->generation &&
>>> +        atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {
>> Just to be absolutely sure, we don't need any barriers here to ensure we
>> get the right generations, do we?
> 
> I didn't think so.  But I will carefully look at it again later this
> week.  The function definitely must not block the device when the
> generation is current.  We look at two data fields here which makes this
> even more problematic.  Could be that we need locks after all.

I didn't see anything else earlier on that guaranteed we got the current 
generation in both cases, but I didn't look exhaustively, and you know 
this code a lot better than I do, so I definitely defer to your better 
judgment, just wanted to make sure it had been considered.


>> Also, this isn't expected to let I/O survive a disk being unplugged
>> briefly, then plugged back in, is it?
> 
> No, this only tells the SCSI core to not bother fw-sbp2's
> .queuecommand() with new commands before reconnect.  This will
> mysteriously convince the SCSI core to not put the device offline too
> quickly and will stabilize application client behavior thanks to
> considerably fewer command retries.

Okay, that's what I thought and what I observed in operation.

> To survive real or perceived temporary unplugs ("perceived" unplugs can
> happen if a third node is slowly plugged in or out), we need to do
> something in fw-device.c.  We have to keep the fw_device around after
> node removal event until a timeout, to check newly added devices whether
> they are in fact one of the undead devices, and to revive that one
> rather than creating a new one.

Gotcha. So basically, a temporary table of devices recently "removed", 
which will expire to full/actual removal after some relatively short 
interval, but which we'll also check for matches of "newly" connected 
devices to see if we should cancel removing the device and just pretend 
like it never actually went away. Right? I wonder if there's any sort of 
guidance on this sort of thing in the firewire specs... I'll make an 
effort to search for relevant info, unless you've already got it.

>> (I recall that being discussed, but I think it was as a 'would be
>> nice to do in the future' thing).
> 
> I realized now that it is a 'need it sooner than later' thing because of
> these "perceived" unplugs.  We need this feature at least with a minimal
> timeout, otherwise people will sometimes lose connection to their
> devices (the scsi_device will be destroyed and a new one created) when
> they plug a 3rd or 4th or nth node.  As mentioned in another post, this
> is an actual regression for those who migrated from ieee1394 to fw-core.
> But fear not, it looks like I will have a prolonged weekend.  :-)

Heh, sounds good. I missed out on my entire past weekend (and half a 
week of work) due to family illnesses. Hoping to throw a bunch of time 
at further firewire work this week though.

-- 
Jarod Wilson
jwilson@redhat.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-12  5:07       ` Jarod Wilson
@ 2008-02-12  8:01         ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-12  8:01 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: Stefan Richter, linux1394-devel, linux-kernel

Jarod Wilson wrote:
> Stefan Richter wrote:
>> "perceived" unplugs can happen if a third node is slowly plugged in or out
[...]
> I wonder if there's any sort of
> guidance on this sort of thing in the firewire specs...

I am not aware of any.

The hardware related parts of the 1394a PHY layer spec have AFAIR
enhancements over 1394-1995 to avoid bursts of bus reset events when
connectors scrape into or out of the ports.  But what I saw on a few
occasions now were proper self ID complete events, not merely bus reset
events.  (The firewire stack only cares for self ID complete, therefore
multiple bus reset events before one self ID complete event aren't a
problem in the first place.  I don't know though whether these bursts of
bus resets which 1394a addressed had self ID complete events in them or
not.)
-- 
Stefan Richter
-=====-==--- --=- -==--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice
  2008-02-11 19:42     ` Stefan Richter
@ 2008-02-12  8:55       ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-12  8:55 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel

Stefan Richter wrote:
> Jarod Wilson wrote:
>> Was the spurious __scsi_add_device simply failing, or was it causing
>> other problems as well?

> SCSI core looks up whether the scsi_target and the scsi_device already
> exist.  If so, __scsi_add_device succeeds (there should also be a log
> message "scsi scan: device exists on %s\n") and returns the pointer to
> the existing scsi_device.

PS:  This behavior is undocumented and we shouldn't rely on it.
-- 
Stefan Richter
-=====-==--- --=- -==--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
  2008-02-10 18:36               ` Jarod Wilson
@ 2008-02-16 15:01                 ` Stefan Richter
  -1 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-16 15:01 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel, linux-scsi

Jarod Wilson wrote on 2008-02-10:
> Now *this* does the trick. I get the 'READ CAPACITY failed' as before, 
> then 'firewire_sbp2: fw1.0: error status: 0:4', followed by a new login and 
> SCSI probe, both of which are successful this time, disk is available for use 
> and all that good stuff.

FYI, I now committed this patch to linux1394-2.6.git after folding patch
 "firewire: fw-sbp2: retry login if scsi_device was offlined early" into it.
-- 
Stefan Richter
-=====-==--- --=- =----
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed
@ 2008-02-16 15:01                 ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-16 15:01 UTC (permalink / raw)
  To: Jarod Wilson; +Cc: linux1394-devel, linux-kernel, linux-scsi

Jarod Wilson wrote on 2008-02-10:
> Now *this* does the trick. I get the 'READ CAPACITY failed' as before, 
> then 'firewire_sbp2: fw1.0: error status: 0:4', followed by a new login and 
> SCSI probe, both of which are successful this time, disk is available for use 
> and all that good stuff.

FYI, I now committed this patch to linux1394-2.6.git after folding patch
 "firewire: fw-sbp2: retry login if scsi_device was offlined early" into it.
-- 
Stefan Richter
-=====-==--- --=- =----
http://arcgraph.de/sr/

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-11 20:21     ` Stefan Richter
  2008-02-12  5:07       ` Jarod Wilson
@ 2008-02-16 15:37       ` Stefan Richter
  2008-02-16 15:51         ` Stefan Richter
  1 sibling, 1 reply; 41+ messages in thread
From: Stefan Richter @ 2008-02-16 15:37 UTC (permalink / raw)
  To: Jarod Wilson, Kristian Hoegsberg; +Cc: linux1394-devel, linux-kernel

I wrote:
> Jarod Wilson wrote:
>> Stefan Richter wrote:
>>> +static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
>>> +{
>>> +    struct fw_card *card =
>>> fw_device(lu->tgt->unit->device.parent)->card;
>>> +
>>> +    if (!atomic_read(&lu->tgt->dont_block) &&
>>> +        lu->generation != card->generation &&
>>> +        atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {
>> 
>> Just to be absolutely sure, we don't need any barriers here to ensure we
>> get the right generations, do we?
> 
> I didn't think so.  But I will carefully look at it again later this
> week.  The function definitely must not block the device when the
> generation is current.  We look at two data fields here which makes this
> even more problematic.  Could be that we need locks after all.

My lockless juggling with six variables was indeed broken.  Who would
have expected that.  So here is an update, also with a more realistic
patch title.

Without the patch, _all_ ongoing I/O during bus resets will _always_
fail.  With the patch, a lot but sadly not all I/O will survive.  But at
least this stuff is enough of an improvement while not too invasive to
be considered for upstream inclusion in a later 2.6.25-rc.


From: Stefan Richter <stefanr@s5r6.in-berlin.de>
Subject: firewire: fw-sbp2: (try to) avoid I/O errors during reconnect

While fw-sbp2 takes the necessary time to reconnect to a logical unit
after bus reset, the SCSI core keeps sending new commands.  They are all
immediately completed with host busy status, and application clients or
filesystems will break quickly.  The SCSI device might even be taken
offline:  http://bugzilla.kernel.org/show_bug.cgi?id=9734

The only remedy seems to be to block the SCSI device until reconnect.
Alas the SCSI core has no useful API to block only one logical unit i.e.
the scsi_device, therefore we block the entire Scsi_Host.  This
currently corresponds to an SBP-2 target.  In case of targets with
multiple logical units, we need to satisfy the dependencies between
logical units by carefully tracking the blocking state of the target and
its units.  We block all logical units of a target as soon as one of
them needs to be blocked, and keep them blocked until all of them are
ready to be unblocked.

Furthermore, as the history of the old sbp2 driver has shown, the
scsi_block_requests() API is a minefield with high potential of
deadlocks.  We therefore take extra measures to keep logical units
unblocked during __scsi_add_device() and during shutdown.

This avoids I/O errors during reconnect in many but alas not in all
cases.  There may still be errors after a re-login had to be performed.
Also, some bridges have been seen to cease fetching management ORBs if
I/O went on up until a bus reset.  In these cases, all management ORBs
time out after mgt_orb_timeout.  The old sbp2 driver is less vulnerable
or maybe not vulnerable to this, for as yet unknown reasons.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---

Update:
  - included patch "firewire: fw-sbp2: preemptively block sdev"
  - converted from lockless to spinlock protected accesses due to too
    many interdependencies
  - updated patch description WRT how well the fix works.

 drivers/firewire/fw-sbp2.c |  126 +++++++++++++++++++++++++++++++++++--
 1 file changed, 122 insertions(+), 4 deletions(-)

Index: linux-2.6.25-rc1/drivers/firewire/fw-sbp2.c
===================================================================
--- linux-2.6.25-rc1.orig/drivers/firewire/fw-sbp2.c
+++ linux-2.6.25-rc1/drivers/firewire/fw-sbp2.c
@@ -139,6 +139,7 @@ struct sbp2_logical_unit {
 	int generation;
 	int retries;
 	struct delayed_work work;
+	bool blocked;
 };
 
 /*
@@ -157,6 +158,9 @@ struct sbp2_target {
 	int address_high;
 	unsigned int workarounds;
 	unsigned int mgt_orb_timeout;
+
+	int dont_block;	/* counter for each logical unit */
+	int blocked;	/* ditto */
 };
 
 /*
@@ -646,6 +650,107 @@ static void sbp2_agent_reset_no_wait(str
 			&z, sizeof(z), complete_agent_reset_write_no_wait, t);
 }
 
+static void sbp2_set_generation(struct sbp2_logical_unit *lu, int generation)
+{
+	struct fw_card *card = fw_device(lu->tgt->unit->device.parent)->card;
+	unsigned long flags;
+
+	/* serialize with comparisons of lu->generation and card->generation */
+	spin_lock_irqsave(&card->lock, flags);
+	lu->generation = generation;
+	spin_unlock_irqrestore(&card->lock, flags);
+}
+
+static inline void sbp2_allow_block(struct sbp2_logical_unit *lu)
+{
+	/*
+	 * We may access dont_block without taking card->lock here:
+	 * All callers of sbp2_allow_block() and all callers of sbp2_unblock()
+	 * are currently serialized against each other.
+	 * And a wrong result in sbp2_conditionally_block()'s access of
+	 * dont_block is rather harmless, it simply misses its first chance.
+	 */
+	--lu->tgt->dont_block;
+}
+
+/*
+ * Blocks lu->tgt if all of the following conditions are met:
+ *   - Login, INQUIRY, and high-level SCSI setup of all of the target's
+ *     logical units have been finished (indicated by dont_block == 0).
+ *   - lu->generation is stale.
+ *
+ * Note, scsi_block_requests() must be called while holding card->lock,
+ * otherwise it might foil sbp2_[conditionally_]unblock()'s attempt to
+ * unblock the target.
+ */
+static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
+{
+	struct sbp2_target *tgt = lu->tgt;
+	struct fw_card *card = fw_device(tgt->unit->device.parent)->card;
+	struct Scsi_Host *shost =
+		container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
+	unsigned long flags;
+
+	spin_lock_irqsave(&card->lock, flags);
+	if (!tgt->dont_block && !lu->blocked &&
+	    lu->generation != card->generation) {
+		lu->blocked = true;
+		if (++tgt->blocked == 1) {
+			scsi_block_requests(shost);
+			fw_notify("blocked %s\n", lu->tgt->bus_id);
+		}
+	}
+	spin_unlock_irqrestore(&card->lock, flags);
+}
+
+/*
+ * Unblocks lu->tgt as soon as all its logical units can be unblocked.
+ * Note, it is harmless to run scsi_unblock_requests() outside the
+ * card->lock protected section.  On the other hand, running it inside
+ * the section might clash with shost->host_lock.
+ */
+static void sbp2_conditionally_unblock(struct sbp2_logical_unit *lu)
+{
+	struct sbp2_target *tgt = lu->tgt;
+	struct fw_card *card = fw_device(tgt->unit->device.parent)->card;
+	struct Scsi_Host *shost =
+		container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
+	unsigned long flags;
+	bool unblock = false;
+
+	spin_lock_irqsave(&card->lock, flags);
+	if (lu->blocked && lu->generation == card->generation) {
+		lu->blocked = false;
+		unblock = --tgt->blocked == 0;
+	}
+	spin_unlock_irqrestore(&card->lock, flags);
+
+	if (unblock) {
+		scsi_unblock_requests(shost);
+		fw_notify("unblocked %s\n", lu->tgt->bus_id);
+	}
+}
+
+/*
+ * Prevents future blocking of tgt and unblocks it.
+ * Note, it is harmless to run scsi_unblock_requests() outside the
+ * card->lock protected section.  On the other hand, running it inside
+ * the section might clash with shost->host_lock.
+ */
+static void sbp2_unblock(struct sbp2_target *tgt)
+{
+	struct fw_card *card = fw_device(tgt->unit->device.parent)->card;
+	struct Scsi_Host *shost =
+		container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
+	unsigned long flags;
+
+	spin_lock_irqsave(&card->lock, flags);
+	++tgt->dont_block;
+	spin_unlock_irqrestore(&card->lock, flags);
+
+	scsi_unblock_requests(shost);
+}
+
 static void sbp2_release_target(struct kref *kref)
 {
 	struct sbp2_target *tgt = container_of(kref, struct sbp2_target, kref);
@@ -653,6 +758,9 @@ static void sbp2_release_target(struct k
 	struct Scsi_Host *shost =
 		container_of((void *)tgt, struct Scsi_Host, hostdata[0]);
 
+	/* prevent deadlocks */
+	sbp2_unblock(tgt);
+
 	list_for_each_entry_safe(lu, next, &tgt->lu_list, link) {
 		if (lu->sdev)
 			scsi_remove_device(lu->sdev);
@@ -717,17 +825,20 @@ static void sbp2_login(struct work_struc
 
 	if (sbp2_send_management_orb(lu, node_id, generation,
 				SBP2_LOGIN_REQUEST, lu->lun, &response) < 0) {
-		if (lu->retries++ < 5)
+		if (lu->retries++ < 5) {
 			sbp2_queue_work(lu, DIV_ROUND_UP(HZ, 5));
-		else
+		} else {
 			fw_error("%s: failed to login to LUN %04x\n",
 				 tgt->bus_id, lu->lun);
+			/* Let any waiting I/O fail from now on. */
+			sbp2_unblock(lu->tgt);
+		}
 		goto out;
 	}
 
-	lu->generation    = generation;
 	tgt->node_id	  = node_id;
 	tgt->address_high = local_node_id << 16;
+	sbp2_set_generation(lu, generation);
 
 	/* Get command block agent offset and login id. */
 	lu->command_block_agent_address =
@@ -749,6 +860,7 @@ static void sbp2_login(struct work_struc
 	/* This was a re-login. */
 	if (lu->sdev) {
 		sbp2_cancel_orbs(lu);
+		sbp2_conditionally_unblock(lu);
 		goto out;
 	}
 
@@ -785,6 +897,7 @@ static void sbp2_login(struct work_struc
 
 	/* No error during __scsi_add_device() */
 	lu->sdev = sdev;
+	sbp2_allow_block(lu);
 	goto out;
 
  out_logout_login:
@@ -825,6 +938,8 @@ static int sbp2_add_logical_unit(struct 
 	lu->sdev = NULL;
 	lu->lun  = lun_entry & 0xffff;
 	lu->retries = 0;
+	lu->blocked = false;
+	++tgt->dont_block;
 	INIT_LIST_HEAD(&lu->orb_list);
 	INIT_DELAYED_WORK(&lu->work, sbp2_login);
 
@@ -1041,15 +1156,16 @@ static void sbp2_reconnect(struct work_s
 		goto out;
 	}
 
-	lu->generation    = generation;
 	tgt->node_id      = node_id;
 	tgt->address_high = local_node_id << 16;
+	sbp2_set_generation(lu, generation);
 
 	fw_notify("%s: reconnected to LUN %04x (%d retries)\n",
 		  tgt->bus_id, lu->lun, lu->retries);
 
 	sbp2_agent_reset(lu);
 	sbp2_cancel_orbs(lu);
+	sbp2_conditionally_unblock(lu);
  out:
 	sbp2_target_put(tgt);
 }
@@ -1066,6 +1182,7 @@ static void sbp2_update(struct fw_unit *
 	 * Iteration over tgt->lu_list is therefore safe here.
 	 */
 	list_for_each_entry(lu, &tgt->lu_list, link) {
+		sbp2_conditionally_block(lu);
 		lu->retries = 0;
 		sbp2_queue_work(lu, 0);
 	}
@@ -1169,6 +1286,7 @@ complete_command_orb(struct sbp2_orb *ba
 		 * or when sending the write (less likely).
 		 */
 		result = DID_BUS_BUSY << 16;
+		sbp2_conditionally_block(orb->lu);
 	}
 
 	dma_unmap_single(device->card->device, orb->base.request_bus,

-- 
Stefan Richter
-=====-==--- --=- =----
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect
  2008-02-16 15:37       ` Stefan Richter
@ 2008-02-16 15:51         ` Stefan Richter
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Richter @ 2008-02-16 15:51 UTC (permalink / raw)
  To: Jarod Wilson, Kristian Hoegsberg; +Cc: linux1394-devel, linux-kernel

I wrote:
> The only remedy seems to be to block the SCSI device until reconnect.

In the longer term, we should look into keeping commands enqueued in
fw-sbp2 during the reconnect phase.

Plus we need full parallelism of sbp2_reconnect and sbp2_login between
all attached targets.  The current single-threaded scheme does not work
too well for more than one target attached to the same bus.  The old
stack handles that case in a single thread too but is better tuned to
sbp2's needs.  The old stack blocks the Scsi_Hosts earlier, guaranteedly
performs all reconnects and re-logins before any new login, and has a
small delay between self-ID-complete and reconnect.  I looked into
changing fw-sbp2's own workqueue delays and into adding a counter to
fw-sbp2 to track outstanding reconnects in order to defer new logins,
but it is too messy.
-- 
Stefan Richter
-=====-==--- --=- =----
http://arcgraph.de/sr/


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2008-02-16 15:53 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-03 22:00 [PATCH 0/9] firewire-sbp2: misc hotplug related patches Stefan Richter
2008-02-03 22:03 ` [PATCH 1/9] firewire: log GUID of new devices Stefan Richter
2008-02-04  8:14   ` Stefan Richter
2008-02-11 16:53   ` Jarod Wilson
2008-02-03 22:04 ` [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround Stefan Richter
2008-02-11 17:01   ` Jarod Wilson
2008-02-03 22:07 ` [PATCH 3/9] ieee1394: sbp2: " Stefan Richter
2008-02-11 17:03   ` Jarod Wilson
2008-02-03 22:08 ` [PATCH 4/9] firewire: fw-sbp2: wait for completion of fetch agent reset Stefan Richter
2008-02-04  8:11   ` Stefan Richter
2008-02-03 22:09 ` [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures Stefan Richter
2008-02-11 17:16   ` Jarod Wilson
2008-02-03 22:10 ` [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice Stefan Richter
2008-02-11 17:19   ` Jarod Wilson
2008-02-11 19:42     ` Stefan Richter
2008-02-12  8:55       ` Stefan Richter
2008-02-03 22:11 ` [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect Stefan Richter
2008-02-11 17:32   ` Jarod Wilson
2008-02-03 22:12 ` [PATCH 8/9] firewire: fw-sbp2: sort includes Stefan Richter
2008-02-03 22:13 ` [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect Stefan Richter
2008-02-11 18:09   ` Jarod Wilson
2008-02-11 20:21     ` Stefan Richter
2008-02-12  5:07       ` Jarod Wilson
2008-02-12  8:01         ` Stefan Richter
2008-02-16 15:37       ` Stefan Richter
2008-02-16 15:51         ` Stefan Richter
2008-02-04 15:54 ` [PATCH 0/9] firewire-sbp2: misc hotplug related patches John Stoffel
2008-02-04 17:48   ` Stefan Richter
2008-02-04 18:51     ` John Stoffel
2008-02-06  5:17 ` Jarod Wilson
2008-02-06 18:27   ` Stefan Richter
2008-02-06 21:09     ` [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed Stefan Richter
2008-02-08 18:54       ` Jarod Wilson
2008-02-08 19:58         ` Stefan Richter
2008-02-08 21:33           ` [PATCH 11/9 update] " Stefan Richter
2008-02-08 21:33             ` Stefan Richter
2008-02-10 18:36             ` Jarod Wilson
2008-02-10 18:36               ` Jarod Wilson
2008-02-16 15:01               ` Stefan Richter
2008-02-16 15:01                 ` Stefan Richter
2008-02-06 21:07 ` [PATCH 10/9] firewire: fw-sbp2: preemptively block sdev Stefan Richter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.