All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
@ 2015-10-14 13:50 Johannes Thumshirn
  2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 13:50 UTC (permalink / raw)
  To: James E.J. Bottomley, Christoph Hellwig, Hannes Reinecke
  Cc: linux-scsi, linux-kernel, Johannes Thumshirn

Removing a SCSI target via scsi_remove_target() suspected to be racy. When a
sibling get's removed from the list it can occassionly happen that one CPU is
stuck endlessly looping around this code block

list_for_each_entry(starget, &shost->__targets, siblings) {
        if (starget->state == STARGET_DEL)
                continue;

Resulting in the following hard lockup.

Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[...]
Call Trace:
 [<ffffffff8100471d>] dump_trace+0x7d/0x2d0
 [<ffffffff81004a04>] show_stack_log_lvl+0x94/0x170
 [<ffffffff81005cc1>] show_stack+0x21/0x50
 [<ffffffff8151aa75>] dump_stack+0x41/0x51
 [<ffffffff8151545a>] panic+0xc8/0x1d7
 [<ffffffff810fbdda>] watchdog_overflow_callback+0xba/0xc0
 [<ffffffff811336c8>] __perf_event_overflow+0x88/0x240
 [<ffffffff8101e3aa>] intel_pmu_handle_irq+0x1fa/0x3e0
 [<ffffffff81522836>] perf_event_nmi_handler+0x26/0x40
 [<ffffffff81521fcd>] nmi_handle.isra.2+0x8d/0x180
 [<ffffffff815221e6>] do_nmi+0x126/0x3c0
 [<ffffffff8152159b>] end_repeat_nmi+0x1a/0x1e
 [<ffffffffa00212e8>] scsi_remove_target+0x68/0x240 [scsi_mod]
 [<ffffffff81072742>] process_one_work+0x172/0x420
 [<ffffffff810733ba>] worker_thread+0x11a/0x3c0
 [<ffffffff81079d34>] kthread+0xb4/0xc0
 [<ffffffff81528cd8>] ret_from_fork+0x58/0x90

This series attacks the issue by 1) decoupling the __targets and __devices
lists of struct Scsi_Host from the host_lock spinlock by introducing spinlocks
for each list and 2) de-coupling the list traversals needed for detecting
targets/devices to be removed from the actual removal by moving list entries to
be deleted to a second list and perform the deletion there.


The whole series survived a nearly 48h test loop of:
while [ $not_done  ]; do
	umount $mountpoint;
	rmmod $module;
	modprobe $module;
	mount $mountpoint;
done

This is a follow up of the patch proposed here:
http://marc.info/?l=linux-scsi&m=144377409311774&w=2
incorporating Christoph's comment

Johannes Thumshirn (3):
  SCSI: Introduce device_lock and target_lock in Scsi_Host
  SCSI: Rework list handling in scsi_target_remove
  SCSI: Rework list handling in __scsi_target_remove

 drivers/scsi/53c700.c     |  3 +++
 drivers/scsi/hosts.c      |  2 ++
 drivers/scsi/scsi.c       |  8 ++++----
 drivers/scsi/scsi_scan.c  | 10 +++++-----
 drivers/scsi/scsi_sysfs.c | 43 +++++++++++++++++++++----------------------
 include/scsi/scsi_host.h  |  2 ++
 6 files changed, 37 insertions(+), 31 deletions(-)

-- 
1.8.5.6


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host
  2015-10-14 13:50 [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() Johannes Thumshirn
@ 2015-10-14 13:50 ` Johannes Thumshirn
  2015-10-14 14:14   ` Hannes Reinecke
                     ` (2 more replies)
  2015-10-14 13:50 ` [PATCH 2/3] SCSI: Rework list handling in scsi_target_remove Johannes Thumshirn
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 13:50 UTC (permalink / raw)
  To: James E.J. Bottomley, Christoph Hellwig, Hannes Reinecke
  Cc: linux-scsi, linux-kernel, Johannes Thumshirn

Introduce target_lock and device_lock to untangle the __devices and __targets
lists from the host_lock.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/53c700.c     |  3 +++
 drivers/scsi/hosts.c      |  2 ++
 drivers/scsi/scsi.c       |  8 ++++----
 drivers/scsi/scsi_scan.c  | 10 +++++-----
 drivers/scsi/scsi_sysfs.c | 18 ++++++++----------
 include/scsi/scsi_host.h  |  2 ++
 6 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/53c700.c b/drivers/scsi/53c700.c
index a209c34..e2b4d04 100644
--- a/drivers/scsi/53c700.c
+++ b/drivers/scsi/53c700.c
@@ -1093,6 +1093,7 @@ process_script_interrupt(__u32 dsps, __u32 dsp, struct scsi_cmnd *SCp,
 		struct NCR_700_command_slot *slot;
 		__u8 reselection_id = hostdata->reselection_id;
 		struct scsi_device *SDp;
+		unsigned long flags;
 
 		lun = hostdata->msgin[0] & 0x1f;
 
@@ -1100,7 +1101,9 @@ process_script_interrupt(__u32 dsps, __u32 dsp, struct scsi_cmnd *SCp,
 		DEBUG(("scsi%d: (%d:%d) RESELECTED!\n",
 		       host->host_no, reselection_id, lun));
 		/* clear the reselection indicator */
+		spin_lock_irqsave(host->device_lock, flags);
 		SDp = __scsi_device_lookup(host, 0, reselection_id, lun);
+		spin_unlock_irqrestore(host->device_lock, flags);
 		if(unlikely(SDp == NULL)) {
 			printk(KERN_ERR "scsi%d: (%d:%d) HAS NO device\n",
 			       host->host_no, reselection_id, lun);
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8bb173e..6855434 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -380,6 +380,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 
 	shost->host_lock = &shost->default_lock;
 	spin_lock_init(shost->host_lock);
+	spin_lock_init(&shost->device_lock);
+	spin_lock_init(&shost->target_lock);
 	shost->shost_state = SHOST_CREATED;
 	INIT_LIST_HEAD(&shost->__devices);
 	INIT_LIST_HEAD(&shost->__targets);
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 207d6a7..0e1046a 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -970,7 +970,7 @@ struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
 	struct scsi_device *next = NULL;
 	unsigned long flags;
 
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->device_lock, flags);
 	while (list->next != &shost->__devices) {
 		next = list_entry(list->next, struct scsi_device, siblings);
 		/* skip devices that we can't get a reference to */
@@ -979,7 +979,7 @@ struct scsi_device *__scsi_iterate_devices(struct Scsi_Host *shost,
 		next = NULL;
 		list = list->next;
 	}
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->device_lock, flags);
 
 	if (prev)
 		scsi_device_put(prev);
@@ -1144,11 +1144,11 @@ struct scsi_device *scsi_device_lookup(struct Scsi_Host *shost,
 	struct scsi_device *sdev;
 	unsigned long flags;
 
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->device_lock, flags);
 	sdev = __scsi_device_lookup(shost, channel, id, lun);
 	if (sdev && scsi_device_get(sdev))
 		sdev = NULL;
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->device_lock, flags);
 
 	return sdev;
 }
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index f9f3f82..ac68531 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -441,14 +441,14 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
 	starget->scsi_level = SCSI_2;
 	starget->max_target_blocked = SCSI_DEFAULT_TARGET_BLOCKED;
  retry:
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->target_lock, flags);
 
 	found_target = __scsi_find_target(parent, channel, id);
 	if (found_target)
 		goto found;
 
 	list_add_tail(&starget->siblings, &shost->__targets);
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->target_lock, flags);
 	/* allocate and add */
 	transport_setup_device(dev);
 	if (shost->hostt->target_alloc) {
@@ -1854,15 +1854,15 @@ void scsi_forget_host(struct Scsi_Host *shost)
 	unsigned long flags;
 
  restart:
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->device_lock, flags);
 	list_for_each_entry(sdev, &shost->__devices, siblings) {
 		if (sdev->sdev_state == SDEV_DEL)
 			continue;
-		spin_unlock_irqrestore(shost->host_lock, flags);
+		spin_unlock_irqrestore(&shost->device_lock, flags);
 		__scsi_remove_device(sdev);
 		goto restart;
 	}
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->device_lock, flags);
 }
 
 /**
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index b333389..d7afea9 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1131,20 +1131,18 @@ static void __scsi_remove_target(struct scsi_target *starget)
 	unsigned long flags;
 	struct scsi_device *sdev;
 
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->device_lock, flags);
  restart:
 	list_for_each_entry(sdev, &shost->__devices, siblings) {
 		if (sdev->channel != starget->channel ||
 		    sdev->id != starget->id ||
 		    scsi_device_get(sdev))
 			continue;
-		spin_unlock_irqrestore(shost->host_lock, flags);
 		scsi_remove_device(sdev);
 		scsi_device_put(sdev);
-		spin_lock_irqsave(shost->host_lock, flags);
 		goto restart;
 	}
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->device_lock, flags);
 }
 
 /**
@@ -1164,22 +1162,22 @@ void scsi_remove_target(struct device *dev)
 	/* remove targets being careful to lookup next entry before
 	 * deleting the last
 	 */
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->target_lock, flags);
 	list_for_each_entry(starget, &shost->__targets, siblings) {
 		if (starget->state == STARGET_DEL)
 			continue;
 		if (starget->dev.parent == dev || &starget->dev == dev) {
 			/* assuming new targets arrive at the end */
 			kref_get(&starget->reap_ref);
-			spin_unlock_irqrestore(shost->host_lock, flags);
+			spin_unlock_irqrestore(&shost->target_lock, flags);
 			if (last)
 				scsi_target_reap(last);
 			last = starget;
 			__scsi_remove_target(starget);
-			spin_lock_irqsave(shost->host_lock, flags);
+			spin_lock_irqsave(&shost->target_lock, flags);
 		}
 	}
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->target_lock, flags);
 
 	if (last)
 		scsi_target_reap(last);
@@ -1262,10 +1260,10 @@ void scsi_sysfs_device_initialize(struct scsi_device *sdev)
 		sdev->lun_in_cdb = 1;
 
 	transport_setup_device(&sdev->sdev_gendev);
-	spin_lock_irqsave(shost->host_lock, flags);
+	spin_lock_irqsave(&shost->device_lock, flags);
 	list_add_tail(&sdev->same_target_siblings, &starget->devices);
 	list_add_tail(&sdev->siblings, &shost->__devices);
-	spin_unlock_irqrestore(shost->host_lock, flags);
+	spin_unlock_irqrestore(&shost->device_lock, flags);
 	/*
 	 * device can now only be removed via __scsi_remove_device() so hold
 	 * the target.  Target will be held in CREATED state until something
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index e113c75..764020a 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -558,6 +558,8 @@ struct Scsi_Host {
 
 	spinlock_t		default_lock;
 	spinlock_t		*host_lock;
+	spinlock_t		device_lock; /* protects the __devices list */
+	spinlock_t		target_lock; /* protects the __targets list */
 
 	struct mutex		scan_mutex;/* serialize scanning activity */
 
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] SCSI: Rework list handling in scsi_target_remove
  2015-10-14 13:50 [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() Johannes Thumshirn
  2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
@ 2015-10-14 13:50 ` Johannes Thumshirn
  2015-10-14 14:18   ` Hannes Reinecke
  2015-10-14 13:50 ` [PATCH 3/3] SCSI: Rework list handling in __scsi_target_remove Johannes Thumshirn
  2015-10-14 14:30 ` [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() James Bottomley
  3 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 13:50 UTC (permalink / raw)
  To: James E.J. Bottomley, Christoph Hellwig, Hannes Reinecke
  Cc: linux-scsi, linux-kernel, Johannes Thumshirn

Rework the list handling in scsi_target_remove(). The new version introduces a
reap list for targets. Targets that shall be removed are placed on the reap
list and can then be reaped later on.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/scsi_sysfs.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index d7afea9..b41dcb3 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1156,31 +1156,28 @@ static void __scsi_remove_target(struct scsi_target *starget)
 void scsi_remove_target(struct device *dev)
 {
 	struct Scsi_Host *shost = dev_to_shost(dev->parent);
-	struct scsi_target *starget, *last = NULL;
+	struct scsi_target *starget, *tmp;
 	unsigned long flags;
+	LIST_HEAD(reap_list);
 
 	/* remove targets being careful to lookup next entry before
 	 * deleting the last
 	 */
 	spin_lock_irqsave(&shost->target_lock, flags);
-	list_for_each_entry(starget, &shost->__targets, siblings) {
+	list_for_each_entry_safe(starget, tmp, &shost->__targets, siblings) {
 		if (starget->state == STARGET_DEL)
 			continue;
 		if (starget->dev.parent == dev || &starget->dev == dev) {
-			/* assuming new targets arrive at the end */
 			kref_get(&starget->reap_ref);
-			spin_unlock_irqrestore(&shost->target_lock, flags);
-			if (last)
-				scsi_target_reap(last);
-			last = starget;
-			__scsi_remove_target(starget);
-			spin_lock_irqsave(&shost->target_lock, flags);
+			list_move_tail(&starget->siblings, &reap_list);
 		}
 	}
 	spin_unlock_irqrestore(&shost->target_lock, flags);
 
-	if (last)
-		scsi_target_reap(last);
+	list_for_each_entry_safe(starget, tmp, &reap_list, siblings) {
+		__scsi_remove_target(starget);
+		scsi_target_reap(starget);
+	}
 }
 EXPORT_SYMBOL(scsi_remove_target);
 
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3]  SCSI: Rework list handling in __scsi_target_remove
  2015-10-14 13:50 [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() Johannes Thumshirn
  2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
  2015-10-14 13:50 ` [PATCH 2/3] SCSI: Rework list handling in scsi_target_remove Johannes Thumshirn
@ 2015-10-14 13:50 ` Johannes Thumshirn
  2015-10-14 14:19   ` Hannes Reinecke
  2015-10-14 14:30 ` [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() James Bottomley
  3 siblings, 1 reply; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 13:50 UTC (permalink / raw)
  To: James E.J. Bottomley, Christoph Hellwig, Hannes Reinecke
  Cc: linux-scsi, linux-kernel, Johannes Thumshirn

Rework the list handling in __scsi_target_remove(). The new version introduces
a reap list for devices. Devices that shall be removed are placed on the reap
list and can then be removed later on.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/scsi_sysfs.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index b41dcb3..2bd88c6 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1129,20 +1129,24 @@ static void __scsi_remove_target(struct scsi_target *starget)
 {
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	unsigned long flags;
-	struct scsi_device *sdev;
+	struct scsi_device *sdev, *tmp;
+	LIST_HEAD(reap_list);
 
 	spin_lock_irqsave(&shost->device_lock, flags);
  restart:
-	list_for_each_entry(sdev, &shost->__devices, siblings) {
+	list_for_each_entry_safe(sdev, tmp, &shost->__devices, siblings) {
 		if (sdev->channel != starget->channel ||
 		    sdev->id != starget->id ||
 		    scsi_device_get(sdev))
 			continue;
-		scsi_remove_device(sdev);
+		list_move_tail(&sdev->siblings, &reap_list);
 		scsi_device_put(sdev);
 		goto restart;
 	}
 	spin_unlock_irqrestore(&shost->device_lock, flags);
+
+	list_for_each_entry_safe(sdev, tmp, &reap_list, siblings)
+		scsi_remove_device(sdev);
 }
 
 /**
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host
  2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
@ 2015-10-14 14:14   ` Hannes Reinecke
  2015-10-14 14:17   ` Hannes Reinecke
  2015-10-14 14:35     ` kbuild test robot
  2 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2015-10-14 14:14 UTC (permalink / raw)
  To: Johannes Thumshirn, James E.J. Bottomley, Christoph Hellwig
  Cc: linux-scsi, linux-kernel

On 10/14/2015 03:50 PM, Johannes Thumshirn wrote:
> Introduce target_lock and device_lock to untangle the __devices and __targets
> lists from the host_lock.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/53c700.c     |  3 +++
>  drivers/scsi/hosts.c      |  2 ++
>  drivers/scsi/scsi.c       |  8 ++++----
>  drivers/scsi/scsi_scan.c  | 10 +++++-----
>  drivers/scsi/scsi_sysfs.c | 18 ++++++++----------
>  include/scsi/scsi_host.h  |  2 ++
>  6 files changed, 24 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/scsi/53c700.c b/drivers/scsi/53c700.c
> index a209c34..e2b4d04 100644
> --- a/drivers/scsi/53c700.c
> +++ b/drivers/scsi/53c700.c
> @@ -1093,6 +1093,7 @@ process_script_interrupt(__u32 dsps, __u32 dsp, struct scsi_cmnd *SCp,
>  		struct NCR_700_command_slot *slot;
>  		__u8 reselection_id = hostdata->reselection_id;
>  		struct scsi_device *SDp;
> +		unsigned long flags;
>  
>  		lun = hostdata->msgin[0] & 0x1f;
>  
> @@ -1100,7 +1101,9 @@ process_script_interrupt(__u32 dsps, __u32 dsp, struct scsi_cmnd *SCp,
>  		DEBUG(("scsi%d: (%d:%d) RESELECTED!\n",
>  		       host->host_no, reselection_id, lun));
>  		/* clear the reselection indicator */
> +		spin_lock_irqsave(host->device_lock, flags);
>  		SDp = __scsi_device_lookup(host, 0, reselection_id, lun);
> +		spin_unlock_irqrestore(host->device_lock, flags);
>  		if(unlikely(SDp == NULL)) {
>  			printk(KERN_ERR "scsi%d: (%d:%d) HAS NO device\n",
>  			       host->host_no, reselection_id, lun);

Hmm. Unfortunate.

Can't we get rid of the underscore version altogether, seeing that
we have our own lock now?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host
  2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
  2015-10-14 14:14   ` Hannes Reinecke
@ 2015-10-14 14:17   ` Hannes Reinecke
  2015-10-14 14:35     ` kbuild test robot
  2 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2015-10-14 14:17 UTC (permalink / raw)
  To: Johannes Thumshirn, James E.J. Bottomley, Christoph Hellwig
  Cc: linux-scsi, linux-kernel

On 10/14/2015 03:50 PM, Johannes Thumshirn wrote:
> Introduce target_lock and device_lock to untangle the __devices and __targets
> lists from the host_lock.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
Actually, you need to convert scsi_lookup_by_target, too.

That, too, relies on the host_lock.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] SCSI: Rework list handling in scsi_target_remove
  2015-10-14 13:50 ` [PATCH 2/3] SCSI: Rework list handling in scsi_target_remove Johannes Thumshirn
@ 2015-10-14 14:18   ` Hannes Reinecke
  0 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2015-10-14 14:18 UTC (permalink / raw)
  To: Johannes Thumshirn, James E.J. Bottomley, Christoph Hellwig
  Cc: linux-scsi, linux-kernel

On 10/14/2015 03:50 PM, Johannes Thumshirn wrote:
> Rework the list handling in scsi_target_remove(). The new version introduces a
> reap list for targets. Targets that shall be removed are placed on the reap
> list and can then be reaped later on.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/scsi_sysfs.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index d7afea9..b41dcb3 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1156,31 +1156,28 @@ static void __scsi_remove_target(struct scsi_target *starget)
>  void scsi_remove_target(struct device *dev)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(dev->parent);
> -	struct scsi_target *starget, *last = NULL;
> +	struct scsi_target *starget, *tmp;
>  	unsigned long flags;
> +	LIST_HEAD(reap_list);
>  
>  	/* remove targets being careful to lookup next entry before
>  	 * deleting the last
>  	 */
>  	spin_lock_irqsave(&shost->target_lock, flags);
> -	list_for_each_entry(starget, &shost->__targets, siblings) {
> +	list_for_each_entry_safe(starget, tmp, &shost->__targets, siblings) {
>  		if (starget->state == STARGET_DEL)
>  			continue;
Hmm. This can even be converted into a WARN_ON(), as with this patch
there should _never_ be an deleted target on the list.

Other than that:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3]  SCSI: Rework list handling in __scsi_target_remove
  2015-10-14 13:50 ` [PATCH 3/3] SCSI: Rework list handling in __scsi_target_remove Johannes Thumshirn
@ 2015-10-14 14:19   ` Hannes Reinecke
  0 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2015-10-14 14:19 UTC (permalink / raw)
  To: Johannes Thumshirn, James E.J. Bottomley, Christoph Hellwig
  Cc: linux-scsi, linux-kernel

On 10/14/2015 03:50 PM, Johannes Thumshirn wrote:
> Rework the list handling in __scsi_target_remove(). The new version introduces
> a reap list for devices. Devices that shall be removed are placed on the reap
> list and can then be removed later on.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/scsi_sysfs.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index b41dcb3..2bd88c6 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1129,20 +1129,24 @@ static void __scsi_remove_target(struct scsi_target *starget)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
>  	unsigned long flags;
> -	struct scsi_device *sdev;
> +	struct scsi_device *sdev, *tmp;
> +	LIST_HEAD(reap_list);
>  
>  	spin_lock_irqsave(&shost->device_lock, flags);
>   restart:
> -	list_for_each_entry(sdev, &shost->__devices, siblings) {
> +	list_for_each_entry_safe(sdev, tmp, &shost->__devices, siblings) {
>  		if (sdev->channel != starget->channel ||
>  		    sdev->id != starget->id ||
>  		    scsi_device_get(sdev))
>  			continue;
> -		scsi_remove_device(sdev);
> +		list_move_tail(&sdev->siblings, &reap_list);
>  		scsi_device_put(sdev);
>  		goto restart;
>  	}
>  	spin_unlock_irqrestore(&shost->device_lock, flags);
> +
> +	list_for_each_entry_safe(sdev, tmp, &reap_list, siblings)
> +		scsi_remove_device(sdev);
>  }
>  
>  /**
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 13:50 [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2015-10-14 13:50 ` [PATCH 3/3] SCSI: Rework list handling in __scsi_target_remove Johannes Thumshirn
@ 2015-10-14 14:30 ` James Bottomley
  2015-10-14 14:39   ` Johannes Thumshirn
  2015-10-14 20:22   ` Ewan Milne
  3 siblings, 2 replies; 20+ messages in thread
From: James Bottomley @ 2015-10-14 14:30 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-kernel

On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
> Removing a SCSI target via scsi_remove_target() suspected to be racy. When a
> sibling get's removed from the list it can occassionly happen that one CPU is
> stuck endlessly looping around this code block
> 
> list_for_each_entry(starget, &shost->__targets, siblings) {
>         if (starget->state == STARGET_DEL)
>                 continue;

How long is the __targets list?  It seems a bit unlikely that this is
the exact cause, because for a short list all in STARGET_DEL that loop
should exit very quickly.  Where in the code does scsi_remove_target
+0x68/0x240 actually point to?

Is it not a bit more likely that we're following a removed list element?
Since that points back to itself, the list_for_each_entry() would then
circulate forever.  If that's the case the simple fix would be to use
the safe version of the list traversal macro.

James


> Resulting in the following hard lockup.
> 
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [...]
> Call Trace:
>  [<ffffffff8100471d>] dump_trace+0x7d/0x2d0
>  [<ffffffff81004a04>] show_stack_log_lvl+0x94/0x170
>  [<ffffffff81005cc1>] show_stack+0x21/0x50
>  [<ffffffff8151aa75>] dump_stack+0x41/0x51
>  [<ffffffff8151545a>] panic+0xc8/0x1d7
>  [<ffffffff810fbdda>] watchdog_overflow_callback+0xba/0xc0
>  [<ffffffff811336c8>] __perf_event_overflow+0x88/0x240
>  [<ffffffff8101e3aa>] intel_pmu_handle_irq+0x1fa/0x3e0
>  [<ffffffff81522836>] perf_event_nmi_handler+0x26/0x40
>  [<ffffffff81521fcd>] nmi_handle.isra.2+0x8d/0x180
>  [<ffffffff815221e6>] do_nmi+0x126/0x3c0
>  [<ffffffff8152159b>] end_repeat_nmi+0x1a/0x1e
>  [<ffffffffa00212e8>] scsi_remove_target+0x68/0x240 [scsi_mod]
>  [<ffffffff81072742>] process_one_work+0x172/0x420
>  [<ffffffff810733ba>] worker_thread+0x11a/0x3c0
>  [<ffffffff81079d34>] kthread+0xb4/0xc0
>  [<ffffffff81528cd8>] ret_from_fork+0x58/0x90
> 
> This series attacks the issue by 1) decoupling the __targets and __devices
> lists of struct Scsi_Host from the host_lock spinlock by introducing spinlocks
> for each list and 2) de-coupling the list traversals needed for detecting
> targets/devices to be removed from the actual removal by moving list entries to
> be deleted to a second list and perform the deletion there.
> 
> 
> The whole series survived a nearly 48h test loop of:
> while [ $not_done  ]; do
> 	umount $mountpoint;
> 	rmmod $module;
> 	modprobe $module;
> 	mount $mountpoint;
> done
> 
> This is a follow up of the patch proposed here:
> http://marc.info/?l=linux-scsi&m=144377409311774&w=2
> incorporating Christoph's comment
> 
> Johannes Thumshirn (3):
>   SCSI: Introduce device_lock and target_lock in Scsi_Host
>   SCSI: Rework list handling in scsi_target_remove
>   SCSI: Rework list handling in __scsi_target_remove
> 
>  drivers/scsi/53c700.c     |  3 +++
>  drivers/scsi/hosts.c      |  2 ++
>  drivers/scsi/scsi.c       |  8 ++++----
>  drivers/scsi/scsi_scan.c  | 10 +++++-----
>  drivers/scsi/scsi_sysfs.c | 43 +++++++++++++++++++++----------------------
>  include/scsi/scsi_host.h  |  2 ++
>  6 files changed, 37 insertions(+), 31 deletions(-)
> 




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host
  2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
@ 2015-10-14 14:35     ` kbuild test robot
  2015-10-14 14:17   ` Hannes Reinecke
  2015-10-14 14:35     ` kbuild test robot
  2 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2015-10-14 14:35 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: kbuild-all, James E.J. Bottomley, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-kernel, Johannes Thumshirn

[-- Attachment #1: Type: text/plain, Size: 3154 bytes --]

Hi Johannes,

[auto build test ERROR on scsi/for-next -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/SCSI-Introduce-device_lock-and-target_lock-in-Scsi_Host/20151014-215532
config: i386-allmodconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mmzone.h:7:0,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/scsi/53c700.c:120:
   drivers/scsi/53c700.c: In function 'process_script_interrupt':
>> drivers/scsi/53c700.c:1104:21: error: incompatible type for argument 1 of 'spinlock_check'
      spin_lock_irqsave(host->device_lock, flags);
                        ^
   include/linux/spinlock.h:208:34: note: in definition of macro 'raw_spin_lock_irqsave'
      flags = _raw_spin_lock_irqsave(lock); \
                                     ^
>> drivers/scsi/53c700.c:1104:3: note: in expansion of macro 'spin_lock_irqsave'
      spin_lock_irqsave(host->device_lock, flags);
      ^
   In file included from include/linux/mmzone.h:7:0,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/scsi/53c700.c:120:
   include/linux/spinlock.h:289:40: note: expected 'spinlock_t * {aka struct spinlock *}' but argument is of type 'spinlock_t {aka struct spinlock}'
    static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
                                           ^
>> drivers/scsi/53c700.c:1106:26: error: incompatible type for argument 1 of 'spin_unlock_irqrestore'
      spin_unlock_irqrestore(host->device_lock, flags);
                             ^
   In file included from include/linux/mmzone.h:7:0,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/scsi/53c700.c:120:
   include/linux/spinlock.h:360:29: note: expected 'spinlock_t * {aka struct spinlock *}' but argument is of type 'spinlock_t {aka struct spinlock}'
    static __always_inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
                                ^

vim +/spinlock_check +1104 drivers/scsi/53c700.c

  1098			lun = hostdata->msgin[0] & 0x1f;
  1099	
  1100			hostdata->reselection_id = 0xff;
  1101			DEBUG(("scsi%d: (%d:%d) RESELECTED!\n",
  1102			       host->host_no, reselection_id, lun));
  1103			/* clear the reselection indicator */
> 1104			spin_lock_irqsave(host->device_lock, flags);
  1105			SDp = __scsi_device_lookup(host, 0, reselection_id, lun);
> 1106			spin_unlock_irqrestore(host->device_lock, flags);
  1107			if(unlikely(SDp == NULL)) {
  1108				printk(KERN_ERR "scsi%d: (%d:%d) HAS NO device\n",
  1109				       host->host_no, reselection_id, lun);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 51499 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host
@ 2015-10-14 14:35     ` kbuild test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2015-10-14 14:35 UTC (permalink / raw)
  Cc: kbuild-all, James E.J. Bottomley, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-kernel, Johannes Thumshirn

[-- Attachment #1: Type: text/plain, Size: 3154 bytes --]

Hi Johannes,

[auto build test ERROR on scsi/for-next -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/Johannes-Thumshirn/SCSI-Introduce-device_lock-and-target_lock-in-Scsi_Host/20151014-215532
config: i386-allmodconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mmzone.h:7:0,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/scsi/53c700.c:120:
   drivers/scsi/53c700.c: In function 'process_script_interrupt':
>> drivers/scsi/53c700.c:1104:21: error: incompatible type for argument 1 of 'spinlock_check'
      spin_lock_irqsave(host->device_lock, flags);
                        ^
   include/linux/spinlock.h:208:34: note: in definition of macro 'raw_spin_lock_irqsave'
      flags = _raw_spin_lock_irqsave(lock); \
                                     ^
>> drivers/scsi/53c700.c:1104:3: note: in expansion of macro 'spin_lock_irqsave'
      spin_lock_irqsave(host->device_lock, flags);
      ^
   In file included from include/linux/mmzone.h:7:0,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/scsi/53c700.c:120:
   include/linux/spinlock.h:289:40: note: expected 'spinlock_t * {aka struct spinlock *}' but argument is of type 'spinlock_t {aka struct spinlock}'
    static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
                                           ^
>> drivers/scsi/53c700.c:1106:26: error: incompatible type for argument 1 of 'spin_unlock_irqrestore'
      spin_unlock_irqrestore(host->device_lock, flags);
                             ^
   In file included from include/linux/mmzone.h:7:0,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/scsi/53c700.c:120:
   include/linux/spinlock.h:360:29: note: expected 'spinlock_t * {aka struct spinlock *}' but argument is of type 'spinlock_t {aka struct spinlock}'
    static __always_inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
                                ^

vim +/spinlock_check +1104 drivers/scsi/53c700.c

  1098			lun = hostdata->msgin[0] & 0x1f;
  1099	
  1100			hostdata->reselection_id = 0xff;
  1101			DEBUG(("scsi%d: (%d:%d) RESELECTED!\n",
  1102			       host->host_no, reselection_id, lun));
  1103			/* clear the reselection indicator */
> 1104			spin_lock_irqsave(host->device_lock, flags);
  1105			SDp = __scsi_device_lookup(host, 0, reselection_id, lun);
> 1106			spin_unlock_irqrestore(host->device_lock, flags);
  1107			if(unlikely(SDp == NULL)) {
  1108				printk(KERN_ERR "scsi%d: (%d:%d) HAS NO device\n",
  1109				       host->host_no, reselection_id, lun);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 51499 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 14:30 ` [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() James Bottomley
@ 2015-10-14 14:39   ` Johannes Thumshirn
  2015-10-14 15:45     ` James Bottomley
  2015-10-14 16:12     ` Christoph Hellwig
  2015-10-14 20:22   ` Ewan Milne
  1 sibling, 2 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 14:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-kernel

On Wed, 2015-10-14 at 07:30 -0700, James Bottomley wrote:
> On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
> > Removing a SCSI target via scsi_remove_target() suspected to be
> > racy. When a
> > sibling get's removed from the list it can occassionly happen that
> > one CPU is
> > stuck endlessly looping around this code block
> > 
> > list_for_each_entry(starget, &shost->__targets, siblings) {
> >         if (starget->state == STARGET_DEL)
> >                 continue;
> 
> How long is the __targets list?  It seems a bit unlikely that this is
> the exact cause, because for a short list all in STARGET_DEL that
> loop
> should exit very quickly.  Where in the code does scsi_remove_target
> +0x68/0x240 actually point to?
> 
> Is it not a bit more likely that we're following a removed list
> element?
> Since that points back to itself, the list_for_each_entry() would
> then
> circulate forever.  If that's the case the simple fix would be to use
> the safe version of the list traversal macro.

Yes it is traversing a removed element and yes the patches 2/3 and 3/3
are introducing the safe version of list_for_each_entry(), but they
also decouple the search for elements to be removed from the actual
removal. This is what my initial proposal did as well. Christoph wanted
me to decouple the whole process from the host_lock though and this is
what this patches do as well now.

> 
> James
> 
> 
> > Resulting in the following hard lockup.
> > 
> > Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> > [...]
> > Call Trace:
> >  [<ffffffff8100471d>] dump_trace+0x7d/0x2d0
> >  [<ffffffff81004a04>] show_stack_log_lvl+0x94/0x170
> >  [<ffffffff81005cc1>] show_stack+0x21/0x50
> >  [<ffffffff8151aa75>] dump_stack+0x41/0x51
> >  [<ffffffff8151545a>] panic+0xc8/0x1d7
> >  [<ffffffff810fbdda>] watchdog_overflow_callback+0xba/0xc0
> >  [<ffffffff811336c8>] __perf_event_overflow+0x88/0x240
> >  [<ffffffff8101e3aa>] intel_pmu_handle_irq+0x1fa/0x3e0
> >  [<ffffffff81522836>] perf_event_nmi_handler+0x26/0x40
> >  [<ffffffff81521fcd>] nmi_handle.isra.2+0x8d/0x180
> >  [<ffffffff815221e6>] do_nmi+0x126/0x3c0
> >  [<ffffffff8152159b>] end_repeat_nmi+0x1a/0x1e
> >  [<ffffffffa00212e8>] scsi_remove_target+0x68/0x240 [scsi_mod]
> >  [<ffffffff81072742>] process_one_work+0x172/0x420
> >  [<ffffffff810733ba>] worker_thread+0x11a/0x3c0
> >  [<ffffffff81079d34>] kthread+0xb4/0xc0
> >  [<ffffffff81528cd8>] ret_from_fork+0x58/0x90
> > 
> > This series attacks the issue by 1) decoupling the __targets and
> > __devices
> > lists of struct Scsi_Host from the host_lock spinlock by
> > introducing spinlocks
> > for each list and 2) de-coupling the list traversals needed for
> > detecting
> > targets/devices to be removed from the actual removal by moving
> > list entries to
> > be deleted to a second list and perform the deletion there.
> > 
> > 
> > The whole series survived a nearly 48h test loop of:
> > while [ $not_done  ]; do
> > 	umount $mountpoint;
> > 	rmmod $module;
> > 	modprobe $module;
> > 	mount $mountpoint;
> > done
> > 
> > This is a follow up of the patch proposed here:
> > http://marc.info/?l=linux-scsi&m=144377409311774&w=2
> > incorporating Christoph's comment
> > 
> > Johannes Thumshirn (3):
> >   SCSI: Introduce device_lock and target_lock in Scsi_Host
> >   SCSI: Rework list handling in scsi_target_remove
> >   SCSI: Rework list handling in __scsi_target_remove
> > 
> >  drivers/scsi/53c700.c     |  3 +++
> >  drivers/scsi/hosts.c      |  2 ++
> >  drivers/scsi/scsi.c       |  8 ++++----
> >  drivers/scsi/scsi_scan.c  | 10 +++++-----
> >  drivers/scsi/scsi_sysfs.c | 43 +++++++++++++++++++++--------------
> > --------
> >  include/scsi/scsi_host.h  |  2 ++
> >  6 files changed, 37 insertions(+), 31 deletions(-)
> > 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 14:39   ` Johannes Thumshirn
@ 2015-10-14 15:45     ` James Bottomley
  2015-10-14 17:36       ` Johannes Thumshirn
  2015-10-14 18:18       ` Christoph Hellwig
  2015-10-14 16:12     ` Christoph Hellwig
  1 sibling, 2 replies; 20+ messages in thread
From: James Bottomley @ 2015-10-14 15:45 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-kernel

On Wed, 2015-10-14 at 16:39 +0200, Johannes Thumshirn wrote:
> On Wed, 2015-10-14 at 07:30 -0700, James Bottomley wrote:
> > On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
> > > Removing a SCSI target via scsi_remove_target() suspected to be
> > > racy. When a
> > > sibling get's removed from the list it can occassionly happen that
> > > one CPU is
> > > stuck endlessly looping around this code block
> > > 
> > > list_for_each_entry(starget, &shost->__targets, siblings) {
> > >         if (starget->state == STARGET_DEL)
> > >                 continue;
> > 
> > How long is the __targets list?  It seems a bit unlikely that this is
> > the exact cause, because for a short list all in STARGET_DEL that
> > loop
> > should exit very quickly.  Where in the code does scsi_remove_target
> > +0x68/0x240 actually point to?
> > 
> > Is it not a bit more likely that we're following a removed list
> > element?
> > Since that points back to itself, the list_for_each_entry() would
> > then
> > circulate forever.  If that's the case the simple fix would be to use
> > the safe version of the list traversal macro.
> 
> Yes it is traversing a removed element and yes the patches 2/3 and 3/3
> are introducing the safe version of list_for_each_entry(), but they
> also decouple the search for elements to be removed from the actual
> removal. This is what my initial proposal did as well. Christoph wanted
> me to decouple the whole process from the host_lock though and this is
> what this patches do as well now.

OK, so I really need you to separate the problems.  Fixing the bug
you're reporting does not require a complete rework of the locking
infrastructure; it just requires replacing the traversal macro with the
safe version, can you verify that and it can go into fixes?

Then we can discuss the merits of doing a locking rework in this area
separately from the idea that it's some sort of bug fix.

James



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 14:39   ` Johannes Thumshirn
  2015-10-14 15:45     ` James Bottomley
@ 2015-10-14 16:12     ` Christoph Hellwig
  2015-10-14 17:34       ` Johannes Thumshirn
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2015-10-14 16:12 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: James Bottomley, Christoph Hellwig, Hannes Reinecke, linux-scsi,
	linux-kernel

On Wed, Oct 14, 2015 at 04:39:07PM +0200, Johannes Thumshirn wrote:
> removal. This is what my initial proposal did as well. Christoph wanted
> me to decouple the whole process from the host_lock though and this is
> what this patches do as well now.

I think we've talked past each other, I didn't intend to say that and
my replies to you don't seem to imply that either.

I'll draft a patch for what I meant.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 16:12     ` Christoph Hellwig
@ 2015-10-14 17:34       ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 17:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Hannes Reinecke, linux-scsi, linux-kernel

Zitat von Christoph Hellwig <hch@infradead.org>:

> On Wed, Oct 14, 2015 at 04:39:07PM +0200, Johannes Thumshirn wrote:
>> removal. This is what my initial proposal did as well. Christoph wanted
>> me to decouple the whole process from the host_lock though and this is
>> what this patches do as well now.
>
> I think we've talked past each other, I didn't intend to say that and
> my replies to you don't seem to imply that either.

OK, I think I misunderstood you then.

>
> I'll draft a patch for what I meant.
>




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 15:45     ` James Bottomley
@ 2015-10-14 17:36       ` Johannes Thumshirn
  2015-10-14 18:18       ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-14 17:36 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-kernel

Zitat von James Bottomley <James.Bottomley@HansenPartnership.com>:

> On Wed, 2015-10-14 at 16:39 +0200, Johannes Thumshirn wrote:
>> On Wed, 2015-10-14 at 07:30 -0700, James Bottomley wrote:
>> > On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
>> > > Removing a SCSI target via scsi_remove_target() suspected to be
>> > > racy. When a
>> > > sibling get's removed from the list it can occassionly happen that
>> > > one CPU is
>> > > stuck endlessly looping around this code block
>> > >
>> > > list_for_each_entry(starget, &shost->__targets, siblings) {
>> > >         if (starget->state == STARGET_DEL)
>> > >                 continue;
>> >
>> > How long is the __targets list?  It seems a bit unlikely that this is
>> > the exact cause, because for a short list all in STARGET_DEL that
>> > loop
>> > should exit very quickly.  Where in the code does scsi_remove_target
>> > +0x68/0x240 actually point to?
>> >
>> > Is it not a bit more likely that we're following a removed list
>> > element?
>> > Since that points back to itself, the list_for_each_entry() would
>> > then
>> > circulate forever.  If that's the case the simple fix would be to use
>> > the safe version of the list traversal macro.
>>
>> Yes it is traversing a removed element and yes the patches 2/3 and 3/3
>> are introducing the safe version of list_for_each_entry(), but they
>> also decouple the search for elements to be removed from the actual
>> removal. This is what my initial proposal did as well. Christoph wanted
>> me to decouple the whole process from the host_lock though and this is
>> what this patches do as well now.
>
> OK, so I really need you to separate the problems.  Fixing the bug
> you're reporting does not require a complete rework of the locking
> infrastructure; it just requires replacing the traversal macro with the
> safe version, can you verify that and it can go into fixes?
>

Yes. I can sent a patch for it tomorrow.



> Then we can discuss the merits of doing a locking rework in this area
> separately from the idea that it's some sort of bug fix.
>
> James
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 15:45     ` James Bottomley
  2015-10-14 17:36       ` Johannes Thumshirn
@ 2015-10-14 18:18       ` Christoph Hellwig
  2015-10-16 11:24         ` Johannes Thumshirn
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:18 UTC (permalink / raw)
  To: James Bottomley
  Cc: Johannes Thumshirn, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-kernel

On Wed, Oct 14, 2015 at 08:45:56AM -0700, James Bottomley wrote:
> OK, so I really need you to separate the problems.  Fixing the bug
> you're reporting does not require a complete rework of the locking
> infrastructure; it just requires replacing the traversal macro with the
> safe version, can you verify that and it can go into fixes?

_safe only protects against deletions from yourself, it does not protect
against other threads once a lock is dropped.  After auditing the
target reap code I fear the list_move trick isn't safe either, as
scsi_target_alloc relies on a being able to find a target that is
currently in the process of being deleted.  So the only safe variant
we have is to keep the same sequence we currently have and restart the
loop once we've deleted the target.  Given that we'd normally only
ever delete a single target anyway (not sure when we'd even get a second
one ever) this does not seem to be a major efficieny problem.

Johannes, can you test the patch below?

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index b333389f..d3b34d8 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1158,31 +1158,23 @@ static void __scsi_remove_target(struct scsi_target *starget)
 void scsi_remove_target(struct device *dev)
 {
 	struct Scsi_Host *shost = dev_to_shost(dev->parent);
-	struct scsi_target *starget, *last = NULL;
+	struct scsi_target *starget;
 	unsigned long flags;
 
-	/* remove targets being careful to lookup next entry before
-	 * deleting the last
-	 */
+restart:
 	spin_lock_irqsave(shost->host_lock, flags);
 	list_for_each_entry(starget, &shost->__targets, siblings) {
 		if (starget->state == STARGET_DEL)
 			continue;
 		if (starget->dev.parent == dev || &starget->dev == dev) {
-			/* assuming new targets arrive at the end */
 			kref_get(&starget->reap_ref);
 			spin_unlock_irqrestore(shost->host_lock, flags);
-			if (last)
-				scsi_target_reap(last);
-			last = starget;
 			__scsi_remove_target(starget);
-			spin_lock_irqsave(shost->host_lock, flags);
+			scsi_target_reap(starget);
+			goto restart;
 		}
 	}
 	spin_unlock_irqrestore(shost->host_lock, flags);
-
-	if (last)
-		scsi_target_reap(last);
 }
 EXPORT_SYMBOL(scsi_remove_target);
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 14:30 ` [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() James Bottomley
  2015-10-14 14:39   ` Johannes Thumshirn
@ 2015-10-14 20:22   ` Ewan Milne
  2015-10-15  7:07     ` Johannes Thumshirn
  1 sibling, 1 reply; 20+ messages in thread
From: Ewan Milne @ 2015-10-14 20:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Johannes Thumshirn, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-kernel

On Wed, 2015-10-14 at 07:30 -0700, James Bottomley wrote:
> On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
> > Removing a SCSI target via scsi_remove_target() suspected to be racy. When a
> > sibling get's removed from the list it can occassionly happen that one CPU is
> > stuck endlessly looping around this code block
> > 
> > list_for_each_entry(starget, &shost->__targets, siblings) {
> >         if (starget->state == STARGET_DEL)
> >                 continue;
> 
> How long is the __targets list?  It seems a bit unlikely that this is
> the exact cause, because for a short list all in STARGET_DEL that loop
> should exit very quickly.  Where in the code does scsi_remove_target
> +0x68/0x240 actually point to?
> 
> Is it not a bit more likely that we're following a removed list element?
> Since that points back to itself, the list_for_each_entry() would then
> circulate forever.  If that's the case the simple fix would be to use
> the safe version of the list traversal macro.
> 
> James

For what it's worth, I've seen a dump where this was exactly the case.
starget was in STARGET_DEL state, starget->siblings pointed to itself,
kref was 0, reap_ref was 0 (this was a while back).

The problem was not able to be reproduced at the time.

-Ewan



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 20:22   ` Ewan Milne
@ 2015-10-15  7:07     ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-15  7:07 UTC (permalink / raw)
  To: emilne, James Bottomley
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-kernel

On Wed, 2015-10-14 at 16:22 -0400, Ewan Milne wrote:
> On Wed, 2015-10-14 at 07:30 -0700, James Bottomley wrote:
> > On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
> > > Removing a SCSI target via scsi_remove_target() suspected to be
> > > racy. When a
> > > sibling get's removed from the list it can occassionly happen
> > > that one CPU is
> > > stuck endlessly looping around this code block
> > > 
> > > list_for_each_entry(starget, &shost->__targets, siblings) {
> > >         if (starget->state == STARGET_DEL)
> > >                 continue;
> > 
> > How long is the __targets list?  It seems a bit unlikely that this
> > is
> > the exact cause, because for a short list all in STARGET_DEL that
> > loop
> > should exit very quickly.  Where in the code does
> > scsi_remove_target
> > +0x68/0x240 actually point to?
> > 
> > Is it not a bit more likely that we're following a removed list
> > element?
> > Since that points back to itself, the list_for_each_entry() would
> > then
> > circulate forever.  If that's the case the simple fix would be to
> > use
> > the safe version of the list traversal macro.
> > 
> > James
> 
> For what it's worth, I've seen a dump where this was exactly the
> case.
> starget was in STARGET_DEL state, starget->siblings pointed to
> itself,
> kref was 0, reap_ref was 0 (this was a while back).
> 

That's exactly what I have here as well.

I'll give Christoph's patch a shot today and report back.

> The problem was not able to be reproduced at the time.
> 
> -Ewan
> 
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()
  2015-10-14 18:18       ` Christoph Hellwig
@ 2015-10-16 11:24         ` Johannes Thumshirn
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Thumshirn @ 2015-10-16 11:24 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Hannes Reinecke, linux-scsi, linux-kernel

On Wed, 2015-10-14 at 11:18 -0700, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 08:45:56AM -0700, James Bottomley wrote:
> > OK, so I really need you to separate the problems.  Fixing the bug

[..]

> 
> Johannes, can you test the patch below?

I've tested your patch and it doesn't show the lockup anymore, so far
so good. But it seems as if I have a problem in my test setup, because
I can't reproduce the bug on vanilla 4.3-rc5 either. I will ask the
original reporter if it is possible to test your patch on their side.

Appart from that it looks good to me (and much simpler than my changes)
.

> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index b333389f..d3b34d8 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1158,31 +1158,23 @@ static void __scsi_remove_target(struct
> scsi_target *starget)
>  void scsi_remove_target(struct device *dev)
>  {
>  	struct Scsi_Host *shost = dev_to_shost(dev->parent);
> -	struct scsi_target *starget, *last = NULL;
> +	struct scsi_target *starget;
>  	unsigned long flags;
>  
> -	/* remove targets being careful to lookup next entry before
> -	 * deleting the last
> -	 */
> +restart:
>  	spin_lock_irqsave(shost->host_lock, flags);
>  	list_for_each_entry(starget, &shost->__targets, siblings) {
>  		if (starget->state == STARGET_DEL)
>  			continue;
>  		if (starget->dev.parent == dev || &starget->dev ==
> dev) {
> -			/* assuming new targets arrive at the end */
>  			kref_get(&starget->reap_ref);
>  			spin_unlock_irqrestore(shost->host_lock,
> flags);
> -			if (last)
> -				scsi_target_reap(last);
> -			last = starget;
>  			__scsi_remove_target(starget);
> -			spin_lock_irqsave(shost->host_lock, flags);
> +			scsi_target_reap(starget);
> +			goto restart;
>  		}
>  	}
>  	spin_unlock_irqrestore(shost->host_lock, flags);
> -
> -	if (last)
> -		scsi_target_reap(last);
>  }
>  EXPORT_SYMBOL(scsi_remove_target);
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-10-16 11:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-14 13:50 [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() Johannes Thumshirn
2015-10-14 13:50 ` [PATCH 1/3] SCSI: Introduce device_lock and target_lock in Scsi_Host Johannes Thumshirn
2015-10-14 14:14   ` Hannes Reinecke
2015-10-14 14:17   ` Hannes Reinecke
2015-10-14 14:35   ` kbuild test robot
2015-10-14 14:35     ` kbuild test robot
2015-10-14 13:50 ` [PATCH 2/3] SCSI: Rework list handling in scsi_target_remove Johannes Thumshirn
2015-10-14 14:18   ` Hannes Reinecke
2015-10-14 13:50 ` [PATCH 3/3] SCSI: Rework list handling in __scsi_target_remove Johannes Thumshirn
2015-10-14 14:19   ` Hannes Reinecke
2015-10-14 14:30 ` [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() James Bottomley
2015-10-14 14:39   ` Johannes Thumshirn
2015-10-14 15:45     ` James Bottomley
2015-10-14 17:36       ` Johannes Thumshirn
2015-10-14 18:18       ` Christoph Hellwig
2015-10-16 11:24         ` Johannes Thumshirn
2015-10-14 16:12     ` Christoph Hellwig
2015-10-14 17:34       ` Johannes Thumshirn
2015-10-14 20:22   ` Ewan Milne
2015-10-15  7:07     ` Johannes Thumshirn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.