All of lore.kernel.org
 help / color / mirror / Atom feed
* Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-01 17:05 Andi Kleen
  2011-07-01 18:14 ` Dave Jones
  2011-07-01 19:20 ` James Bottomley
  0 siblings, 2 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-01 17:05 UTC (permalink / raw)
  To: linux-scsi, linux-kernel; +Cc: axboe, rjw

Hi,

I found I can reliably crash a 3.0 system by pulling the
USB cable of a mounted USB cdrom (or rather a USB device which
has a builtin fake CD-ROM) 

I suspect it's a regression too.

It ends with a NULL pointer reference on a NULL sdev in 
scsi_prep_state_check. 

Here's a somewhat incomplete backtrace (written down by hand)

scsi_prep_state_check
scsi_setup_blk_pc_cmnd
blk_peek_request
...
scsi_request_fn
...
ioctl_internal_command
...
scsi_set_medium_removal
sr_lock_door
cdrom_release
...
umount

I tried adding a 

	if (!sdev) 
		return BLKPREP_KILL;

to scsi_prep_state_check, but that caused a RCU CPU stall 
and a generally unhappy system instead.

The sdev must be still there in scsi_set_medium_removal because it's 
referenced, so it must get lost somewhere in SCSI or in the block layer.

Any ideas how to fix this?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 17:05 Linux 3.0 oopses when pulling a USB CDROM Andi Kleen
@ 2011-07-01 18:14 ` Dave Jones
  2011-07-01 18:32   ` Andi Kleen
  2011-07-01 20:29   ` James Bottomley
  2011-07-01 19:20 ` James Bottomley
  1 sibling, 2 replies; 63+ messages in thread
From: Dave Jones @ 2011-07-01 18:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw

On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:

 > I found I can reliably crash a 3.0 system by pulling the
 > USB cable of a mounted USB cdrom (or rather a USB device which
 > has a builtin fake CD-ROM) 
 > 
 > I suspect it's a regression too.

We've been seeing a lot of similar bugs in Fedora since we pushed
a 2.6.38.8 update.  Some of the traces are different, but some
look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)

The common cause seems to be 'device went away'. So USB CD drives,
USB memory sticks, and for some reason virtualbox shutdown.

	Dave


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 18:14 ` Dave Jones
@ 2011-07-01 18:32   ` Andi Kleen
  2011-07-01 18:40     ` Dave Jones
  2011-07-02 15:13     ` Christoph Fritz
  2011-07-01 20:29   ` James Bottomley
  1 sibling, 2 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-01 18:32 UTC (permalink / raw)
  To: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw

On Fri, Jul 01, 2011 at 02:14:06PM -0400, Dave Jones wrote:
> On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> 
>  > I found I can reliably crash a 3.0 system by pulling the
>  > USB cable of a mounted USB cdrom (or rather a USB device which
>  > has a builtin fake CD-ROM) 
>  > 
>  > I suspect it's a regression too.
> 
> We've been seeing a lot of similar bugs in Fedora since we pushed
> a 2.6.38.8 update.  Some of the traces are different, but some

I've also seen one with an earlier kernel where the NULL 
reference was in elv_.* something.  Don't have a written down backtrace
for this, but perhaps you have.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 18:32   ` Andi Kleen
@ 2011-07-01 18:40     ` Dave Jones
  2011-07-02 15:13     ` Christoph Fritz
  1 sibling, 0 replies; 63+ messages in thread
From: Dave Jones @ 2011-07-01 18:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw

On Fri, Jul 01, 2011 at 08:32:27PM +0200, Andi Kleen wrote:
 > On Fri, Jul 01, 2011 at 02:14:06PM -0400, Dave Jones wrote:
 > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
 > > 
 > >  > I found I can reliably crash a 3.0 system by pulling the
 > >  > USB cable of a mounted USB cdrom (or rather a USB device which
 > >  > has a builtin fake CD-ROM) 
 > >  > 
 > >  > I suspect it's a regression too.
 > > 
 > > We've been seeing a lot of similar bugs in Fedora since we pushed
 > > a 2.6.38.8 update.  Some of the traces are different, but some
 > 
 > I've also seen one with an earlier kernel where the NULL 
 > reference was in elv_.* something.  Don't have a written down backtrace
 > for this, but perhaps you have.

We had a bazillion reports of elv_may_queue oopses that looked like
https://bugzilla.redhat.com/show_bug.cgi?id=elv_may_queue

That should be fixed in .39 and newer.

	Dave


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 17:05 Linux 3.0 oopses when pulling a USB CDROM Andi Kleen
  2011-07-01 18:14 ` Dave Jones
@ 2011-07-01 19:20 ` James Bottomley
  2011-07-01 19:33   ` James Bottomley
  1 sibling, 1 reply; 63+ messages in thread
From: James Bottomley @ 2011-07-01 19:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw

On Fri, 2011-07-01 at 10:05 -0700, Andi Kleen wrote:
> Hi,
> 
> I found I can reliably crash a 3.0 system by pulling the
> USB cable of a mounted USB cdrom (or rather a USB device which
> has a builtin fake CD-ROM) 
> 
> I suspect it's a regression too.
> 
> It ends with a NULL pointer reference on a NULL sdev in 
> scsi_prep_state_check. 
> 
> Here's a somewhat incomplete backtrace (written down by hand)
> 
> scsi_prep_state_check
> scsi_setup_blk_pc_cmnd
> blk_peek_request
> ...
> scsi_request_fn
> ...
> ioctl_internal_command
> ...
> scsi_set_medium_removal
> sr_lock_door
> cdrom_release
> ...
> umount
> 
> I tried adding a 
> 
> 	if (!sdev) 
> 		return BLKPREP_KILL;
> 
> to scsi_prep_state_check, but that caused a RCU CPU stall 
> and a generally unhappy system instead.

Right, that wouldn't work.  The sdev in question comes from
request_queue->queuedata.  That only goes to null when the last
reference to the sdev has been released.  So the root cause is something
in sd holding a reference to sdev without actually getting an additional
refcount.

> The sdev must be still there in scsi_set_medium_removal because it's 
> referenced, so it must get lost somewhere in SCSI or in the block layer.
> 
> Any ideas how to fix this?

I'll see if I can find the refcounting problem.

Likely it's a longstanding bug which we didn't actually notice until
now.

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 19:20 ` James Bottomley
@ 2011-07-01 19:33   ` James Bottomley
  2011-07-01 19:45     ` James Bottomley
  0 siblings, 1 reply; 63+ messages in thread
From: James Bottomley @ 2011-07-01 19:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw

On Fri, 2011-07-01 at 14:20 -0500, James Bottomley wrote:
> I'll see if I can find the refcounting problem.
> 
> Likely it's a longstanding bug which we didn't actually notice until
> now.

OK, so it looks like we have correct refcounting on the sr_block ops,
but not on the actual cdrom ops, so the device is already gone by the
time the cdrom release function gets called.

This is my best guess at the fix (just add a get into the cdrom ops),
does it work?

James

---

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 4778e27..ee75983 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -575,7 +575,10 @@ static int sr_open(struct cdrom_device_info *cdi, int purpose)
 {
 	struct scsi_cd *cd = cdi->handle;
 	struct scsi_device *sdev = cd->device;
-	int retval;
+	int retval = scsi_device_get(sdev);
+
+	if (!retval)
+		return retval;
 
 	/*
 	 * If the device is in error recovery, wait until it is done.
@@ -588,6 +591,7 @@ static int sr_open(struct cdrom_device_info *cdi, int purpose)
 	return 0;
 
 error_out:
+	scsi_device_put(sdev);
 	return retval;	
 }
 
@@ -598,6 +602,7 @@ static void sr_release(struct cdrom_device_info *cdi)
 	if (cd->device->sector_size > 2048)
 		sr_set_blocklength(cd, 2048);
 
+	scsi_device_put(cd->device);
 }
 
 static int sr_probe(struct device *dev)



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 19:33   ` James Bottomley
@ 2011-07-01 19:45     ` James Bottomley
  0 siblings, 0 replies; 63+ messages in thread
From: James Bottomley @ 2011-07-01 19:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw

On Fri, 2011-07-01 at 14:33 -0500, James Bottomley wrote:
> On Fri, 2011-07-01 at 14:20 -0500, James Bottomley wrote:
> > I'll see if I can find the refcounting problem.
> > 
> > Likely it's a longstanding bug which we didn't actually notice until
> > now.
> 
> OK, so it looks like we have correct refcounting on the sr_block ops,
> but not on the actual cdrom ops, so the device is already gone by the
> time the cdrom release function gets called.
> 
> This is my best guess at the fix (just add a get into the cdrom ops),
> does it work?

Actually, forget that.  The cdrom ops open/release is entirely
subordinate to the block one.

unless we have some type of double put, the sdev should be held by
refcount right up until the cdrom code calls cdo->release().  However,
there is one ioctl operation after that, but it's tray_move() not
lock_door().  This should be the fix for that, but if we're oopsing in
lock_door(), it's almost certain we've run into some refcounting
imbalance.

James

---

diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 75fb965..512bd31 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -1220,12 +1220,12 @@ void cdrom_release(struct cdrom_device_info *cdi, fmode_t mode)
 	if (CDROM_CAN(CDC_RAM) && !cdi->use_count && cdi->for_data)
 		cdrom_close_write(cdi);
 
-	cdo->release(cdi);
 	if (cdi->use_count == 0) {      /* last process that closes dev*/
 		if (opened_for_data &&
 		    cdi->options & CDO_AUTO_EJECT && CDROM_CAN(CDC_OPEN_TRAY))
 			cdo->tray_move(cdi, 1);
 	}
+	cdo->release(cdi);
 }
 
 static int cdrom_read_mech_status(struct cdrom_device_info *cdi, 



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 18:14 ` Dave Jones
  2011-07-01 18:32   ` Andi Kleen
@ 2011-07-01 20:29   ` James Bottomley
  2011-07-01 20:43       ` Alan Stern
  2011-07-01 21:04       ` Alan Stern
  1 sibling, 2 replies; 63+ messages in thread
From: James Bottomley @ 2011-07-01 20:29 UTC (permalink / raw)
  To: Dave Jones; +Cc: Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> 
>  > I found I can reliably crash a 3.0 system by pulling the
>  > USB cable of a mounted USB cdrom (or rather a USB device which
>  > has a builtin fake CD-ROM) 
>  > 
>  > I suspect it's a regression too.
> 
> We've been seeing a lot of similar bugs in Fedora since we pushed
> a 2.6.38.8 update.  Some of the traces are different, but some
> look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> 
> The common cause seems to be 'device went away'. So USB CD drives,
> USB memory sticks, and for some reason virtualbox shutdown.

I think it's something specific in the USB path.  I can't reproduce on
3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH] USB: fix regression occurring during device removal
  2011-07-01 20:29   ` James Bottomley
@ 2011-07-01 20:43       ` Alan Stern
  2011-07-01 21:04       ` Alan Stern
  1 sibling, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-01 20:43 UTC (permalink / raw)
  To: Greg KH
  Cc: Sarah Sharp, James Bottomley, Dave Jones, Andi Kleen,
	SCSI development list, Kernel development list, axboe,
	Rafael J. Wysocki, USB list

This patch (as1476) fixes a regression introduced by
fccf4e86200b8f5edd9a65da26f150e32ba79808 (USB: Free bandwidth when
usb_disable_device is called).  usb_disconnect() grabs the
bandwidth_mutex before calling usb_disable_device(), which calls down
indirectly to usb_set_interface(), which tries to acquire the
bandwidth_mutex.

The fix causes usb_set_interface() to return early when it is called
for an interface that has already been unregistered, which is what
happens in usb_disable_device().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
CC: <stable@kernel.org>

---

On Fri, 1 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > 
> >  > I found I can reliably crash a 3.0 system by pulling the
> >  > USB cable of a mounted USB cdrom (or rather a USB device which
> >  > has a builtin fake CD-ROM) 
> >  > 
> >  > I suspect it's a regression too.
> > 
> > We've been seeing a lot of similar bugs in Fedora since we pushed
> > a 2.6.38.8 update.  Some of the traces are different, but some
> > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > 
> > The common cause seems to be 'device went away'. So USB CD drives,
> > USB memory sticks, and for some reason virtualbox shutdown.
> 
> I think it's something specific in the USB path.  I can't reproduce on
> 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.
> 
> James

This should fix the problem.  Or at least, it does fix a very similar 
problem.

Alan Stern



 drivers/usb/core/message.c |    2 ++
 1 file changed, 2 insertions(+)

Index: usb-3.0/drivers/usb/core/message.c
===================================================================
--- usb-3.0.orig/drivers/usb/core/message.c
+++ usb-3.0/drivers/usb/core/message.c
@@ -1286,6 +1286,8 @@ int usb_set_interface(struct usb_device 
 			interface);
 		return -EINVAL;
 	}
+	if (iface->unregistering)
+		return -ENODEV;
 
 	alt = usb_altnum_to_altsetting(iface, alternate);
 	if (!alt) {


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH] USB: fix regression occurring during device removal
@ 2011-07-01 20:43       ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-01 20:43 UTC (permalink / raw)
  To: Greg KH
  Cc: Sarah Sharp, James Bottomley, Dave Jones, Andi Kleen,
	SCSI development list, Kernel development list, axboe,
	Rafael J. Wysocki, USB list

This patch (as1476) fixes a regression introduced by
fccf4e86200b8f5edd9a65da26f150e32ba79808 (USB: Free bandwidth when
usb_disable_device is called).  usb_disconnect() grabs the
bandwidth_mutex before calling usb_disable_device(), which calls down
indirectly to usb_set_interface(), which tries to acquire the
bandwidth_mutex.

The fix causes usb_set_interface() to return early when it is called
for an interface that has already been unregistered, which is what
happens in usb_disable_device().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
CC: <stable@kernel.org>

---

On Fri, 1 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > 
> >  > I found I can reliably crash a 3.0 system by pulling the
> >  > USB cable of a mounted USB cdrom (or rather a USB device which
> >  > has a builtin fake CD-ROM) 
> >  > 
> >  > I suspect it's a regression too.
> > 
> > We've been seeing a lot of similar bugs in Fedora since we pushed
> > a 2.6.38.8 update.  Some of the traces are different, but some
> > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > 
> > The common cause seems to be 'device went away'. So USB CD drives,
> > USB memory sticks, and for some reason virtualbox shutdown.
> 
> I think it's something specific in the USB path.  I can't reproduce on
> 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.
> 
> James

This should fix the problem.  Or at least, it does fix a very similar 
problem.

Alan Stern



 drivers/usb/core/message.c |    2 ++
 1 file changed, 2 insertions(+)

Index: usb-3.0/drivers/usb/core/message.c
===================================================================
--- usb-3.0.orig/drivers/usb/core/message.c
+++ usb-3.0/drivers/usb/core/message.c
@@ -1286,6 +1286,8 @@ int usb_set_interface(struct usb_device 
 			interface);
 		return -EINVAL;
 	}
+	if (iface->unregistering)
+		return -ENODEV;
 
 	alt = usb_altnum_to_altsetting(iface, alternate);
 	if (!alt) {

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] USB: fix regression occurring during device removal
  2011-07-01 20:43       ` Alan Stern
  (?)
@ 2011-07-01 21:04       ` Andi Kleen
  -1 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-01 21:04 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg KH, Sarah Sharp, James Bottomley, Dave Jones, Andi Kleen,
	SCSI development list, Kernel development list, axboe,
	Rafael J. Wysocki, USB list

> This should fix the problem.  Or at least, it does fix a very similar 
> problem.

Thanks. I tried it and it doesn't fix the problem unfortunately.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 20:29   ` James Bottomley
@ 2011-07-01 21:04       ` Alan Stern
  2011-07-01 21:04       ` Alan Stern
  1 sibling, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-01 21:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Fri, 1 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > 
> >  > I found I can reliably crash a 3.0 system by pulling the
> >  > USB cable of a mounted USB cdrom (or rather a USB device which
> >  > has a builtin fake CD-ROM) 
> >  > 
> >  > I suspect it's a regression too.
> > 
> > We've been seeing a lot of similar bugs in Fedora since we pushed
> > a 2.6.38.8 update.  Some of the traces are different, but some
> > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > 
> > The common cause seems to be 'device went away'. So USB CD drives,
> > USB memory sticks, and for some reason virtualbox shutdown.
> 
> I think it's something specific in the USB path.  I can't reproduce on
> 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.

I just took a look at the Red Hat bugzilla entry mentioned above.  It 
seems to be quite different from the issue addressed by the patch I 
just posted -- a crash with invalid memory access rather than a lockdep 
violation and hang of the khubd thread.

It's also notable that the stack dump in the bugzilla report doesn't 
contain any functions in the USB subsystem.  Of course this doesn't 
prove anything, but it is suggestive.

Evidently the sdev argument to scsi_prep_state_check() was NULL.  This
looks like the problem that cropped up before, where q->queuedata was
getting set to NULL while the queue was still in use.  I can't imagine
how anything in usb-storage could have caused that.

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-01 21:04       ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-01 21:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Fri, 1 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > 
> >  > I found I can reliably crash a 3.0 system by pulling the
> >  > USB cable of a mounted USB cdrom (or rather a USB device which
> >  > has a builtin fake CD-ROM) 
> >  > 
> >  > I suspect it's a regression too.
> > 
> > We've been seeing a lot of similar bugs in Fedora since we pushed
> > a 2.6.38.8 update.  Some of the traces are different, but some
> > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > 
> > The common cause seems to be 'device went away'. So USB CD drives,
> > USB memory sticks, and for some reason virtualbox shutdown.
> 
> I think it's something specific in the USB path.  I can't reproduce on
> 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.

I just took a look at the Red Hat bugzilla entry mentioned above.  It 
seems to be quite different from the issue addressed by the patch I 
just posted -- a crash with invalid memory access rather than a lockdep 
violation and hang of the khubd thread.

It's also notable that the stack dump in the bugzilla report doesn't 
contain any functions in the USB subsystem.  Of course this doesn't 
prove anything, but it is suggestive.

Evidently the sdev argument to scsi_prep_state_check() was NULL.  This
looks like the problem that cropped up before, where q->queuedata was
getting set to NULL while the queue was still in use.  I can't imagine
how anything in usb-storage could have caused that.

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 21:04       ` Alan Stern
  (?)
@ 2011-07-01 21:13       ` James Bottomley
  2011-07-02  2:03           ` Alan Stern
  -1 siblings, 1 reply; 63+ messages in thread
From: James Bottomley @ 2011-07-01 21:13 UTC (permalink / raw)
  To: Alan Stern
  Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Fri, 2011-07-01 at 17:04 -0400, Alan Stern wrote:
> On Fri, 1 Jul 2011, James Bottomley wrote:
> 
> > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > > 
> > >  > I found I can reliably crash a 3.0 system by pulling the
> > >  > USB cable of a mounted USB cdrom (or rather a USB device which
> > >  > has a builtin fake CD-ROM) 
> > >  > 
> > >  > I suspect it's a regression too.
> > > 
> > > We've been seeing a lot of similar bugs in Fedora since we pushed
> > > a 2.6.38.8 update.  Some of the traces are different, but some
> > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > > 
> > > The common cause seems to be 'device went away'. So USB CD drives,
> > > USB memory sticks, and for some reason virtualbox shutdown.
> > 
> > I think it's something specific in the USB path.  I can't reproduce on
> > 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.
> 
> I just took a look at the Red Hat bugzilla entry mentioned above.  It 
> seems to be quite different from the issue addressed by the patch I 
> just posted -- a crash with invalid memory access rather than a lockdep 
> violation and hang of the khubd thread.
> 
> It's also notable that the stack dump in the bugzilla report doesn't 
> contain any functions in the USB subsystem.  Of course this doesn't 
> prove anything, but it is suggestive.
> 
> Evidently the sdev argument to scsi_prep_state_check() was NULL.  This
> looks like the problem that cropped up before, where q->queuedata was
> getting set to NULL while the queue was still in use.  I can't imagine
> how anything in usb-storage could have caused that.

Right ... the device release function has already been called, so it's
some type of refcounting cockup.  The fact that I can't reproduce with a
SATA unplug is what makes me think it might be USB specific ... of
course, that's not definitive.

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 21:13       ` James Bottomley
@ 2011-07-02  2:03           ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02  2:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Fri, 1 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-01 at 17:04 -0400, Alan Stern wrote:
> > On Fri, 1 Jul 2011, James Bottomley wrote:
> > 
> > > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > > > 
> > > >  > I found I can reliably crash a 3.0 system by pulling the
> > > >  > USB cable of a mounted USB cdrom (or rather a USB device which
> > > >  > has a builtin fake CD-ROM) 
> > > >  > 
> > > >  > I suspect it's a regression too.
> > > > 
> > > > We've been seeing a lot of similar bugs in Fedora since we pushed
> > > > a 2.6.38.8 update.  Some of the traces are different, but some
> > > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > > > 
> > > > The common cause seems to be 'device went away'. So USB CD drives,
> > > > USB memory sticks, and for some reason virtualbox shutdown.
> > > 
> > > I think it's something specific in the USB path.  I can't reproduce on
> > > 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.
> > 
> > I just took a look at the Red Hat bugzilla entry mentioned above.  It 
> > seems to be quite different from the issue addressed by the patch I 
> > just posted -- a crash with invalid memory access rather than a lockdep 
> > violation and hang of the khubd thread.
> > 
> > It's also notable that the stack dump in the bugzilla report doesn't 
> > contain any functions in the USB subsystem.  Of course this doesn't 
> > prove anything, but it is suggestive.
> > 
> > Evidently the sdev argument to scsi_prep_state_check() was NULL.  This
> > looks like the problem that cropped up before, where q->queuedata was
> > getting set to NULL while the queue was still in use.  I can't imagine
> > how anything in usb-storage could have caused that.
> 
> Right ... the device release function has already been called, so it's
> some type of refcounting cockup.  The fact that I can't reproduce with a
> SATA unplug is what makes me think it might be USB specific ... of
> course, that's not definitive.

I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
give the exact sequence of steps you went through to trigger the bug?

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02  2:03           ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02  2:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dave Jones, Andi Kleen, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0,
	linux-usb-u79uwXL29TY76Z2rM5mHXA

On Fri, 1 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-01 at 17:04 -0400, Alan Stern wrote:
> > On Fri, 1 Jul 2011, James Bottomley wrote:
> > 
> > > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote:
> > > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > > > 
> > > >  > I found I can reliably crash a 3.0 system by pulling the
> > > >  > USB cable of a mounted USB cdrom (or rather a USB device which
> > > >  > has a builtin fake CD-ROM) 
> > > >  > 
> > > >  > I suspect it's a regression too.
> > > > 
> > > > We've been seeing a lot of similar bugs in Fedora since we pushed
> > > > a 2.6.38.8 update.  Some of the traces are different, but some
> > > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830)
> > > > 
> > > > The common cause seems to be 'device went away'. So USB CD drives,
> > > > USB memory sticks, and for some reason virtualbox shutdown.
> > > 
> > > I think it's something specific in the USB path.  I can't reproduce on
> > > 3.0-rc5 with a SATA DVD hot unplug.  USB cc's added.
> > 
> > I just took a look at the Red Hat bugzilla entry mentioned above.  It 
> > seems to be quite different from the issue addressed by the patch I 
> > just posted -- a crash with invalid memory access rather than a lockdep 
> > violation and hang of the khubd thread.
> > 
> > It's also notable that the stack dump in the bugzilla report doesn't 
> > contain any functions in the USB subsystem.  Of course this doesn't 
> > prove anything, but it is suggestive.
> > 
> > Evidently the sdev argument to scsi_prep_state_check() was NULL.  This
> > looks like the problem that cropped up before, where q->queuedata was
> > getting set to NULL while the queue was still in use.  I can't imagine
> > how anything in usb-storage could have caused that.
> 
> Right ... the device release function has already been called, so it's
> some type of refcounting cockup.  The fact that I can't reproduce with a
> SATA unplug is what makes me think it might be USB specific ... of
> course, that's not definitive.

I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
give the exact sequence of steps you went through to trigger the bug?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02  2:03           ` Alan Stern
  (?)
@ 2011-07-02  6:08           ` Andi Kleen
  2011-07-02 12:24             ` James Bottomley
                               ` (2 more replies)
  -1 siblings, 3 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-02  6:08 UTC (permalink / raw)
  To: Alan Stern
  Cc: James Bottomley, Dave Jones, Andi Kleen, linux-scsi,
	linux-kernel, axboe, rjw, linux-usb

> I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> give the exact sequence of steps you went through to trigger the bug?

Connect USB storage device with builtin fake CD rom. Wait for udisk
to mount it. Pull cable. udisk does umount. Oops.

I also got a log of the refcounting now if you want it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02  6:08           ` Andi Kleen
@ 2011-07-02 12:24             ` James Bottomley
  2011-07-02 17:05               ` Andi Kleen
  2011-07-02 17:37                 ` Alan Stern
  2011-07-02 12:38               ` Alan Stern
  2011-07-02 12:48               ` Rafael J. Wysocki
  2 siblings, 2 replies; 63+ messages in thread
From: James Bottomley @ 2011-07-02 12:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Stern, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Sat, 2011-07-02 at 08:08 +0200, Andi Kleen wrote:
> > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > give the exact sequence of steps you went through to trigger the bug?
> 
> Connect USB storage device with builtin fake CD rom. Wait for udisk
> to mount it. Pull cable. udisk does umount. Oops.
> 
> I also got a log of the refcounting now if you want it.

So I've got the log, but this is the relevant section:

---
usb 2-1.5: USB disconnect, device number 4
sr 5:0:0:1: scsi put_device 13 from device_del+0x177/0x1c0
sr 5:0:0:1: scsi put_device 12 from bsg_kref_release_function+0x28/0x30
sr 5:0:0:1: scsi put_device 10 from device_del+0x177/0x1c0
sr 5:0:0:1: scsi put_device 8 from device_del+0x177/0x1c0
sr 5:0:0:1: scsi put_device 7 from scsi_device_cls_release+0x15/0x20
sr 5:0:0:1: scsi put_device 6 from klist_children_put+0x12/0x20
sr 5:0:0:1: scsi put_device 5 from klist_devices_put+0x12/0x20
sr 5:0:0:1: scsi put_device 3 from device_del+0x177/0x1c0
scsi: killing requests for dead queue
BUG: sleeping function called from invalid context
at /home/ak/lsrc/git/linux-2.6/arch/x86/mm/fault.c:1103
in_atomic(): 0, irqs_disabled(): 1, pid: 2527, name: umount
Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8
Call Trace:
 [<ffffffff8103af8c>] __might_sleep+0xcc/0xf0
 [<ffffffff8155af42>] do_page_fault+0x142/0x4c0
 [<ffffffffa01d5385>] ? write_msg+0x105/0x120 [netconsole]
 [<ffffffff810514f7>] ? __call_console_drivers+0x97/0xb0
 [<ffffffff81079692>] ? up+0x32/0x50
 [<ffffffff81557f5f>] page_fault+0x1f/0x30
 [<ffffffff81389a70>] ? scsi_setup_blk_pc_cmnd+0x170/0x170
 [<ffffffff81388e19>] ? scsi_prep_state_check+0x9/0x90
 [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170
 [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60
 [<ffffffff812847ad>] blk_peek_request+0xbd/0x230
 [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470
 [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20
 [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0
 [<ffffffff81288676>] blk_execute_rq+0x86/0xf0
 [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0
 [<ffffffff81389c6c>] scsi_execute+0xfc/0x160
 [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140
 [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0
 [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30
 [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0
 [<ffffffff8139b390>] sr_lock_door+0x20/0x30
 [<ffffffff813c4d63>] cdrom_release+0xa3/0x260
 [<ffffffff8118157e>] ? invalidate_bh_lru+0x2e/0x50
 [<ffffffff81181550>] ? buffer_cpu_notify+0xa0/0xa0
 [<ffffffff8139a088>] sr_block_release+0x38/0x60
 [<ffffffff8118833c>] __blkdev_put+0x16c/0x1b0
 [<ffffffff811883b2>] blkdev_put+0x32/0x130
 [<ffffffff8115650e>] kill_block_super+0x4e/0x80
 [<ffffffff81156865>] deactivate_locked_super+0x45/0x70
 [<ffffffff8115724a>] deactivate_super+0x4a/0x70
 [<ffffffff81171d54>] mntput_no_expire+0xc4/0x110
 [<ffffffff81172a2c>] sys_umount+0x6c/0x360
 [<ffffffff8155f52b>] system_call_fastpath+0x16/0x1b
BUG: unable to handle kernel NULL pointer dereference at
0000000000000650
IP: [<ffffffff81388e19>] scsi_prep_state_check+0x9/0x90
PGD 0 
Oops: 0000 [#1] SMP 
CPU 2 
Modules linked in: nls_utf8 udf ses enclosure netconsole configfs fuse
sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 kvm_intel kvm
uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_seq snd_seq_device
snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore
iTCO_wdt snd_page_alloc iTCO_vendor_support joydev i7core_edac edac_core
broadcom tg3 e1000 dcdbas microcode serio_raw pcspkr i2c_i801
firewire_ohci firewire_core crc_itu_t usb_storage radeon ttm
drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8 Dell Inc. Studio XPS
8000/0X231R
RIP: 0010:[<ffffffff81388e19>]  [<ffffffff81388e19>]
scsi_prep_state_check+0x9/0x90
RSP: 0018:ffff88021b3859c8  EFLAGS: 00010086
RAX: ffffffff81389a70 RBX: ffff88022d2c85a0 RCX: 0000000000001fa7
RDX: 0000000000000001 RSI: ffff88022d2c85a0 RDI: 0000000000000000
RBP: ffff88021b3859c8 R08: 0000000000000004 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88022e0a9428
R13: ffff88022d2c85a0 R14: 0000000000000000 R15: ffff88022e404d20
FS:  00007f2d10454760(0000) GS:ffff88023fc80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000650 CR3: 0000000214443000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount (pid: 2527, threadinfo ffff88021b384000, task
ffff88022e02dc80)
Stack:
 ffff88021b3859f8 ffffffff8138992b ffff88022d2c85a0 ffff88022e0a9428
 ffff88022d2c85a0 ffff88021b385cf8 ffff88021b385a18 ffffffff81389abd
 ffff88022d2c85a0 ffff88022e0a9428 ffff88021b385a48 ffffffff812847ad
Call Trace:
 [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170
 [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60
 [<ffffffff812847ad>] blk_peek_request+0xbd/0x230
 [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470
 [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20
 [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0
 [<ffffffff81288676>] blk_execute_rq+0x86/0xf0
 [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0
 [<ffffffff81389c6c>] scsi_execute+0xfc/0x160
 [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140
 [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0
 [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30
 [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0
 [<ffffffff813c4d63>] cdrom_release+0xa3/0x260
 [<ffffffff8118157e>] ? invalidate_bh_lru+0x2e/0x50
 [<ffffffff81181550>] ? buffer_cpu_notify+0xa0/0xa0
 [<ffffffff8139a088>] sr_block_release+0x38/0x60
 [<ffffffff8118833c>] __blkdev_put+0x16c/0x1b0
 [<ffffffff811883b2>] blkdev_put+0x32/0x130
 [<ffffffff8115650e>] kill_block_super+0x4e/0x80
 [<ffffffff81156865>] deactivate_locked_super+0x45/0x70
 [<ffffffff8115724a>] deactivate_super+0x4a/0x70
 [<ffffffff81171d54>] mntput_no_expire+0xc4/0x110
 [<ffffffff81172a2c>] sys_umount+0x6c/0x360
 [<ffffffff8155f52b>] system_call_fastpath+0x16/0x1b
Code: 7b 58 e8 4b ea 1c 00 48 8b 4d a8 48 89 45 b8 48 89 cf e8 7b 88 ff
ff eb a1 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 <8b> 87
50 06 00 00 83 f8 02 75 04 31 c0 c9 c3 83 e8 04 83 f8 04 
RIP  [<ffffffff81388e19>] scsi_prep_state_check+0x9/0x90
 RSP <ffff88021b3859c8>
CR2: 0000000000000650
---[ end trace 06d5981e67b7b7c9 ]---
---

Which goes from the device unplug to the oops.  However, there are puts
missing from this; particularly the one where the reference goes to
zero.

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02  6:08           ` Andi Kleen
@ 2011-07-02 12:38               ` Alan Stern
  2011-07-02 12:38               ` Alan Stern
  2011-07-02 12:48               ` Rafael J. Wysocki
  2 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 12:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sat, 2 Jul 2011, Andi Kleen wrote:

> > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > give the exact sequence of steps you went through to trigger the bug?
> 
> Connect USB storage device with builtin fake CD rom. Wait for udisk
> to mount it. Pull cable. udisk does umount. Oops.
> 
> I also got a log of the refcounting now if you want it.

Yes, please.  The excerpt that James posted isn't enough.  In 
particular, it doesn't show what happens during device creation.

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 12:38               ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 12:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sat, 2 Jul 2011, Andi Kleen wrote:

> > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > give the exact sequence of steps you went through to trigger the bug?
> 
> Connect USB storage device with builtin fake CD rom. Wait for udisk
> to mount it. Pull cable. udisk does umount. Oops.
> 
> I also got a log of the refcounting now if you want it.

Yes, please.  The excerpt that James posted isn't enough.  In 
particular, it doesn't show what happens during device creation.

Alan Stern

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 12:48               ` Rafael J. Wysocki
  0 siblings, 0 replies; 63+ messages in thread
From: Rafael J. Wysocki @ 2011-07-02 12:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Stern, James Bottomley, Dave Jones, linux-scsi,
	linux-kernel, axboe, linux-usb

On Saturday, July 02, 2011, Andi Kleen wrote:
> > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > give the exact sequence of steps you went through to trigger the bug?
> 
> Connect USB storage device with builtin fake CD rom. Wait for udisk
> to mount it. Pull cable. udisk does umount. Oops.
> 
> I also got a log of the refcounting now if you want it.

Does reverting commit e1866b33b1e89f077b7132daae3dfd9a594e9a1a help?

Rafael

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 12:48               ` Rafael J. Wysocki
  0 siblings, 0 replies; 63+ messages in thread
From: Rafael J. Wysocki @ 2011-07-02 12:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Stern, James Bottomley, Dave Jones,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, linux-usb-u79uwXL29TY76Z2rM5mHXA

On Saturday, July 02, 2011, Andi Kleen wrote:
> > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > give the exact sequence of steps you went through to trigger the bug?
> 
> Connect USB storage device with builtin fake CD rom. Wait for udisk
> to mount it. Pull cable. udisk does umount. Oops.
> 
> I also got a log of the refcounting now if you want it.

Does reverting commit e1866b33b1e89f077b7132daae3dfd9a594e9a1a help?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-01 18:32   ` Andi Kleen
  2011-07-01 18:40     ` Dave Jones
@ 2011-07-02 15:13     ` Christoph Fritz
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Fritz @ 2011-07-02 15:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Jones, linux-scsi, linux-kernel, axboe, rjw

On Fri, Jul 01, 2011 at 08:32:27PM +0200, Andi Kleen wrote:
> On Fri, Jul 01, 2011 at 02:14:06PM -0400, Dave Jones wrote:
> > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote:
> > 
> >  > I found I can reliably crash a 3.0 system by pulling the
> >  > USB cable of a mounted USB cdrom (or rather a USB device which
> >  > has a builtin fake CD-ROM) 
> >  > 
> >  > I suspect it's a regression too.
> > 
> > We've been seeing a lot of similar bugs in Fedora since we pushed
> > a 2.6.38.8 update.  Some of the traces are different, but some
> 
> I've also seen one with an earlier kernel where the NULL 
> reference was in elv_.* something.  Don't have a written down backtrace
> for this, but perhaps you have.

Do you mean https://bugzilla.kernel.org/show_bug.cgi?id=38412 ?

[..]
Process eject (pid: 5846, ti=f5b9a000 task=f5b19a80 task.ti=f5b9a000)
Stack:
c1198600 c119ac69 00000256 00000000 00000002 c119b2f9 f4eed270 c11a3237
00000000 f5297200 f637e418 00005309 0000005d c11a3378 0000005d 00000000
f5c1b700 00000000 f5c02e00 00000000 f4a2e1f4 f59c0700 c10d1bc3 f53ed001
Call Trace:
[<c1198600>] ? elv_put_request+0x10/0x20
[<c119ac69>] ? __blk_put_request+0x99/0xc0
[<c119b2f9>] ? blk_put_request+0x19/0x40
[<c11a3237>] ? T.593+0x77/0x90
[<c11a3378>] ? scsi_cmd_ioctl+0x128/0x440
[<c10d1bc3>] ? blkdev_get+0x43/0x380
[<c10d1000>] ? bd_acquire+0x20/0xd0
[<c10a2a6d>] ? __dentry_open+0x1cd/0x2a0
[<c10a2c30>] ? nameidata_to_filp+0x70/0x80
[<c10d1f00>] ? blkdev_get+0x380/0x380
[<c125e582>] ? sd_ioctl+0x82/0xc0
[<c125e500>] ? sd_check_events+0x140/0x140
[<c11a02d9>] ? __blkdev_driver_ioctl+0x29/0x40
[<c11a07d3>] ? blkdev_ioctl+0x213/0x7e0
[<c10b1854>] ? do_filp_open+0x34/0x90
[<c10d0715>] ? block_ioctl+0x35/0x50
[<c10d06e0>] ? blkdev_get_block+0x50/0x50
[<c10b2b92>] ? do_vfs_ioctl+0x92/0x610
[<c10a262e>] ? do_sys_open+0x16e/0x1c0
[<c10a262e>] ? do_sys_open+0x16e/0x1c0
[<c10b314d>] ? sys_ioctl+0x3d/0x70
[<c1394ad0>] ? sysenter_do_call+0x12/0x26
Code:  Bad EIP value.
EIP: [<2f327473>] 0x2f327473 SS:ESP 0068:f5b9bdb0
CR2: 000000002f327473
---[ end trace 4877fe842f054f84 ]---


Thanks,
 -- Christoph


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 12:24             ` James Bottomley
@ 2011-07-02 17:05               ` Andi Kleen
  2011-07-02 17:09                   ` James Bottomley
  2011-07-02 17:37                 ` Alan Stern
  1 sibling, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 17:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andi Kleen, Alan Stern, Dave Jones, linux-scsi, linux-kernel,
	axboe, rjw, linux-usb

> Which goes from the device unplug to the oops.  However, there are puts
> missing from this; particularly the one where the reference goes to
> zero.

The count was always one off because it dumps it before the actual
put.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 12:48               ` Rafael J. Wysocki
  (?)
@ 2011-07-02 17:06               ` Andi Kleen
  -1 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 17:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andi Kleen, Alan Stern, James Bottomley, Dave Jones, linux-scsi,
	linux-kernel, axboe, linux-usb

On Sat, Jul 02, 2011 at 02:48:22PM +0200, Rafael J. Wysocki wrote:
> On Saturday, July 02, 2011, Andi Kleen wrote:
> > > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > > give the exact sequence of steps you went through to trigger the bug?
> > 
> > Connect USB storage device with builtin fake CD rom. Wait for udisk
> > to mount it. Pull cable. udisk does umount. Oops.
> > 
> > I also got a log of the refcounting now if you want it.
> 
> Does reverting commit e1866b33b1e89f077b7132daae3dfd9a594e9a1a help?

I'll try next week.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 17:09                   ` James Bottomley
  0 siblings, 0 replies; 63+ messages in thread
From: James Bottomley @ 2011-07-02 17:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Stern, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb

On Sat, 2011-07-02 at 19:05 +0200, Andi Kleen wrote:
> > Which goes from the device unplug to the oops.  However, there are puts
> > missing from this; particularly the one where the reference goes to
> > zero.
> 
> The count was always one off because it dumps it before the actual
> put.

I know this.  What I mean is the last seen put count before the oops is
3 (not 1).

And the transition after unplug goes

13->12->10->8->7->6->5->3

implying we're missing put events.

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 17:09                   ` James Bottomley
  0 siblings, 0 replies; 63+ messages in thread
From: James Bottomley @ 2011-07-02 17:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alan Stern, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0,
	linux-usb-u79uwXL29TY76Z2rM5mHXA

On Sat, 2011-07-02 at 19:05 +0200, Andi Kleen wrote:
> > Which goes from the device unplug to the oops.  However, there are puts
> > missing from this; particularly the one where the reference goes to
> > zero.
> 
> The count was always one off because it dumps it before the actual
> put.

I know this.  What I mean is the last seen put count before the oops is
3 (not 1).

And the transition after unplug goes

13->12->10->8->7->6->5->3

implying we're missing put events.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 12:24             ` James Bottomley
@ 2011-07-02 17:37                 ` Alan Stern
  2011-07-02 17:37                 ` Alan Stern
  1 sibling, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 17:37 UTC (permalink / raw)
  To: axboe, James Bottomley
  Cc: Andi Kleen, Dave Jones, SCSI development list,
	Kernel development list, Rafael J. Wysocki, USB list

Okay, I found the source of the problem.  Or more accurately, I found
two separate bugs.

The first bug is triggered in scsi_request_fn().  At the start of that
routine we have:

	struct scsi_device *sdev = q->queuedata;
	...

	if (!sdev) {
		printk("scsi: killing requests for dead queue\n");
		while ((req = blk_peek_request(q)) != NULL)
			scsi_kill_request(req, q);
		return;
	}

The problem is that blk_peek_request() calls scsi_prep_fn(), which 
does this:

	struct scsi_device *sdev = q->queuedata;
	int ret = BLKPREP_KILL;

	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
		ret = scsi_setup_blk_pc_cmnd(sdev, req);
	return scsi_prep_return(q, req, ret);

It doesn't check to see if sdev is NULL, nor does 
scsi_setup_blk_pc_cmnd().  That accounts for this error:

On Sat, 2 Jul 2011, James Bottomley wrote:

> On Sat, 2011-07-02 at 08:08 +0200, Andi Kleen wrote:
> > > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > > give the exact sequence of steps you went through to trigger the bug?
> > 
> > Connect USB storage device with builtin fake CD rom. Wait for udisk
> > to mount it. Pull cable. udisk does umount. Oops.
> > 
> > I also got a log of the refcounting now if you want it.
> 
> So I've got the log, but this is the relevant section:
> 
> ---
> usb 2-1.5: USB disconnect, device number 4
> sr 5:0:0:1: scsi put_device 13 from device_del+0x177/0x1c0
> sr 5:0:0:1: scsi put_device 12 from bsg_kref_release_function+0x28/0x30
> sr 5:0:0:1: scsi put_device 10 from device_del+0x177/0x1c0
> sr 5:0:0:1: scsi put_device 8 from device_del+0x177/0x1c0
> sr 5:0:0:1: scsi put_device 7 from scsi_device_cls_release+0x15/0x20
> sr 5:0:0:1: scsi put_device 6 from klist_children_put+0x12/0x20
> sr 5:0:0:1: scsi put_device 5 from klist_devices_put+0x12/0x20
> sr 5:0:0:1: scsi put_device 3 from device_del+0x177/0x1c0
> scsi: killing requests for dead queue
> BUG: sleeping function called from invalid context
> at /home/ak/lsrc/git/linux-2.6/arch/x86/mm/fault.c:1103
> in_atomic(): 0, irqs_disabled(): 1, pid: 2527, name: umount
> Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8
> Call Trace:
>  [<ffffffff8103af8c>] __might_sleep+0xcc/0xf0
>  [<ffffffff8155af42>] do_page_fault+0x142/0x4c0
>  [<ffffffffa01d5385>] ? write_msg+0x105/0x120 [netconsole]
>  [<ffffffff810514f7>] ? __call_console_drivers+0x97/0xb0
>  [<ffffffff81079692>] ? up+0x32/0x50
>  [<ffffffff81557f5f>] page_fault+0x1f/0x30
>  [<ffffffff81389a70>] ? scsi_setup_blk_pc_cmnd+0x170/0x170
>  [<ffffffff81388e19>] ? scsi_prep_state_check+0x9/0x90
>  [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170
>  [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60
>  [<ffffffff812847ad>] blk_peek_request+0xbd/0x230
>  [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470
>  [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20
>  [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0
>  [<ffffffff81288676>] blk_execute_rq+0x86/0xf0
>  [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0
>  [<ffffffff81389c6c>] scsi_execute+0xfc/0x160
>  [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140
>  [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0
>  [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30
>  [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0
>  [<ffffffff8139b390>] sr_lock_door+0x20/0x30
>  [<ffffffff813c4d63>] cdrom_release+0xa3/0x260

An easy fix is to have scsi_prep_fn() check if sdev is NULL and return 
BLKPREP_KILL if it is.

The second bug, which hit me but apparently not any of you, is that the 
request_queue's elevator gets deallocated while it is still in use.  
That's because __scsi_remove_device() calls scsi_free_queue(), which 
does blk_cleanup_queue(), which calls elevator_exit(), even though the 
device file is still open and more requests will be submitted when the 
file is closed.

I'm not sure of the right fix for this.  One possibility is to move the 
scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
maybe the elevator_exit() call should be moved to blk_release_queue().

Also, I have no idea why this shows up with USB drives but not other 
SCSI transports.  A fluke of timing?

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 17:37                 ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 17:37 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw, James Bottomley
  Cc: Andi Kleen, Dave Jones, SCSI development list,
	Kernel development list, Rafael J. Wysocki, USB list

Okay, I found the source of the problem.  Or more accurately, I found
two separate bugs.

The first bug is triggered in scsi_request_fn().  At the start of that
routine we have:

	struct scsi_device *sdev = q->queuedata;
	...

	if (!sdev) {
		printk("scsi: killing requests for dead queue\n");
		while ((req = blk_peek_request(q)) != NULL)
			scsi_kill_request(req, q);
		return;
	}

The problem is that blk_peek_request() calls scsi_prep_fn(), which 
does this:

	struct scsi_device *sdev = q->queuedata;
	int ret = BLKPREP_KILL;

	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
		ret = scsi_setup_blk_pc_cmnd(sdev, req);
	return scsi_prep_return(q, req, ret);

It doesn't check to see if sdev is NULL, nor does 
scsi_setup_blk_pc_cmnd().  That accounts for this error:

On Sat, 2 Jul 2011, James Bottomley wrote:

> On Sat, 2011-07-02 at 08:08 +0200, Andi Kleen wrote:
> > > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > > give the exact sequence of steps you went through to trigger the bug?
> > 
> > Connect USB storage device with builtin fake CD rom. Wait for udisk
> > to mount it. Pull cable. udisk does umount. Oops.
> > 
> > I also got a log of the refcounting now if you want it.
> 
> So I've got the log, but this is the relevant section:
> 
> ---
> usb 2-1.5: USB disconnect, device number 4
> sr 5:0:0:1: scsi put_device 13 from device_del+0x177/0x1c0
> sr 5:0:0:1: scsi put_device 12 from bsg_kref_release_function+0x28/0x30
> sr 5:0:0:1: scsi put_device 10 from device_del+0x177/0x1c0
> sr 5:0:0:1: scsi put_device 8 from device_del+0x177/0x1c0
> sr 5:0:0:1: scsi put_device 7 from scsi_device_cls_release+0x15/0x20
> sr 5:0:0:1: scsi put_device 6 from klist_children_put+0x12/0x20
> sr 5:0:0:1: scsi put_device 5 from klist_devices_put+0x12/0x20
> sr 5:0:0:1: scsi put_device 3 from device_del+0x177/0x1c0
> scsi: killing requests for dead queue
> BUG: sleeping function called from invalid context
> at /home/ak/lsrc/git/linux-2.6/arch/x86/mm/fault.c:1103
> in_atomic(): 0, irqs_disabled(): 1, pid: 2527, name: umount
> Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8
> Call Trace:
>  [<ffffffff8103af8c>] __might_sleep+0xcc/0xf0
>  [<ffffffff8155af42>] do_page_fault+0x142/0x4c0
>  [<ffffffffa01d5385>] ? write_msg+0x105/0x120 [netconsole]
>  [<ffffffff810514f7>] ? __call_console_drivers+0x97/0xb0
>  [<ffffffff81079692>] ? up+0x32/0x50
>  [<ffffffff81557f5f>] page_fault+0x1f/0x30
>  [<ffffffff81389a70>] ? scsi_setup_blk_pc_cmnd+0x170/0x170
>  [<ffffffff81388e19>] ? scsi_prep_state_check+0x9/0x90
>  [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170
>  [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60
>  [<ffffffff812847ad>] blk_peek_request+0xbd/0x230
>  [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470
>  [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20
>  [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0
>  [<ffffffff81288676>] blk_execute_rq+0x86/0xf0
>  [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0
>  [<ffffffff81389c6c>] scsi_execute+0xfc/0x160
>  [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140
>  [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0
>  [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30
>  [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0
>  [<ffffffff8139b390>] sr_lock_door+0x20/0x30
>  [<ffffffff813c4d63>] cdrom_release+0xa3/0x260

An easy fix is to have scsi_prep_fn() check if sdev is NULL and return 
BLKPREP_KILL if it is.

The second bug, which hit me but apparently not any of you, is that the 
request_queue's elevator gets deallocated while it is still in use.  
That's because __scsi_remove_device() calls scsi_free_queue(), which 
does blk_cleanup_queue(), which calls elevator_exit(), even though the 
device file is still open and more requests will be submitted when the 
file is closed.

I'm not sure of the right fix for this.  One possibility is to move the 
scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
maybe the elevator_exit() call should be moved to blk_release_queue().

Also, I have no idea why this shows up with USB drives but not other 
SCSI transports.  A fluke of timing?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 18:10                 ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 18:10 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, James Bottomley, Dave Jones, linux-scsi,
	linux-kernel, axboe, rjw, linux-usb

On Sat, Jul 02, 2011 at 08:38:20AM -0400, Alan Stern wrote:
> On Sat, 2 Jul 2011, Andi Kleen wrote:
> 
> > > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > > give the exact sequence of steps you went through to trigger the bug?
> > 
> > Connect USB storage device with builtin fake CD rom. Wait for udisk
> > to mount it. Pull cable. udisk does umount. Oops.
> > 
> > I also got a log of the refcounting now if you want it.
> 
> Yes, please.  The excerpt that James posted isn't enough.  In 
> particular, it doesn't show what happens during device creation.

http://halobates.de/tmp/usb-pull-oops-ref

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 18:10                 ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 18:10 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, James Bottomley, Dave Jones,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0,
	linux-usb-u79uwXL29TY76Z2rM5mHXA

On Sat, Jul 02, 2011 at 08:38:20AM -0400, Alan Stern wrote:
> On Sat, 2 Jul 2011, Andi Kleen wrote:
> 
> > > I'm not able to reproduce it on a vanilla 3.0-rc5 system.  Can anybody
> > > give the exact sequence of steps you went through to trigger the bug?
> > 
> > Connect USB storage device with builtin fake CD rom. Wait for udisk
> > to mount it. Pull cable. udisk does umount. Oops.
> > 
> > I also got a log of the refcounting now if you want it.
> 
> Yes, please.  The excerpt that James posted isn't enough.  In 
> particular, it doesn't show what happens during device creation.

http://halobates.de/tmp/usb-pull-oops-ref

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 17:37                 ` Alan Stern
  (?)
@ 2011-07-02 18:11                 ` Andi Kleen
  2011-07-02 19:59                   ` Alan Stern
  -1 siblings, 1 reply; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 18:11 UTC (permalink / raw)
  To: Alan Stern
  Cc: axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

> The problem is that blk_peek_request() calls scsi_prep_fn(), which 
> does this:
> 
> 	struct scsi_device *sdev = q->queuedata;
> 	int ret = BLKPREP_KILL;
> 
> 	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> 		ret = scsi_setup_blk_pc_cmnd(sdev, req);
> 	return scsi_prep_return(q, req, ret);
> 
> It doesn't check to see if sdev is NULL, nor does 
> scsi_setup_blk_pc_cmnd().  That accounts for this error:

I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
but that just caused RCU CPU stalls afterwards and then eventually
a hung system.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 17:09                   ` James Bottomley
@ 2011-07-02 18:15                     ` Andi Kleen
  -1 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 18:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andi Kleen, Alan Stern, Dave Jones, linux-scsi, linux-kernel,
	axboe, rjw, linux-usb

> And the transition after unplug goes
> 
> 13->12->10->8->7->6->5->3
> 
> implying we're missing put events.

Don't know why. I used this check (with trace_scsi_dev containing the name)
in put_device:

if (trace_scsi && dev->type == &scsi_dev_type &&
     trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) {
	dev_printk ...

If you want instrumentation somewhere else please let me know.

Maybe there simply isn't a put_device to 0 and the NULL pointer happens
for some other reason?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 18:15                     ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-02 18:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andi Kleen, Alan Stern, Dave Jones,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0,
	linux-usb-u79uwXL29TY76Z2rM5mHXA

> And the transition after unplug goes
> 
> 13->12->10->8->7->6->5->3
> 
> implying we're missing put events.

Don't know why. I used this check (with trace_scsi_dev containing the name)
in put_device:

if (trace_scsi && dev->type == &scsi_dev_type &&
     trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) {
	dev_printk ...

If you want instrumentation somewhere else please let me know.

Maybe there simply isn't a put_device to 0 and the NULL pointer happens
for some other reason?

-Andi

-- 
ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 18:11                 ` Andi Kleen
@ 2011-07-02 19:59                   ` Alan Stern
  2011-07-03  1:17                     ` Andi Kleen
                                       ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 19:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: axboe, James Bottomley, Dave Jones, SCSI development list,
	Kernel development list, Rafael J. Wysocki, USB list

On Sat, 2 Jul 2011, Andi Kleen wrote:

> > The problem is that blk_peek_request() calls scsi_prep_fn(), which 
> > does this:
> > 
> > 	struct scsi_device *sdev = q->queuedata;
> > 	int ret = BLKPREP_KILL;
> > 
> > 	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> > 		ret = scsi_setup_blk_pc_cmnd(sdev, req);
> > 	return scsi_prep_return(q, req, ret);
> > 
> > It doesn't check to see if sdev is NULL, nor does 
> > scsi_setup_blk_pc_cmnd().  That accounts for this error:
> 
> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
> but that just caused RCU CPU stalls afterwards and then eventually
> a hung system.

The RCU problem is likely to be a separate issue.  It might even be a 
result of the use-after-free problem with the elevator.

At any rate, it's clear that the crash in the refcounting log you
posted occurred because scsi_setup_blk_pc_cmnd() called
scsi_prep_state_check(), which tried to dereference the NULL pointer.

Would you like to try this patch to see if it fixes the problem?  As I 
said before, I'm not certain it's the best thing to do, but it worked 
on my system.

Alan Stern




Index: usb-3.0/drivers/scsi/scsi_lib.c
===================================================================
--- usb-3.0.orig/drivers/scsi/scsi_lib.c
+++ usb-3.0/drivers/scsi/scsi_lib.c
@@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
 	struct scsi_device *sdev = q->queuedata;
 	int ret = BLKPREP_KILL;
 
+	if (!sdev)
+		return ret;
 	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
 		ret = scsi_setup_blk_pc_cmnd(sdev, req);
 	return scsi_prep_return(q, req, ret);
Index: usb-3.0/drivers/scsi/scsi_sysfs.c
===================================================================
--- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
+++ usb-3.0/drivers/scsi/scsi_sysfs.c
@@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
 		kfree(evt);
 	}
 
+	/* Freeing the queue signals to block that we're done */
+	scsi_free_queue(sdev->request_queue);
 	blk_put_queue(sdev->request_queue);
 	/* NULL queue means the device can't be used */
 	sdev->request_queue = NULL;
@@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
 	/* cause the request function to reject all I/O requests */
 	sdev->request_queue->queuedata = NULL;
 
-	/* Freeing the queue signals to block that we're done */
-	scsi_free_queue(sdev->request_queue);
 	put_device(dev);
 }
 



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 18:15                     ` Andi Kleen
@ 2011-07-02 20:05                       ` Alan Stern
  -1 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 20:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sat, 2 Jul 2011, Andi Kleen wrote:

> > And the transition after unplug goes
> > 
> > 13->12->10->8->7->6->5->3
> > 
> > implying we're missing put events.
> 
> Don't know why. I used this check (with trace_scsi_dev containing the name)
> in put_device:
> 
> if (trace_scsi && dev->type == &scsi_dev_type &&
>      trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) {
> 	dev_printk ...

These extra puts could be coming from deep inside the device-model
core, where they operate directly on the underlying kobject instead of
going through put_device().  For example, removal of sysfs symlinks
could have this effect.

> If you want instrumentation somewhere else please let me know.
> 
> Maybe there simply isn't a put_device to 0 and the NULL pointer happens
> for some other reason?

On my system, at least, the scsi_device's refcount dropped to 0 at the 
right time.  That wasn't the problem.  The NULL pointer occurs because 
the request_queue is used after the scsi_device has been removed from 
visibility; among other things, __scsi_remove_device() sets 
sdev->request_queue->queuedata to NULL.

As the comment says, this causes the request function to reject all I/O 
requests -- but not before trying to dereference the NULL pointer!

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-02 20:05                       ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-02 20:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sat, 2 Jul 2011, Andi Kleen wrote:

> > And the transition after unplug goes
> > 
> > 13->12->10->8->7->6->5->3
> > 
> > implying we're missing put events.
> 
> Don't know why. I used this check (with trace_scsi_dev containing the name)
> in put_device:
> 
> if (trace_scsi && dev->type == &scsi_dev_type &&
>      trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) {
> 	dev_printk ...

These extra puts could be coming from deep inside the device-model
core, where they operate directly on the underlying kobject instead of
going through put_device().  For example, removal of sysfs symlinks
could have this effect.

> If you want instrumentation somewhere else please let me know.
> 
> Maybe there simply isn't a put_device to 0 and the NULL pointer happens
> for some other reason?

On my system, at least, the scsi_device's refcount dropped to 0 at the 
right time.  That wasn't the problem.  The NULL pointer occurs because 
the request_queue is used after the scsi_device has been removed from 
visibility; among other things, __scsi_remove_device() sets 
sdev->request_queue->queuedata to NULL.

As the comment says, this causes the request function to reject all I/O 
requests -- but not before trying to dereference the NULL pointer!

Alan Stern

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-03  1:16                         ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-03  1:16 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, James Bottomley, Dave Jones, linux-scsi,
	linux-kernel, axboe, rjw, linux-usb

> > if (trace_scsi && dev->type == &scsi_dev_type &&
> >      trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) {
> > 	dev_printk ...
> 
> These extra puts could be coming from deep inside the device-model
> core, where they operate directly on the underlying kobject instead of
> going through put_device().  For example, removal of sysfs symlinks
> could have this effect.

Ok I could move it to kobject with some complications (need a real
back trace then)

> On my system, at least, the scsi_device's refcount dropped to 0 at the 
> right time.  That wasn't the problem.  The NULL pointer occurs because 
> the request_queue is used after the scsi_device has been removed from 
> visibility; among other things, __scsi_remove_device() sets 
> sdev->request_queue->queuedata to NULL.
> 
> As the comment says, this causes the request function to reject all I/O 
> requests -- but not before trying to dereference the NULL pointer!

Your explanation completely contradicts what James wrote earlier.

Maybe it's good if you guys come up with a common avenue of debugging
before I try further.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-03  1:16                         ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-03  1:16 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, James Bottomley, Dave Jones,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0,
	linux-usb-u79uwXL29TY76Z2rM5mHXA

> > if (trace_scsi && dev->type == &scsi_dev_type &&
> >      trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) {
> > 	dev_printk ...
> 
> These extra puts could be coming from deep inside the device-model
> core, where they operate directly on the underlying kobject instead of
> going through put_device().  For example, removal of sysfs symlinks
> could have this effect.

Ok I could move it to kobject with some complications (need a real
back trace then)

> On my system, at least, the scsi_device's refcount dropped to 0 at the 
> right time.  That wasn't the problem.  The NULL pointer occurs because 
> the request_queue is used after the scsi_device has been removed from 
> visibility; among other things, __scsi_remove_device() sets 
> sdev->request_queue->queuedata to NULL.
> 
> As the comment says, this causes the request function to reject all I/O 
> requests -- but not before trying to dereference the NULL pointer!

Your explanation completely contradicts what James wrote earlier.

Maybe it's good if you guys come up with a common avenue of debugging
before I try further.

-Andi

-- 
ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 19:59                   ` Alan Stern
@ 2011-07-03  1:17                     ` Andi Kleen
  2011-07-07 20:47                     ` solved was " Andi Kleen
  2011-07-18 16:59                     ` Dan Williams
  2 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-03  1:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, axboe, James Bottomley, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

> Would you like to try this patch to see if it fixes the problem?  As I 
> said before, I'm not certain it's the best thing to do, but it worked 
> on my system.

I can try next week.

Is this patch likely to fix the RCU stall?

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 17:37                 ` Alan Stern
  (?)
  (?)
@ 2011-07-03  9:14                 ` Dan Williams
  2011-07-03 18:16                   ` Andi Kleen
  -1 siblings, 1 reply; 63+ messages in thread
From: Dan Williams @ 2011-07-03  9:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Sat, Jul 2, 2011 at 10:37 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> The second bug, which hit me but apparently not any of you, is that the
> request_queue's elevator gets deallocated while it is still in use.
> That's because __scsi_remove_device() calls scsi_free_queue(), which
> does blk_cleanup_queue(), which calls elevator_exit(), even though the
> device file is still open and more requests will be submitted when the
> file is closed.
>
> I'm not sure of the right fix for this.  One possibility is to move the
> scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or
> maybe the elevator_exit() call should be moved to blk_release_queue().
>
> Also, I have no idea why this shows up with USB drives but not other
> SCSI transports.  A fluke of timing?
>

Not sure it's related, but it sounds like the hotplug failures being
seen with libsas hotplug.  Have not had a chance yet to track it down
further with isci, but mvsas developer Xiangliang Yu is reporting:

http://marc.info/?l=linux-scsi&m=130707166512002&w=2

He tracked it down to a suspected regression between .39-rc4 and
.39-rc5 but did not finalize the bisect.

The isci driver sees this and other signatures all seemingly related
to early device tear down and only when pulling a drive with in-flight
i/o.

--
Dan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-03 15:29                           ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-03 15:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sun, 3 Jul 2011, Andi Kleen wrote:

> > On my system, at least, the scsi_device's refcount dropped to 0 at the 
> > right time.  That wasn't the problem.  The NULL pointer occurs because 
> > the request_queue is used after the scsi_device has been removed from 
> > visibility; among other things, __scsi_remove_device() sets 
> > sdev->request_queue->queuedata to NULL.
> > 
> > As the comment says, this causes the request function to reject all I/O 
> > requests -- but not before trying to dereference the NULL pointer!
> 
> Your explanation completely contradicts what James wrote earlier.

Yes, it does.  I'm claiming that James was on a wrong track or looking
in the wrong direction.

> Maybe it's good if you guys come up with a common avenue of debugging
> before I try further.

It will be interesting to hear what James has to say about all this...

> > Would you like to try this patch to see if it fixes the problem?  As I 
> > said before, I'm not certain it's the best thing to do, but it worked 
> > on my system.
> 
> I can try next week.

Okay.

> Is this patch likely to fix the RCU stall?

I have no idea.  All I know about this RCU stall is what you wrote 
before, i.e., that it happened once.  No debugging information or 
anything else.  However if it happens again with the patch in place, 
we'll be in a good position to figure the cause and fix it.

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-03 15:29                           ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-03 15:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0,
	linux-usb-u79uwXL29TY76Z2rM5mHXA

On Sun, 3 Jul 2011, Andi Kleen wrote:

> > On my system, at least, the scsi_device's refcount dropped to 0 at the 
> > right time.  That wasn't the problem.  The NULL pointer occurs because 
> > the request_queue is used after the scsi_device has been removed from 
> > visibility; among other things, __scsi_remove_device() sets 
> > sdev->request_queue->queuedata to NULL.
> > 
> > As the comment says, this causes the request function to reject all I/O 
> > requests -- but not before trying to dereference the NULL pointer!
> 
> Your explanation completely contradicts what James wrote earlier.

Yes, it does.  I'm claiming that James was on a wrong track or looking
in the wrong direction.

> Maybe it's good if you guys come up with a common avenue of debugging
> before I try further.

It will be interesting to hear what James has to say about all this...

> > Would you like to try this patch to see if it fixes the problem?  As I 
> > said before, I'm not certain it's the best thing to do, but it worked 
> > on my system.
> 
> I can try next week.

Okay.

> Is this patch likely to fix the RCU stall?

I have no idea.  All I know about this RCU stall is what you wrote 
before, i.e., that it happened once.  No debugging information or 
anything else.  However if it happens again with the patch in place, 
we'll be in a good position to figure the cause and fix it.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-03 15:29                           ` Alan Stern
@ 2011-07-03 16:06                             ` Alan Stern
  -1 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-03 16:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sun, 3 Jul 2011, Alan Stern wrote:

> > Your explanation completely contradicts what James wrote earlier.
> 
> Yes, it does.  I'm claiming that James was on a wrong track or looking
> in the wrong direction.

To amplify: James wrote:

> Right, that wouldn't work.  The sdev in question comes from
> request_queue->queuedata.  That only goes to null when the last
> reference to the sdev has been released.  So the root cause is something
> in sd holding a reference to sdev without actually getting an additional
> refcount.

The third sentence is clearly wrong, because queuedata is set to NULL
when sdev is removed from visibility, not when the last reference to it
has been released, i.e., in __scsi_remove_device() rather than
scsi_device_dev_release_usercontext().

Maybe this means the assignment to queuedata should be moved.  I don't 
know; unlike James, I'm not an expert on the details of the SCSI stack.

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-03 16:06                             ` Alan Stern
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-03 16:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe,
	rjw, linux-usb

On Sun, 3 Jul 2011, Alan Stern wrote:

> > Your explanation completely contradicts what James wrote earlier.
> 
> Yes, it does.  I'm claiming that James was on a wrong track or looking
> in the wrong direction.

To amplify: James wrote:

> Right, that wouldn't work.  The sdev in question comes from
> request_queue->queuedata.  That only goes to null when the last
> reference to the sdev has been released.  So the root cause is something
> in sd holding a reference to sdev without actually getting an additional
> refcount.

The third sentence is clearly wrong, because queuedata is set to NULL
when sdev is removed from visibility, not when the last reference to it
has been released, i.e., in __scsi_remove_device() rather than
scsi_device_dev_release_usercontext().

Maybe this means the assignment to queuedata should be moved.  I don't 
know; unlike James, I'm not an expert on the details of the SCSI stack.

Alan Stern

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-03  9:14                 ` Dan Williams
@ 2011-07-03 18:16                   ` Andi Kleen
  0 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-03 18:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

> Not sure it's related, but it sounds like the hotplug failures being
> seen with libsas hotplug.  Have not had a chance yet to track it down
> further with isci, but mvsas developer Xiangliang Yu is reporting:

It looks somewhat different: he has a bad pointer, not a NULL pointer,
and it's in elv_completed_request() (but that may be related to 
the different request types)

> 
> http://marc.info/?l=linux-scsi&m=130707166512002&w=2
> 
> He tracked it down to a suspected regression between .39-rc4 and
> .39-rc5 but did not finalize the bisect.

I can check if it disappears with .39-rc4.

> The isci driver sees this and other signatures all seemingly related
> to early device tear down and only when pulling a drive with in-flight
> i/o.

In my case it's not really in flight IO, but umount after pull with an
idle drive.

-Andi

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 17:37                 ` Alan Stern
                                   ` (2 preceding siblings ...)
  (?)
@ 2011-07-03 20:37                 ` Stefan Richter
  2011-07-08 13:37                   ` Stefan Richter
  -1 siblings, 1 reply; 63+ messages in thread
From: Stefan Richter @ 2011-07-03 20:37 UTC (permalink / raw)
  To: Alan Stern
  Cc: axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Jul 02 Alan Stern wrote:
> Also, I have no idea why this shows up with USB drives but not other 
> SCSI transports.  A fluke of timing?

A while ago I frequently observed oopses at removal of a FireWire
CompactFlash card reader (an sd device which exposes itself as device with
removable medium).  At that time I wasn't motivated to track down whether
the bug resided in the block or SCSI or firewire subsystem.  The bug was
apparently triggered because hald was polling the device like crazy for
media changes, and that polling coincided with device hot unplug.

Furthermore, FireWire CD-ROM removal never has been an exactly glitch-free
experience because the SCSI stack occasionally went on to issue command
retries for many minutes after device removal a.k.a. DID_NO_CONNECT.  I
don't remember crashes during FireWire CD-ROM removal, but it has been a
while that I last used CD-ROM drives.

I might test the card reader and a CD-ROM later next week on 3.0-rc6 if I
find the time and the issue hadn't been resolved by then.  I don't have
hald anymore, but there are certainly other ways to force accesses during
device shutdown.
-- 
Stefan Richter
-=====-==-== -=== ---==
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-04 11:27                   ` Heiko Carstens
  0 siblings, 0 replies; 63+ messages in thread
From: Heiko Carstens @ 2011-07-04 11:27 UTC (permalink / raw)
  To: Alan Stern
  Cc: axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote:
> The second bug, which hit me but apparently not any of you, is that the 
> request_queue's elevator gets deallocated while it is still in use.  
> That's because __scsi_remove_device() calls scsi_free_queue(), which 
> does blk_cleanup_queue(), which calls elevator_exit(), even though the 
> device file is still open and more requests will be submitted when the 
> file is closed.
> 
> I'm not sure of the right fix for this.  One possibility is to move the 
> scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
> maybe the elevator_exit() call should be moved to blk_release_queue().
> 
> Also, I have no idea why this shows up with USB drives but not other 
> SCSI transports.  A fluke of timing?

FWIW, I reported a bug where the request_queue's elevator got deallocated
while it was still in use (fc transport with device hotplug):

http://www.spinics.net/lists/linux-scsi/msg52879.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-04 11:27                   ` Heiko Carstens
  0 siblings, 0 replies; 63+ messages in thread
From: Heiko Carstens @ 2011-07-04 11:27 UTC (permalink / raw)
  To: Alan Stern
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, James Bottomley, Andi Kleen,
	Dave Jones, SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote:
> The second bug, which hit me but apparently not any of you, is that the 
> request_queue's elevator gets deallocated while it is still in use.  
> That's because __scsi_remove_device() calls scsi_free_queue(), which 
> does blk_cleanup_queue(), which calls elevator_exit(), even though the 
> device file is still open and more requests will be submitted when the 
> file is closed.
> 
> I'm not sure of the right fix for this.  One possibility is to move the 
> scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
> maybe the elevator_exit() call should be moved to blk_release_queue().
> 
> Also, I have no idea why this shows up with USB drives but not other 
> SCSI transports.  A fluke of timing?

FWIW, I reported a bug where the request_queue's elevator got deallocated
while it was still in use (fc transport with device hotplug):

http://www.spinics.net/lists/linux-scsi/msg52879.html
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-04 11:27                   ` Heiko Carstens
  (?)
@ 2011-07-04 16:04                   ` Alan Stern
  2011-07-06  6:50                     ` Heiko Carstens
  2011-07-12 18:49                     ` Jonathan McDowell
  -1 siblings, 2 replies; 63+ messages in thread
From: Alan Stern @ 2011-07-04 16:04 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Mon, 4 Jul 2011, Heiko Carstens wrote:

> On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote:
> > The second bug, which hit me but apparently not any of you, is that the 
> > request_queue's elevator gets deallocated while it is still in use.  
> > That's because __scsi_remove_device() calls scsi_free_queue(), which 
> > does blk_cleanup_queue(), which calls elevator_exit(), even though the 
> > device file is still open and more requests will be submitted when the 
> > file is closed.
> > 
> > I'm not sure of the right fix for this.  One possibility is to move the 
> > scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
> > maybe the elevator_exit() call should be moved to blk_release_queue().
> > 
> > Also, I have no idea why this shows up with USB drives but not other 
> > SCSI transports.  A fluke of timing?
> 
> FWIW, I reported a bug where the request_queue's elevator got deallocated
> while it was still in use (fc transport with device hotplug):
> 
> http://www.spinics.net/lists/linux-scsi/msg52879.html

That does sound like the second bug I encountered.  Can you reproduce 
it?  Does the patch here:

	http://marc.info/?l=linux-kernel&m=130963676907731&w=2

fix the problem?

Alan Stern


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-04 16:04                   ` Alan Stern
@ 2011-07-06  6:50                     ` Heiko Carstens
  2011-07-12 18:49                     ` Jonathan McDowell
  1 sibling, 0 replies; 63+ messages in thread
From: Heiko Carstens @ 2011-07-06  6:50 UTC (permalink / raw)
  To: Alan Stern
  Cc: axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list, Manvanthara B. Puttashankar,
	Roland Dreier

On Mon, Jul 04, 2011 at 12:04:54PM -0400, Alan Stern wrote:
> On Mon, 4 Jul 2011, Heiko Carstens wrote:
> 
> > On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote:
> > > The second bug, which hit me but apparently not any of you, is that the 
> > > request_queue's elevator gets deallocated while it is still in use.  
> > > That's because __scsi_remove_device() calls scsi_free_queue(), which 
> > > does blk_cleanup_queue(), which calls elevator_exit(), even though the 
> > > device file is still open and more requests will be submitted when the 
> > > file is closed.
> > > 
> > > I'm not sure of the right fix for this.  One possibility is to move the 
> > > scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
> > > maybe the elevator_exit() call should be moved to blk_release_queue().
> > > 
> > > Also, I have no idea why this shows up with USB drives but not other 
> > > SCSI transports.  A fluke of timing?
> > 
> > FWIW, I reported a bug where the request_queue's elevator got deallocated
> > while it was still in use (fc transport with device hotplug):
> > 
> > http://www.spinics.net/lists/linux-scsi/msg52879.html
> 
> That does sound like the second bug I encountered.  Can you reproduce 
> it?  Does the patch here:
> 
> 	http://marc.info/?l=linux-kernel&m=130963676907731&w=2
> 
> fix the problem?

I used to be reproducible, however I have to ask Manvathara to try your
patch and see if it works.

Thanks!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* solved was Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 19:59                   ` Alan Stern
  2011-07-03  1:17                     ` Andi Kleen
@ 2011-07-07 20:47                     ` Andi Kleen
  2011-07-18 16:59                     ` Dan Williams
  2 siblings, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-07 20:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, axboe, James Bottomley, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

> Would you like to try this patch to see if it fixes the problem?  As I 
> said before, I'm not certain it's the best thing to do, but it worked 
> on my system.

Sorry for the delay. I can confirm this patch fixes the problem for me.
I did a few pulls and all performed fine. Thanks!

-Andi
> 
> Alan Stern
> 
> 
> 
> 
> Index: usb-3.0/drivers/scsi/scsi_lib.c
> ===================================================================
> --- usb-3.0.orig/drivers/scsi/scsi_lib.c
> +++ usb-3.0/drivers/scsi/scsi_lib.c
> @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
>  	struct scsi_device *sdev = q->queuedata;
>  	int ret = BLKPREP_KILL;
>  
> +	if (!sdev)
> +		return ret;
>  	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>  		ret = scsi_setup_blk_pc_cmnd(sdev, req);
>  	return scsi_prep_return(q, req, ret);
> Index: usb-3.0/drivers/scsi/scsi_sysfs.c
> ===================================================================
> --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
> +++ usb-3.0/drivers/scsi/scsi_sysfs.c
> @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
>  		kfree(evt);
>  	}
>  
> +	/* Freeing the queue signals to block that we're done */
> +	scsi_free_queue(sdev->request_queue);
>  	blk_put_queue(sdev->request_queue);
>  	/* NULL queue means the device can't be used */
>  	sdev->request_queue = NULL;
> @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
>  	/* cause the request function to reject all I/O requests */
>  	sdev->request_queue->queuedata = NULL;
>  
> -	/* Freeing the queue signals to block that we're done */
> -	scsi_free_queue(sdev->request_queue);
>  	put_device(dev);
>  }
>  
> 
> 

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-03 20:37                 ` Stefan Richter
@ 2011-07-08 13:37                   ` Stefan Richter
  2011-07-08 13:41                       ` Stefan Richter
  0 siblings, 1 reply; 63+ messages in thread
From: Stefan Richter @ 2011-07-08 13:37 UTC (permalink / raw)
  To: SCSI development list
  Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones,
	Kernel development list, Rafael J. Wysocki, USB list

On Jul 03 Stefan Richter wrote:
> On Jul 02 Alan Stern wrote:
> > Also, I have no idea why this shows up with USB drives but not other 
> > SCSI transports.  A fluke of timing?
> 
> A while ago I frequently observed oopses at removal of a FireWire
> CompactFlash card reader (an sd device which exposes itself as device with
> removable medium).  At that time I wasn't motivated to track down whether
> the bug resided in the block or SCSI or firewire subsystem.  The bug was
> apparently triggered because hald was polling the device like crazy for
> media changes, and that polling coincided with device hot unplug.
> 
> Furthermore, FireWire CD-ROM removal never has been an exactly glitch-free
> experience because the SCSI stack occasionally went on to issue command
> retries for many minutes after device removal a.k.a. DID_NO_CONNECT.  I
> don't remember crashes during FireWire CD-ROM removal, but it has been a
> while that I last used CD-ROM drives.
> 
> I might test the card reader and a CD-ROM later next week on 3.0-rc6 if I
> find the time and the issue hadn't been resolved by then.  I don't have
> hald anymore, but there are certainly other ways to force accesses during
> device shutdown.

I have not tested FireWire CD-ROMs or card reader yet.

But today I did get the following crash on 3.0-rc6-71-g4dd1b49 x86-64 plus
Alan's "USB: additional regression fix for device removal" when I powered
down an USB hub with an empty USB card reader attached (or when I powered
it up, impossible for me to say):

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC

PID: 8299, comm: kworker/0:0 Not tainted
RIP: [...] elv_put_request+0xb/0x1b
[...]
Call Trace:
__blk_put_request
? blk_put_request
blk_put_request
scsi_execute
scsi_execute_req
? sd_check_events
scsi_test_unit_ready
? kmem_cache_alloc
? sd_check_events
sd_check_events
disk_events_workfn
[...]

After this trace, the following messages were logged:
generic-usb 0003:046D:C51.0059: can't reset device, 0000:00:12.2-3.1.3/input/0, status -71
usb 1-3.1: clear tt 1 (0530) error -19
BUG: unable to handle kernel paging request at fffffffffffffff8
IP: [...] kthread_data+0xb/0x11
Pid: 8299, comm: kworker/0:0 Tainted: G D
RIP: 0010:[<ffffffff8104aa8f>]

I can upload a screenshot of this crash if desired.

The 2.6.39 and 3.0 block and/or SCSI layer are in a really sad state.
-- 
Stefan Richter
-=====-==-== -=== -=---
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-08 13:37                   ` Stefan Richter
@ 2011-07-08 13:41                       ` Stefan Richter
  0 siblings, 0 replies; 63+ messages in thread
From: Stefan Richter @ 2011-07-08 13:41 UTC (permalink / raw)
  To: linux-scsi
  Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones,
	Kernel development list, Rafael J. Wysocki, USB list

On Jul 08 Stefan Richter wrote:
> But today I did get the following crash on 3.0-rc6-71-g4dd1b49 x86-64 plus
> Alan's "USB: additional regression fix for device removal" when I powered
> down an USB hub with an empty USB card reader attached (or when I powered
> it up, impossible for me to say):

Judging from the missing hours in an application log, it happened at
power down of the hub = at removal of the card reader.
-- 
Stefan Richter
-=====-==-== -=== -=---
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-08 13:41                       ` Stefan Richter
  0 siblings, 0 replies; 63+ messages in thread
From: Stefan Richter @ 2011-07-08 13:41 UTC (permalink / raw)
  To: linux-scsi
  Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones,
	Kernel development list, Rafael J. Wysocki, USB list

On Jul 08 Stefan Richter wrote:
> But today I did get the following crash on 3.0-rc6-71-g4dd1b49 x86-64 plus
> Alan's "USB: additional regression fix for device removal" when I powered
> down an USB hub with an empty USB card reader attached (or when I powered
> it up, impossible for me to say):

Judging from the missing hours in an application log, it happened at
power down of the hub = at removal of the card reader.
-- 
Stefan Richter
-=====-==-== -=== -=---
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-04 16:04                   ` Alan Stern
  2011-07-06  6:50                     ` Heiko Carstens
@ 2011-07-12 18:49                     ` Jonathan McDowell
  1 sibling, 0 replies; 63+ messages in thread
From: Jonathan McDowell @ 2011-07-12 18:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Heiko Carstens, axboe, James Bottomley, Andi Kleen, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Mon, Jul 04, 2011 at 12:04:54PM -0400, Alan Stern wrote:
> On Mon, 4 Jul 2011, Heiko Carstens wrote:
> 
> > On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote:
> > > The second bug, which hit me but apparently not any of you, is that the 
> > > request_queue's elevator gets deallocated while it is still in use.  
> > > That's because __scsi_remove_device() calls scsi_free_queue(), which 
> > > does blk_cleanup_queue(), which calls elevator_exit(), even though the 
> > > device file is still open and more requests will be submitted when the 
> > > file is closed.
> > > 
> > > I'm not sure of the right fix for this.  One possibility is to move the 
> > > scsi_free_queue() call to scsi_device_dev_release_usercontext().  Or 
> > > maybe the elevator_exit() call should be moved to blk_release_queue().
> > > 
> > > Also, I have no idea why this shows up with USB drives but not other 
> > > SCSI transports.  A fluke of timing?
> > 
> > FWIW, I reported a bug where the request_queue's elevator got deallocated
> > while it was still in use (fc transport with device hotplug):
> > 
> > http://www.spinics.net/lists/linux-scsi/msg52879.html
> 
> That does sound like the second bug I encountered.  Can you reproduce 
> it?  Does the patch here:
> 
> 	http://marc.info/?l=linux-kernel&m=130963676907731&w=2
> 
> fix the problem?

FWIW I'm seeing crashes when FC devices go away while in use as well,
under 2.6.39 and 3.0.0-rc6. I will try the patch linked to above, but
the most recent Oops was:

[71286.103409] end_request: I/O error, dev sdaw, sector 0
[71286.113710] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[71286.117681] IP: [<ffffffff81197828>] elv_completed_request+0x38/0x47
[71286.117681] PGD 2571c8067 PUD 253b81067 PMD 0 
[71286.117681] Oops: 0000 [#1] SMP 
[71286.117681] CPU 0 
[71286.117681] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables autofs4 ipv6 kvm_intel kvm nfsd nfs lockd auth_rpcgss nfs_acl sunrpc dm_round_robin dm_multipath scsi_dh ipmi_devintf ipmi_si ipmi_msghandler sg evdev processor button thermal_sys serio_raw i5k_amb i2c_i801 ioatdma i2c_core dca rng_core tpm_tis tpm tpm_bios ext3 jbd dm_mod ses enclosure ata_generic ata_piix lpfc scsi_transport_fc scsi_tgt [last unloaded: scsi_wait_scan]
[71286.117681] 
[71286.117681] Pid: 0, comm: swapper Not tainted 3.0.0-rc6 #15 Intel S5000PAL./S5000PAL0 
[71286.117681] RIP: 0010:[<ffffffff81197828>]  [<ffffffff81197828>] elv_completed_request+0x38/0x47
[71286.117681] RSP: 0018:ffff88025fc03e10  EFLAGS: 00010002
[71286.117681] RAX: 0000000000000000 RBX: ffff880253cdc1c0 RCX: 00000000000003fe
[71286.117681] RDX: ffff880253155840 RSI: ffff880255e37c70 RDI: ffff880253cdc1c0
[71286.117681] RBP: ffff880255e37c70 R08: 00000001010ec65f R09: 0000000000000000
[71286.117681] R10: ffff880255e37c70 R11: ffffffff817e3e98 R12: 00000000fffffffb
[71286.117681] R13: 0000000000000246 R14: 0000000000000000 R15: 0000000000000000
[71286.117681] FS:  0000000000000000(0000) GS:ffff88025fc00000(0000) knlGS:0000000000000000
[71286.117681] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[71286.117681] CR2: 0000000000000048 CR3: 0000000257144000 CR4: 00000000000006f0
[71286.117681] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[71286.117681] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[71286.117681] Process swapper (pid: 0, threadinfo ffffffff81600000, task ffffffff8165b020)
[71286.117681] Stack:
[71286.117681]  ffff880255e37c70 ffffffff8119c27e ffff880255e37c70 ffff880253cdc1c0
[71286.117681]  00000000fffffffb ffffffff8119d0c1 0000000000000000 ffff880255d733c0
[71286.117681]  ffff880255e37c70 0000000000000000 00000000fffffffb ffffffff8122dfbb
[71286.117681] Call Trace:
[71286.117681]  <IRQ> 
[71286.117681]  [<ffffffff8119c27e>] ? __blk_put_request+0x2e/0xb0
[71286.117681]  [<ffffffff8119d0c1>] ? blk_end_bidi_request+0x3b/0x55
[71286.117681]  [<ffffffff8122dfbb>] ? scsi_io_completion+0x431/0x48e
[71286.117681]  [<ffffffff811a110f>] ? blk_done_softirq+0x5f/0x6c
[71286.117681]  [<ffffffff8103bc7d>] ? __do_softirq+0xbe/0x194
[71286.117681]  [<ffffffff810569c6>] ? timekeeping_get_ns+0xd/0x2a
[71286.117681]  [<ffffffff8130dc0c>] ? call_softirq+0x1c/0x30
[71286.117681]  [<ffffffff81003fc5>] ? do_softirq+0x31/0x63
[71286.117681]  [<ffffffff8103ba69>] ? irq_exit+0x3f/0x9f
[71286.117681]  [<ffffffff8130d873>] ? call_function_single_interrupt+0x13/0x20
[71286.117681]  <EOI> 
[71286.117681]  [<ffffffffa012d0ca>] ? acpi_idle_enter_simple+0xb4/0xe2 [processor]
[71286.117681]  [<ffffffffa012d0c5>] ? acpi_idle_enter_simple+0xaf/0xe2 [processor]
[71286.117681]  [<ffffffff81277aba>] ? cpuidle_idle_call+0xe4/0x162
[71286.117681]  [<ffffffff81001da4>] ? cpu_idle+0xa5/0xdb
[71286.117681]  [<ffffffff816c1ba8>] ? start_kernel+0x38e/0x399
[71286.117681]  [<ffffffff816c138f>] ? x86_64_start_kernel+0xee/0xf2
[71286.117681] Code: 40 74 35 83 7e 44 01 74 04 a8 40 74 2b 83 e0 11 ff c8 0f 95 c0 83 e0 01 48 05 fc 00 00 00 ff 4c 87 04 f6 46 41 04 74 10 48 8b 02 
[71286.117681]  8b 40 48 48 85 c0 74 04 41 58 ff e0 59 c3 48 83 ec 08 48 8d 
[71286.117681] RIP  [<ffffffff81197828>] elv_completed_request+0x38/0x47
[71286.117681]  RSP <ffff88025fc03e10>
[71286.117681] CR2: 0000000000000048
[71286.117681] ---[ end trace 242b012d98a46112 ]---
[71286.117681] Kernel panic - not syncing: Fatal exception in interrupt

J.

-- 
Listen to the words, they tell you what to do...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-02 19:59                   ` Alan Stern
  2011-07-03  1:17                     ` Andi Kleen
  2011-07-07 20:47                     ` solved was " Andi Kleen
@ 2011-07-18 16:59                     ` Dan Williams
  2011-07-18 18:00                       ` Andi Kleen
  2011-07-20  9:58                         ` Jack Wang
  2 siblings, 2 replies; 63+ messages in thread
From: Dan Williams @ 2011-07-18 16:59 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andi Kleen, axboe, James Bottomley, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 2 Jul 2011, Andi Kleen wrote:
>
>> > The problem is that blk_peek_request() calls scsi_prep_fn(), which
>> > does this:
>> >
>> >     struct scsi_device *sdev = q->queuedata;
>> >     int ret = BLKPREP_KILL;
>> >
>> >     if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>> >             ret = scsi_setup_blk_pc_cmnd(sdev, req);
>> >     return scsi_prep_return(q, req, ret);
>> >
>> > It doesn't check to see if sdev is NULL, nor does
>> > scsi_setup_blk_pc_cmnd().  That accounts for this error:
>>
>> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
>> but that just caused RCU CPU stalls afterwards and then eventually
>> a hung system.
>
> The RCU problem is likely to be a separate issue.  It might even be a
> result of the use-after-free problem with the elevator.
>
> At any rate, it's clear that the crash in the refcounting log you
> posted occurred because scsi_setup_blk_pc_cmnd() called
> scsi_prep_state_check(), which tried to dereference the NULL pointer.
>
> Would you like to try this patch to see if it fixes the problem?  As I
> said before, I'm not certain it's the best thing to do, but it worked
> on my system.
>
> Alan Stern
>
>
>
>
> Index: usb-3.0/drivers/scsi/scsi_lib.c
> ===================================================================
> --- usb-3.0.orig/drivers/scsi/scsi_lib.c
> +++ usb-3.0/drivers/scsi/scsi_lib.c
> @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
>        struct scsi_device *sdev = q->queuedata;
>        int ret = BLKPREP_KILL;
>
> +       if (!sdev)
> +               return ret;
>        if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>                ret = scsi_setup_blk_pc_cmnd(sdev, req);
>        return scsi_prep_return(q, req, ret);
> Index: usb-3.0/drivers/scsi/scsi_sysfs.c
> ===================================================================
> --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
> +++ usb-3.0/drivers/scsi/scsi_sysfs.c
> @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
>                kfree(evt);
>        }
>
> +       /* Freeing the queue signals to block that we're done */
> +       scsi_free_queue(sdev->request_queue);
>        blk_put_queue(sdev->request_queue);
>        /* NULL queue means the device can't be used */
>        sdev->request_queue = NULL;
> @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
>        /* cause the request function to reject all I/O requests */
>        sdev->request_queue->queuedata = NULL;
>
> -       /* Freeing the queue signals to block that we're done */
> -       scsi_free_queue(sdev->request_queue);
>        put_device(dev);
>  }

This patch seems to resolve the block/scsi null-ptr de-references in
our libsas/isci environment, we have yet to try James' alternative
[1].  Do we potentially need both?

Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
seems only the "sdev->request_queue->queuedata = NULL" needed to be
moved?

The conversation appeared to be awaiting test results...

[1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2

--
Dan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-18 16:59                     ` Dan Williams
@ 2011-07-18 18:00                       ` Andi Kleen
  2011-07-20  9:58                         ` Jack Wang
  1 sibling, 0 replies; 63+ messages in thread
From: Andi Kleen @ 2011-07-18 18:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alan Stern, Andi Kleen, axboe, James Bottomley, Dave Jones,
	SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

> This patch seems to resolve the block/scsi null-ptr de-references in
> our libsas/isci environment, we have yet to try James' alternative
> [1].  Do we potentially need both?
> 
> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
> seems only the "sdev->request_queue->queuedata = NULL" needed to be
> moved?
> 
> The conversation appeared to be awaiting test results...

Alan's patch worked for me too.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-18 16:59                     ` Dan Williams
@ 2011-07-20  9:58                         ` Jack Wang
  2011-07-20  9:58                         ` Jack Wang
  1 sibling, 0 replies; 63+ messages in thread
From: Jack Wang @ 2011-07-20  9:58 UTC (permalink / raw)
  To: 'Dan Williams', 'Alan Stern'
  Cc: 'Andi Kleen', axboe, 'James Bottomley',
	'Dave Jones', 'SCSI development list',
	'Kernel development list', 'Rafael J. Wysocki',
	'USB list'

> 
> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu>
wrote:
> > On Sat, 2 Jul 2011, Andi Kleen wrote:
> >
> >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which
> >> > does this:
> >> >
> >> >     struct scsi_device *sdev = q->queuedata;
> >> >     int ret = BLKPREP_KILL;
> >> >
> >> >     if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> >> >             ret = scsi_setup_blk_pc_cmnd(sdev, req);
> >> >     return scsi_prep_return(q, req, ret);
> >> >
> >> > It doesn't check to see if sdev is NULL, nor does
> >> > scsi_setup_blk_pc_cmnd().  That accounts for this error:
> >>
> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
> >> but that just caused RCU CPU stalls afterwards and then eventually
> >> a hung system.
> >
> > The RCU problem is likely to be a separate issue.  It might even be a
> > result of the use-after-free problem with the elevator.
> >
> > At any rate, it's clear that the crash in the refcounting log you
> > posted occurred because scsi_setup_blk_pc_cmnd() called
> > scsi_prep_state_check(), which tried to dereference the NULL pointer.
> >
> > Would you like to try this patch to see if it fixes the problem?  As I
> > said before, I'm not certain it's the best thing to do, but it worked
> > on my system.
> >
> > Alan Stern
> >
> >
> >
> >
> > Index: usb-3.0/drivers/scsi/scsi_lib.c
> > ===================================================================
> > --- usb-3.0.orig/drivers/scsi/scsi_lib.c
> > +++ usb-3.0/drivers/scsi/scsi_lib.c
> > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
> >        struct scsi_device *sdev = q->queuedata;
> >        int ret = BLKPREP_KILL;
> >
> > +       if (!sdev)
> > +               return ret;
> >        if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> >                ret = scsi_setup_blk_pc_cmnd(sdev, req);
> >        return scsi_prep_return(q, req, ret);
> > Index: usb-3.0/drivers/scsi/scsi_sysfs.c
> > ===================================================================
> > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
> > +++ usb-3.0/drivers/scsi/scsi_sysfs.c
> > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
> >                kfree(evt);
> >        }
> >
> > +       /* Freeing the queue signals to block that we're done */
> > +       scsi_free_queue(sdev->request_queue);
> >        blk_put_queue(sdev->request_queue);
> >        /* NULL queue means the device can't be used */
> >        sdev->request_queue = NULL;
> > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
> >        /* cause the request function to reject all I/O requests */
> >        sdev->request_queue->queuedata = NULL;
> >
> > -       /* Freeing the queue signals to block that we're done */
> > -       scsi_free_queue(sdev->request_queue);
> >        put_device(dev);
> >  }
> 
> This patch seems to resolve the block/scsi null-ptr de-references in
> our libsas/isci environment, we have yet to try James' alternative
> [1].  Do we potentially need both?
> 
> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
> seems only the "sdev->request_queue->queuedata = NULL" needed to be
> moved?
> 
> The conversation appeared to be awaiting test results...
> 
> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2
> 
> --
> Dan
[Jack Wang] 
This patch fix kernel panic issue when hot-plut disk during I/O, I test it
using pm8001 with 3.0.0-rc6 with above patch.


> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: Linux 3.0 oopses when pulling a USB CDROM
@ 2011-07-20  9:58                         ` Jack Wang
  0 siblings, 0 replies; 63+ messages in thread
From: Jack Wang @ 2011-07-20  9:58 UTC (permalink / raw)
  To: 'Dan Williams', 'Alan Stern'
  Cc: 'Andi Kleen', axboe, 'James Bottomley',
	'Dave Jones', 'SCSI development list',
	'Kernel development list', 'Rafael J. Wysocki',
	'USB list'

> 
> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu>
wrote:
> > On Sat, 2 Jul 2011, Andi Kleen wrote:
> >
> >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which
> >> > does this:
> >> >
> >> >     struct scsi_device *sdev = q->queuedata;
> >> >     int ret = BLKPREP_KILL;
> >> >
> >> >     if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> >> >             ret = scsi_setup_blk_pc_cmnd(sdev, req);
> >> >     return scsi_prep_return(q, req, ret);
> >> >
> >> > It doesn't check to see if sdev is NULL, nor does
> >> > scsi_setup_blk_pc_cmnd().  That accounts for this error:
> >>
> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
> >> but that just caused RCU CPU stalls afterwards and then eventually
> >> a hung system.
> >
> > The RCU problem is likely to be a separate issue.  It might even be a
> > result of the use-after-free problem with the elevator.
> >
> > At any rate, it's clear that the crash in the refcounting log you
> > posted occurred because scsi_setup_blk_pc_cmnd() called
> > scsi_prep_state_check(), which tried to dereference the NULL pointer.
> >
> > Would you like to try this patch to see if it fixes the problem?  As I
> > said before, I'm not certain it's the best thing to do, but it worked
> > on my system.
> >
> > Alan Stern
> >
> >
> >
> >
> > Index: usb-3.0/drivers/scsi/scsi_lib.c
> > ===================================================================
> > --- usb-3.0.orig/drivers/scsi/scsi_lib.c
> > +++ usb-3.0/drivers/scsi/scsi_lib.c
> > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
> >        struct scsi_device *sdev = q->queuedata;
> >        int ret = BLKPREP_KILL;
> >
> > +       if (!sdev)
> > +               return ret;
> >        if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> >                ret = scsi_setup_blk_pc_cmnd(sdev, req);
> >        return scsi_prep_return(q, req, ret);
> > Index: usb-3.0/drivers/scsi/scsi_sysfs.c
> > ===================================================================
> > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
> > +++ usb-3.0/drivers/scsi/scsi_sysfs.c
> > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
> >                kfree(evt);
> >        }
> >
> > +       /* Freeing the queue signals to block that we're done */
> > +       scsi_free_queue(sdev->request_queue);
> >        blk_put_queue(sdev->request_queue);
> >        /* NULL queue means the device can't be used */
> >        sdev->request_queue = NULL;
> > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
> >        /* cause the request function to reject all I/O requests */
> >        sdev->request_queue->queuedata = NULL;
> >
> > -       /* Freeing the queue signals to block that we're done */
> > -       scsi_free_queue(sdev->request_queue);
> >        put_device(dev);
> >  }
> 
> This patch seems to resolve the block/scsi null-ptr de-references in
> our libsas/isci environment, we have yet to try James' alternative
> [1].  Do we potentially need both?
> 
> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
> seems only the "sdev->request_queue->queuedata = NULL" needed to be
> moved?
> 
> The conversation appeared to be awaiting test results...
> 
> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2
> 
> --
> Dan
[Jack Wang] 
This patch fix kernel panic issue when hot-plut disk during I/O, I test it
using pm8001 with 3.0.0-rc6 with above patch.


> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-07-20  9:58                         ` Jack Wang
  (?)
@ 2011-10-18 21:16                         ` Ankit Jain
  2011-10-18 21:30                           ` James Bottomley
  -1 siblings, 1 reply; 63+ messages in thread
From: Ankit Jain @ 2011-10-18 21:16 UTC (permalink / raw)
  To: Jack Wang
  Cc: Dan Williams, Alan Stern, Andi Kleen, axboe, James Bottomley,
	Dave Jones, SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Wed, Jul 20, 2011 at 3:28 PM, Jack Wang <jack_wang@usish.com> wrote:
>>
<snip>
>> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu>
> wrote:
>> > On Sat, 2 Jul 2011, Andi Kleen wrote:
>> >
>> >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which
>> >> > does this:
>> >> >
>> >> >     struct scsi_device *sdev = q->queuedata;
>> >> >     int ret = BLKPREP_KILL;
>> >> >
>> >> >     if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>> >> >             ret = scsi_setup_blk_pc_cmnd(sdev, req);
>> >> >     return scsi_prep_return(q, req, ret);
>> >> >
>> >> > It doesn't check to see if sdev is NULL, nor does
>> >> > scsi_setup_blk_pc_cmnd().  That accounts for this error:
>> >>
>> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
>> >> but that just caused RCU CPU stalls afterwards and then eventually
>> >> a hung system.
>> >
>> > The RCU problem is likely to be a separate issue.  It might even be a
>> > result of the use-after-free problem with the elevator.
>> >
>> > At any rate, it's clear that the crash in the refcounting log you
>> > posted occurred because scsi_setup_blk_pc_cmnd() called
>> > scsi_prep_state_check(), which tried to dereference the NULL pointer.
>> >
>> > Would you like to try this patch to see if it fixes the problem?  As I
>> > said before, I'm not certain it's the best thing to do, but it worked
>> > on my system.
>> >
>> > Alan Stern
>> >
>> >
>> >
>> >
>> > Index: usb-3.0/drivers/scsi/scsi_lib.c
>> > ===================================================================
>> > --- usb-3.0.orig/drivers/scsi/scsi_lib.c
>> > +++ usb-3.0/drivers/scsi/scsi_lib.c
>> > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
>> >        struct scsi_device *sdev = q->queuedata;
>> >        int ret = BLKPREP_KILL;
>> >
>> > +       if (!sdev)
>> > +               return ret;
>> >        if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>> >                ret = scsi_setup_blk_pc_cmnd(sdev, req);
>> >        return scsi_prep_return(q, req, ret);
>> > Index: usb-3.0/drivers/scsi/scsi_sysfs.c
>> > ===================================================================
>> > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
>> > +++ usb-3.0/drivers/scsi/scsi_sysfs.c
>> > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
>> >                kfree(evt);
>> >        }
>> >
>> > +       /* Freeing the queue signals to block that we're done */
>> > +       scsi_free_queue(sdev->request_queue);
>> >        blk_put_queue(sdev->request_queue);
>> >        /* NULL queue means the device can't be used */
>> >        sdev->request_queue = NULL;
>> > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
>> >        /* cause the request function to reject all I/O requests */
>> >        sdev->request_queue->queuedata = NULL;
>> >
>> > -       /* Freeing the queue signals to block that we're done */
>> > -       scsi_free_queue(sdev->request_queue);
>> >        put_device(dev);
>> >  }
>>
>> This patch seems to resolve the block/scsi null-ptr de-references in
>> our libsas/isci environment, we have yet to try James' alternative
>> [1].  Do we potentially need both?
>>
>> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
>> seems only the "sdev->request_queue->queuedata = NULL" needed to be
>> moved?
>>
>> The conversation appeared to be awaiting test results...
>>
>> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2
>>
>> --
>> Dan
> [Jack Wang]
> This patch fix kernel panic issue when hot-plut disk during I/O, I test it
> using pm8001 with 3.0.0-rc6 with above patch.

I don't see this patch in scsi-misc-2.6 or linus' tree. Is there a
different patch that fixes the
issue?

-Ankit

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-10-18 21:16                         ` Ankit Jain
@ 2011-10-18 21:30                           ` James Bottomley
  2011-10-21 13:26                             ` Hannes Reinecke
  0 siblings, 1 reply; 63+ messages in thread
From: James Bottomley @ 2011-10-18 21:30 UTC (permalink / raw)
  To: Ankit Jain
  Cc: Jack Wang, Dan Williams, Alan Stern, Andi Kleen, axboe,
	Dave Jones, SCSI development list, Kernel development list,
	Rafael J. Wysocki, USB list

On Wed, 2011-10-19 at 02:46 +0530, Ankit Jain wrote:
> On Wed, Jul 20, 2011 at 3:28 PM, Jack Wang <jack_wang@usish.com> wrote:
> >>
> <snip>
> >> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu>
> > wrote:
> >> > On Sat, 2 Jul 2011, Andi Kleen wrote:
> >> >
> >> >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which
> >> >> > does this:
> >> >> >
> >> >> >     struct scsi_device *sdev = q->queuedata;
> >> >> >     int ret = BLKPREP_KILL;
> >> >> >
> >> >> >     if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> >> >> >             ret = scsi_setup_blk_pc_cmnd(sdev, req);
> >> >> >     return scsi_prep_return(q, req, ret);
> >> >> >
> >> >> > It doesn't check to see if sdev is NULL, nor does
> >> >> > scsi_setup_blk_pc_cmnd().  That accounts for this error:
> >> >>
> >> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
> >> >> but that just caused RCU CPU stalls afterwards and then eventually
> >> >> a hung system.
> >> >
> >> > The RCU problem is likely to be a separate issue.  It might even be a
> >> > result of the use-after-free problem with the elevator.
> >> >
> >> > At any rate, it's clear that the crash in the refcounting log you
> >> > posted occurred because scsi_setup_blk_pc_cmnd() called
> >> > scsi_prep_state_check(), which tried to dereference the NULL pointer.
> >> >
> >> > Would you like to try this patch to see if it fixes the problem?  As I
> >> > said before, I'm not certain it's the best thing to do, but it worked
> >> > on my system.
> >> >
> >> > Alan Stern
> >> >
> >> >
> >> >
> >> >
> >> > Index: usb-3.0/drivers/scsi/scsi_lib.c
> >> > ===================================================================
> >> > --- usb-3.0.orig/drivers/scsi/scsi_lib.c
> >> > +++ usb-3.0/drivers/scsi/scsi_lib.c
> >> > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
> >> >        struct scsi_device *sdev = q->queuedata;
> >> >        int ret = BLKPREP_KILL;
> >> >
> >> > +       if (!sdev)
> >> > +               return ret;
> >> >        if (req->cmd_type == REQ_TYPE_BLOCK_PC)
> >> >                ret = scsi_setup_blk_pc_cmnd(sdev, req);
> >> >        return scsi_prep_return(q, req, ret);
> >> > Index: usb-3.0/drivers/scsi/scsi_sysfs.c
> >> > ===================================================================
> >> > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
> >> > +++ usb-3.0/drivers/scsi/scsi_sysfs.c
> >> > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
> >> >                kfree(evt);
> >> >        }
> >> >
> >> > +       /* Freeing the queue signals to block that we're done */
> >> > +       scsi_free_queue(sdev->request_queue);
> >> >        blk_put_queue(sdev->request_queue);
> >> >        /* NULL queue means the device can't be used */
> >> >        sdev->request_queue = NULL;
> >> > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
> >> >        /* cause the request function to reject all I/O requests */
> >> >        sdev->request_queue->queuedata = NULL;
> >> >
> >> > -       /* Freeing the queue signals to block that we're done */
> >> > -       scsi_free_queue(sdev->request_queue);
> >> >        put_device(dev);
> >> >  }
> >>
> >> This patch seems to resolve the block/scsi null-ptr de-references in
> >> our libsas/isci environment, we have yet to try James' alternative
> >> [1].  Do we potentially need both?
> >>
> >> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
> >> seems only the "sdev->request_queue->queuedata = NULL" needed to be
> >> moved?
> >>
> >> The conversation appeared to be awaiting test results...
> >>
> >> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2
> >>
> >> --
> >> Dan
> > [Jack Wang]
> > This patch fix kernel panic issue when hot-plut disk during I/O, I test it
> > using pm8001 with 3.0.0-rc6 with above patch.
> 
> I don't see this patch in scsi-misc-2.6 or linus' tree. Is there a
> different patch that fixes the
> issue?

It should be fixed by

commit 777eb1bf15b8532c396821774bf6451e563438f5
Author: Hannes Reinecke <hare@suse.de>
Date:   Wed Sep 28 08:07:01 2011 -0600

    block: Free queue resources at blk_release_queue()

James



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Linux 3.0 oopses when pulling a USB CDROM
  2011-10-18 21:30                           ` James Bottomley
@ 2011-10-21 13:26                             ` Hannes Reinecke
  0 siblings, 0 replies; 63+ messages in thread
From: Hannes Reinecke @ 2011-10-21 13:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ankit Jain, Jack Wang, Dan Williams, Alan Stern, Andi Kleen,
	axboe, Dave Jones, SCSI development list,
	Kernel development list, Rafael J. Wysocki, USB list

On 10/18/2011 11:30 PM, James Bottomley wrote:
> On Wed, 2011-10-19 at 02:46 +0530, Ankit Jain wrote:
>> On Wed, Jul 20, 2011 at 3:28 PM, Jack Wang<jack_wang@usish.com>  wrote:
>>>>
>> <snip>
>>>> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern<stern@rowland.harvard.edu>
>>> wrote:
>>>>> On Sat, 2 Jul 2011, Andi Kleen wrote:
>>>>>
>>>>>>> The problem is that blk_peek_request() calls scsi_prep_fn(), which
>>>>>>> does this:
>>>>>>>
>>>>>>>      struct scsi_device *sdev = q->queuedata;
>>>>>>>      int ret = BLKPREP_KILL;
>>>>>>>
>>>>>>>      if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>>>>>>>              ret = scsi_setup_blk_pc_cmnd(sdev, req);
>>>>>>>      return scsi_prep_return(q, req, ret);
>>>>>>>
>>>>>>> It doesn't check to see if sdev is NULL, nor does
>>>>>>> scsi_setup_blk_pc_cmnd().  That accounts for this error:
>>>>>>
>>>>>> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on,
>>>>>> but that just caused RCU CPU stalls afterwards and then eventually
>>>>>> a hung system.
>>>>>
>>>>> The RCU problem is likely to be a separate issue.  It might even be a
>>>>> result of the use-after-free problem with the elevator.
>>>>>
>>>>> At any rate, it's clear that the crash in the refcounting log you
>>>>> posted occurred because scsi_setup_blk_pc_cmnd() called
>>>>> scsi_prep_state_check(), which tried to dereference the NULL pointer.
>>>>>
>>>>> Would you like to try this patch to see if it fixes the problem?  As I
>>>>> said before, I'm not certain it's the best thing to do, but it worked
>>>>> on my system.
>>>>>
>>>>> Alan Stern
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Index: usb-3.0/drivers/scsi/scsi_lib.c
>>>>> ===================================================================
>>>>> --- usb-3.0.orig/drivers/scsi/scsi_lib.c
>>>>> +++ usb-3.0/drivers/scsi/scsi_lib.c
>>>>> @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q
>>>>>         struct scsi_device *sdev = q->queuedata;
>>>>>         int ret = BLKPREP_KILL;
>>>>>
>>>>> +       if (!sdev)
>>>>> +               return ret;
>>>>>         if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>>>>>                 ret = scsi_setup_blk_pc_cmnd(sdev, req);
>>>>>         return scsi_prep_return(q, req, ret);
>>>>> Index: usb-3.0/drivers/scsi/scsi_sysfs.c
>>>>> ===================================================================
>>>>> --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c
>>>>> +++ usb-3.0/drivers/scsi/scsi_sysfs.c
>>>>> @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user
>>>>>                 kfree(evt);
>>>>>         }
>>>>>
>>>>> +       /* Freeing the queue signals to block that we're done */
>>>>> +       scsi_free_queue(sdev->request_queue);
>>>>>         blk_put_queue(sdev->request_queue);
>>>>>         /* NULL queue means the device can't be used */
>>>>>         sdev->request_queue = NULL;
>>>>> @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de
>>>>>         /* cause the request function to reject all I/O requests */
>>>>>         sdev->request_queue->queuedata = NULL;
>>>>>
>>>>> -       /* Freeing the queue signals to block that we're done */
>>>>> -       scsi_free_queue(sdev->request_queue);
>>>>>         put_device(dev);
>>>>>   }
>>>>
>>>> This patch seems to resolve the block/scsi null-ptr de-references in
>>>> our libsas/isci environment, we have yet to try James' alternative
>>>> [1].  Do we potentially need both?
>>>>
>>>> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it
>>>> seems only the "sdev->request_queue->queuedata = NULL" needed to be
>>>> moved?
>>>>
>>>> The conversation appeared to be awaiting test results...
>>>>
>>>> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2
>>>>
>>>> --
>>>> Dan
>>> [Jack Wang]
>>> This patch fix kernel panic issue when hot-plut disk during I/O, I test it
>>> using pm8001 with 3.0.0-rc6 with above patch.
>>
>> I don't see this patch in scsi-misc-2.6 or linus' tree. Is there a
>> different patch that fixes the
>> issue?
>
> It should be fixed by
>
> commit 777eb1bf15b8532c396821774bf6451e563438f5
> Author: Hannes Reinecke<hare@suse.de>
> Date:   Wed Sep 28 08:07:01 2011 -0600
>
>      block: Free queue resources at blk_release_queue()
>
As much as I've hate to admit it, but it looks as if this is only a 
fix for the second part of the original patch.
I've got reports that we still see crashes, which are fixed by the 
patch to scsi_lib.c.

So please include this part.
Do you need a resend?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2011-10-21 13:26 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-01 17:05 Linux 3.0 oopses when pulling a USB CDROM Andi Kleen
2011-07-01 18:14 ` Dave Jones
2011-07-01 18:32   ` Andi Kleen
2011-07-01 18:40     ` Dave Jones
2011-07-02 15:13     ` Christoph Fritz
2011-07-01 20:29   ` James Bottomley
2011-07-01 20:43     ` [PATCH] USB: fix regression occurring during device removal Alan Stern
2011-07-01 20:43       ` Alan Stern
2011-07-01 21:04       ` Andi Kleen
2011-07-01 21:04     ` Linux 3.0 oopses when pulling a USB CDROM Alan Stern
2011-07-01 21:04       ` Alan Stern
2011-07-01 21:13       ` James Bottomley
2011-07-02  2:03         ` Alan Stern
2011-07-02  2:03           ` Alan Stern
2011-07-02  6:08           ` Andi Kleen
2011-07-02 12:24             ` James Bottomley
2011-07-02 17:05               ` Andi Kleen
2011-07-02 17:09                 ` James Bottomley
2011-07-02 17:09                   ` James Bottomley
2011-07-02 18:15                   ` Andi Kleen
2011-07-02 18:15                     ` Andi Kleen
2011-07-02 20:05                     ` Alan Stern
2011-07-02 20:05                       ` Alan Stern
2011-07-03  1:16                       ` Andi Kleen
2011-07-03  1:16                         ` Andi Kleen
2011-07-03 15:29                         ` Alan Stern
2011-07-03 15:29                           ` Alan Stern
2011-07-03 16:06                           ` Alan Stern
2011-07-03 16:06                             ` Alan Stern
2011-07-02 17:37               ` Alan Stern
2011-07-02 17:37                 ` Alan Stern
2011-07-02 18:11                 ` Andi Kleen
2011-07-02 19:59                   ` Alan Stern
2011-07-03  1:17                     ` Andi Kleen
2011-07-07 20:47                     ` solved was " Andi Kleen
2011-07-18 16:59                     ` Dan Williams
2011-07-18 18:00                       ` Andi Kleen
2011-07-20  9:58                       ` Jack Wang
2011-07-20  9:58                         ` Jack Wang
2011-10-18 21:16                         ` Ankit Jain
2011-10-18 21:30                           ` James Bottomley
2011-10-21 13:26                             ` Hannes Reinecke
2011-07-03  9:14                 ` Dan Williams
2011-07-03 18:16                   ` Andi Kleen
2011-07-03 20:37                 ` Stefan Richter
2011-07-08 13:37                   ` Stefan Richter
2011-07-08 13:41                     ` Stefan Richter
2011-07-08 13:41                       ` Stefan Richter
2011-07-04 11:27                 ` Heiko Carstens
2011-07-04 11:27                   ` Heiko Carstens
2011-07-04 16:04                   ` Alan Stern
2011-07-06  6:50                     ` Heiko Carstens
2011-07-12 18:49                     ` Jonathan McDowell
2011-07-02 12:38             ` Alan Stern
2011-07-02 12:38               ` Alan Stern
2011-07-02 18:10               ` Andi Kleen
2011-07-02 18:10                 ` Andi Kleen
2011-07-02 12:48             ` Rafael J. Wysocki
2011-07-02 12:48               ` Rafael J. Wysocki
2011-07-02 17:06               ` Andi Kleen
2011-07-01 19:20 ` James Bottomley
2011-07-01 19:33   ` James Bottomley
2011-07-01 19:45     ` James Bottomley

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.