* Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-01 17:05 Andi Kleen 2011-07-01 18:14 ` Dave Jones 2011-07-01 19:20 ` James Bottomley 0 siblings, 2 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-01 17:05 UTC (permalink / raw) To: linux-scsi, linux-kernel; +Cc: axboe, rjw Hi, I found I can reliably crash a 3.0 system by pulling the USB cable of a mounted USB cdrom (or rather a USB device which has a builtin fake CD-ROM) I suspect it's a regression too. It ends with a NULL pointer reference on a NULL sdev in scsi_prep_state_check. Here's a somewhat incomplete backtrace (written down by hand) scsi_prep_state_check scsi_setup_blk_pc_cmnd blk_peek_request ... scsi_request_fn ... ioctl_internal_command ... scsi_set_medium_removal sr_lock_door cdrom_release ... umount I tried adding a if (!sdev) return BLKPREP_KILL; to scsi_prep_state_check, but that caused a RCU CPU stall and a generally unhappy system instead. The sdev must be still there in scsi_set_medium_removal because it's referenced, so it must get lost somewhere in SCSI or in the block layer. Any ideas how to fix this? -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 17:05 Linux 3.0 oopses when pulling a USB CDROM Andi Kleen @ 2011-07-01 18:14 ` Dave Jones 2011-07-01 18:32 ` Andi Kleen 2011-07-01 20:29 ` James Bottomley 2011-07-01 19:20 ` James Bottomley 1 sibling, 2 replies; 63+ messages in thread From: Dave Jones @ 2011-07-01 18:14 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > I found I can reliably crash a 3.0 system by pulling the > USB cable of a mounted USB cdrom (or rather a USB device which > has a builtin fake CD-ROM) > > I suspect it's a regression too. We've been seeing a lot of similar bugs in Fedora since we pushed a 2.6.38.8 update. Some of the traces are different, but some look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) The common cause seems to be 'device went away'. So USB CD drives, USB memory sticks, and for some reason virtualbox shutdown. Dave ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 18:14 ` Dave Jones @ 2011-07-01 18:32 ` Andi Kleen 2011-07-01 18:40 ` Dave Jones 2011-07-02 15:13 ` Christoph Fritz 2011-07-01 20:29 ` James Bottomley 1 sibling, 2 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-01 18:32 UTC (permalink / raw) To: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw On Fri, Jul 01, 2011 at 02:14:06PM -0400, Dave Jones wrote: > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > I found I can reliably crash a 3.0 system by pulling the > > USB cable of a mounted USB cdrom (or rather a USB device which > > has a builtin fake CD-ROM) > > > > I suspect it's a regression too. > > We've been seeing a lot of similar bugs in Fedora since we pushed > a 2.6.38.8 update. Some of the traces are different, but some I've also seen one with an earlier kernel where the NULL reference was in elv_.* something. Don't have a written down backtrace for this, but perhaps you have. -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 18:32 ` Andi Kleen @ 2011-07-01 18:40 ` Dave Jones 2011-07-02 15:13 ` Christoph Fritz 1 sibling, 0 replies; 63+ messages in thread From: Dave Jones @ 2011-07-01 18:40 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw On Fri, Jul 01, 2011 at 08:32:27PM +0200, Andi Kleen wrote: > On Fri, Jul 01, 2011 at 02:14:06PM -0400, Dave Jones wrote: > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > I found I can reliably crash a 3.0 system by pulling the > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > has a builtin fake CD-ROM) > > > > > > I suspect it's a regression too. > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > a 2.6.38.8 update. Some of the traces are different, but some > > I've also seen one with an earlier kernel where the NULL > reference was in elv_.* something. Don't have a written down backtrace > for this, but perhaps you have. We had a bazillion reports of elv_may_queue oopses that looked like https://bugzilla.redhat.com/show_bug.cgi?id=elv_may_queue That should be fixed in .39 and newer. Dave ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 18:32 ` Andi Kleen 2011-07-01 18:40 ` Dave Jones @ 2011-07-02 15:13 ` Christoph Fritz 1 sibling, 0 replies; 63+ messages in thread From: Christoph Fritz @ 2011-07-02 15:13 UTC (permalink / raw) To: Andi Kleen; +Cc: Dave Jones, linux-scsi, linux-kernel, axboe, rjw On Fri, Jul 01, 2011 at 08:32:27PM +0200, Andi Kleen wrote: > On Fri, Jul 01, 2011 at 02:14:06PM -0400, Dave Jones wrote: > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > I found I can reliably crash a 3.0 system by pulling the > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > has a builtin fake CD-ROM) > > > > > > I suspect it's a regression too. > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > a 2.6.38.8 update. Some of the traces are different, but some > > I've also seen one with an earlier kernel where the NULL > reference was in elv_.* something. Don't have a written down backtrace > for this, but perhaps you have. Do you mean https://bugzilla.kernel.org/show_bug.cgi?id=38412 ? [..] Process eject (pid: 5846, ti=f5b9a000 task=f5b19a80 task.ti=f5b9a000) Stack: c1198600 c119ac69 00000256 00000000 00000002 c119b2f9 f4eed270 c11a3237 00000000 f5297200 f637e418 00005309 0000005d c11a3378 0000005d 00000000 f5c1b700 00000000 f5c02e00 00000000 f4a2e1f4 f59c0700 c10d1bc3 f53ed001 Call Trace: [<c1198600>] ? elv_put_request+0x10/0x20 [<c119ac69>] ? __blk_put_request+0x99/0xc0 [<c119b2f9>] ? blk_put_request+0x19/0x40 [<c11a3237>] ? T.593+0x77/0x90 [<c11a3378>] ? scsi_cmd_ioctl+0x128/0x440 [<c10d1bc3>] ? blkdev_get+0x43/0x380 [<c10d1000>] ? bd_acquire+0x20/0xd0 [<c10a2a6d>] ? __dentry_open+0x1cd/0x2a0 [<c10a2c30>] ? nameidata_to_filp+0x70/0x80 [<c10d1f00>] ? blkdev_get+0x380/0x380 [<c125e582>] ? sd_ioctl+0x82/0xc0 [<c125e500>] ? sd_check_events+0x140/0x140 [<c11a02d9>] ? __blkdev_driver_ioctl+0x29/0x40 [<c11a07d3>] ? blkdev_ioctl+0x213/0x7e0 [<c10b1854>] ? do_filp_open+0x34/0x90 [<c10d0715>] ? block_ioctl+0x35/0x50 [<c10d06e0>] ? blkdev_get_block+0x50/0x50 [<c10b2b92>] ? do_vfs_ioctl+0x92/0x610 [<c10a262e>] ? do_sys_open+0x16e/0x1c0 [<c10a262e>] ? do_sys_open+0x16e/0x1c0 [<c10b314d>] ? sys_ioctl+0x3d/0x70 [<c1394ad0>] ? sysenter_do_call+0x12/0x26 Code: Bad EIP value. EIP: [<2f327473>] 0x2f327473 SS:ESP 0068:f5b9bdb0 CR2: 000000002f327473 ---[ end trace 4877fe842f054f84 ]--- Thanks, -- Christoph ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 18:14 ` Dave Jones 2011-07-01 18:32 ` Andi Kleen @ 2011-07-01 20:29 ` James Bottomley 2011-07-01 20:43 ` Alan Stern 2011-07-01 21:04 ` Alan Stern 1 sibling, 2 replies; 63+ messages in thread From: James Bottomley @ 2011-07-01 20:29 UTC (permalink / raw) To: Dave Jones; +Cc: Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > I found I can reliably crash a 3.0 system by pulling the > > USB cable of a mounted USB cdrom (or rather a USB device which > > has a builtin fake CD-ROM) > > > > I suspect it's a regression too. > > We've been seeing a lot of similar bugs in Fedora since we pushed > a 2.6.38.8 update. Some of the traces are different, but some > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > The common cause seems to be 'device went away'. So USB CD drives, > USB memory sticks, and for some reason virtualbox shutdown. I think it's something specific in the USB path. I can't reproduce on 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. James ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH] USB: fix regression occurring during device removal 2011-07-01 20:29 ` James Bottomley @ 2011-07-01 20:43 ` Alan Stern 2011-07-01 21:04 ` Alan Stern 1 sibling, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-01 20:43 UTC (permalink / raw) To: Greg KH Cc: Sarah Sharp, James Bottomley, Dave Jones, Andi Kleen, SCSI development list, Kernel development list, axboe, Rafael J. Wysocki, USB list This patch (as1476) fixes a regression introduced by fccf4e86200b8f5edd9a65da26f150e32ba79808 (USB: Free bandwidth when usb_disable_device is called). usb_disconnect() grabs the bandwidth_mutex before calling usb_disable_device(), which calls down indirectly to usb_set_interface(), which tries to acquire the bandwidth_mutex. The fix causes usb_set_interface() to return early when it is called for an interface that has already been unregistered, which is what happens in usb_disable_device(). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> CC: <stable@kernel.org> --- On Fri, 1 Jul 2011, James Bottomley wrote: > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > I found I can reliably crash a 3.0 system by pulling the > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > has a builtin fake CD-ROM) > > > > > > I suspect it's a regression too. > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > a 2.6.38.8 update. Some of the traces are different, but some > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > The common cause seems to be 'device went away'. So USB CD drives, > > USB memory sticks, and for some reason virtualbox shutdown. > > I think it's something specific in the USB path. I can't reproduce on > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. > > James This should fix the problem. Or at least, it does fix a very similar problem. Alan Stern drivers/usb/core/message.c | 2 ++ 1 file changed, 2 insertions(+) Index: usb-3.0/drivers/usb/core/message.c =================================================================== --- usb-3.0.orig/drivers/usb/core/message.c +++ usb-3.0/drivers/usb/core/message.c @@ -1286,6 +1286,8 @@ int usb_set_interface(struct usb_device interface); return -EINVAL; } + if (iface->unregistering) + return -ENODEV; alt = usb_altnum_to_altsetting(iface, alternate); if (!alt) { ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH] USB: fix regression occurring during device removal @ 2011-07-01 20:43 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-01 20:43 UTC (permalink / raw) To: Greg KH Cc: Sarah Sharp, James Bottomley, Dave Jones, Andi Kleen, SCSI development list, Kernel development list, axboe, Rafael J. Wysocki, USB list This patch (as1476) fixes a regression introduced by fccf4e86200b8f5edd9a65da26f150e32ba79808 (USB: Free bandwidth when usb_disable_device is called). usb_disconnect() grabs the bandwidth_mutex before calling usb_disable_device(), which calls down indirectly to usb_set_interface(), which tries to acquire the bandwidth_mutex. The fix causes usb_set_interface() to return early when it is called for an interface that has already been unregistered, which is what happens in usb_disable_device(). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> CC: <stable@kernel.org> --- On Fri, 1 Jul 2011, James Bottomley wrote: > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > I found I can reliably crash a 3.0 system by pulling the > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > has a builtin fake CD-ROM) > > > > > > I suspect it's a regression too. > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > a 2.6.38.8 update. Some of the traces are different, but some > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > The common cause seems to be 'device went away'. So USB CD drives, > > USB memory sticks, and for some reason virtualbox shutdown. > > I think it's something specific in the USB path. I can't reproduce on > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. > > James This should fix the problem. Or at least, it does fix a very similar problem. Alan Stern drivers/usb/core/message.c | 2 ++ 1 file changed, 2 insertions(+) Index: usb-3.0/drivers/usb/core/message.c =================================================================== --- usb-3.0.orig/drivers/usb/core/message.c +++ usb-3.0/drivers/usb/core/message.c @@ -1286,6 +1286,8 @@ int usb_set_interface(struct usb_device interface); return -EINVAL; } + if (iface->unregistering) + return -ENODEV; alt = usb_altnum_to_altsetting(iface, alternate); if (!alt) { ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] USB: fix regression occurring during device removal 2011-07-01 20:43 ` Alan Stern (?) @ 2011-07-01 21:04 ` Andi Kleen -1 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-01 21:04 UTC (permalink / raw) To: Alan Stern Cc: Greg KH, Sarah Sharp, James Bottomley, Dave Jones, Andi Kleen, SCSI development list, Kernel development list, axboe, Rafael J. Wysocki, USB list > This should fix the problem. Or at least, it does fix a very similar > problem. Thanks. I tried it and it doesn't fix the problem unfortunately. -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 20:29 ` James Bottomley @ 2011-07-01 21:04 ` Alan Stern 2011-07-01 21:04 ` Alan Stern 1 sibling, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-01 21:04 UTC (permalink / raw) To: James Bottomley Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Fri, 1 Jul 2011, James Bottomley wrote: > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > I found I can reliably crash a 3.0 system by pulling the > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > has a builtin fake CD-ROM) > > > > > > I suspect it's a regression too. > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > a 2.6.38.8 update. Some of the traces are different, but some > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > The common cause seems to be 'device went away'. So USB CD drives, > > USB memory sticks, and for some reason virtualbox shutdown. > > I think it's something specific in the USB path. I can't reproduce on > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. I just took a look at the Red Hat bugzilla entry mentioned above. It seems to be quite different from the issue addressed by the patch I just posted -- a crash with invalid memory access rather than a lockdep violation and hang of the khubd thread. It's also notable that the stack dump in the bugzilla report doesn't contain any functions in the USB subsystem. Of course this doesn't prove anything, but it is suggestive. Evidently the sdev argument to scsi_prep_state_check() was NULL. This looks like the problem that cropped up before, where q->queuedata was getting set to NULL while the queue was still in use. I can't imagine how anything in usb-storage could have caused that. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-01 21:04 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-01 21:04 UTC (permalink / raw) To: James Bottomley Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Fri, 1 Jul 2011, James Bottomley wrote: > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > I found I can reliably crash a 3.0 system by pulling the > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > has a builtin fake CD-ROM) > > > > > > I suspect it's a regression too. > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > a 2.6.38.8 update. Some of the traces are different, but some > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > The common cause seems to be 'device went away'. So USB CD drives, > > USB memory sticks, and for some reason virtualbox shutdown. > > I think it's something specific in the USB path. I can't reproduce on > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. I just took a look at the Red Hat bugzilla entry mentioned above. It seems to be quite different from the issue addressed by the patch I just posted -- a crash with invalid memory access rather than a lockdep violation and hang of the khubd thread. It's also notable that the stack dump in the bugzilla report doesn't contain any functions in the USB subsystem. Of course this doesn't prove anything, but it is suggestive. Evidently the sdev argument to scsi_prep_state_check() was NULL. This looks like the problem that cropped up before, where q->queuedata was getting set to NULL while the queue was still in use. I can't imagine how anything in usb-storage could have caused that. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 21:04 ` Alan Stern (?) @ 2011-07-01 21:13 ` James Bottomley 2011-07-02 2:03 ` Alan Stern -1 siblings, 1 reply; 63+ messages in thread From: James Bottomley @ 2011-07-01 21:13 UTC (permalink / raw) To: Alan Stern Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Fri, 2011-07-01 at 17:04 -0400, Alan Stern wrote: > On Fri, 1 Jul 2011, James Bottomley wrote: > > > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > > > I found I can reliably crash a 3.0 system by pulling the > > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > > has a builtin fake CD-ROM) > > > > > > > > I suspect it's a regression too. > > > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > > a 2.6.38.8 update. Some of the traces are different, but some > > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > > > The common cause seems to be 'device went away'. So USB CD drives, > > > USB memory sticks, and for some reason virtualbox shutdown. > > > > I think it's something specific in the USB path. I can't reproduce on > > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. > > I just took a look at the Red Hat bugzilla entry mentioned above. It > seems to be quite different from the issue addressed by the patch I > just posted -- a crash with invalid memory access rather than a lockdep > violation and hang of the khubd thread. > > It's also notable that the stack dump in the bugzilla report doesn't > contain any functions in the USB subsystem. Of course this doesn't > prove anything, but it is suggestive. > > Evidently the sdev argument to scsi_prep_state_check() was NULL. This > looks like the problem that cropped up before, where q->queuedata was > getting set to NULL while the queue was still in use. I can't imagine > how anything in usb-storage could have caused that. Right ... the device release function has already been called, so it's some type of refcounting cockup. The fact that I can't reproduce with a SATA unplug is what makes me think it might be USB specific ... of course, that's not definitive. James ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 21:13 ` James Bottomley @ 2011-07-02 2:03 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 2:03 UTC (permalink / raw) To: James Bottomley Cc: Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Fri, 1 Jul 2011, James Bottomley wrote: > On Fri, 2011-07-01 at 17:04 -0400, Alan Stern wrote: > > On Fri, 1 Jul 2011, James Bottomley wrote: > > > > > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > > > > > I found I can reliably crash a 3.0 system by pulling the > > > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > > > has a builtin fake CD-ROM) > > > > > > > > > > I suspect it's a regression too. > > > > > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > > > a 2.6.38.8 update. Some of the traces are different, but some > > > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > > > > > The common cause seems to be 'device went away'. So USB CD drives, > > > > USB memory sticks, and for some reason virtualbox shutdown. > > > > > > I think it's something specific in the USB path. I can't reproduce on > > > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. > > > > I just took a look at the Red Hat bugzilla entry mentioned above. It > > seems to be quite different from the issue addressed by the patch I > > just posted -- a crash with invalid memory access rather than a lockdep > > violation and hang of the khubd thread. > > > > It's also notable that the stack dump in the bugzilla report doesn't > > contain any functions in the USB subsystem. Of course this doesn't > > prove anything, but it is suggestive. > > > > Evidently the sdev argument to scsi_prep_state_check() was NULL. This > > looks like the problem that cropped up before, where q->queuedata was > > getting set to NULL while the queue was still in use. I can't imagine > > how anything in usb-storage could have caused that. > > Right ... the device release function has already been called, so it's > some type of refcounting cockup. The fact that I can't reproduce with a > SATA unplug is what makes me think it might be USB specific ... of > course, that's not definitive. I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody give the exact sequence of steps you went through to trigger the bug? Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 2:03 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 2:03 UTC (permalink / raw) To: James Bottomley Cc: Dave Jones, Andi Kleen, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0, linux-usb-u79uwXL29TY76Z2rM5mHXA On Fri, 1 Jul 2011, James Bottomley wrote: > On Fri, 2011-07-01 at 17:04 -0400, Alan Stern wrote: > > On Fri, 1 Jul 2011, James Bottomley wrote: > > > > > On Fri, 2011-07-01 at 14:14 -0400, Dave Jones wrote: > > > > On Fri, Jul 01, 2011 at 10:05:31AM -0700, Andi Kleen wrote: > > > > > > > > > I found I can reliably crash a 3.0 system by pulling the > > > > > USB cable of a mounted USB cdrom (or rather a USB device which > > > > > has a builtin fake CD-ROM) > > > > > > > > > > I suspect it's a regression too. > > > > > > > > We've been seeing a lot of similar bugs in Fedora since we pushed > > > > a 2.6.38.8 update. Some of the traces are different, but some > > > > look to be the same as yours. (here's one for eg: https://bugzilla.redhat.com/show_bug.cgi?id=712830) > > > > > > > > The common cause seems to be 'device went away'. So USB CD drives, > > > > USB memory sticks, and for some reason virtualbox shutdown. > > > > > > I think it's something specific in the USB path. I can't reproduce on > > > 3.0-rc5 with a SATA DVD hot unplug. USB cc's added. > > > > I just took a look at the Red Hat bugzilla entry mentioned above. It > > seems to be quite different from the issue addressed by the patch I > > just posted -- a crash with invalid memory access rather than a lockdep > > violation and hang of the khubd thread. > > > > It's also notable that the stack dump in the bugzilla report doesn't > > contain any functions in the USB subsystem. Of course this doesn't > > prove anything, but it is suggestive. > > > > Evidently the sdev argument to scsi_prep_state_check() was NULL. This > > looks like the problem that cropped up before, where q->queuedata was > > getting set to NULL while the queue was still in use. I can't imagine > > how anything in usb-storage could have caused that. > > Right ... the device release function has already been called, so it's > some type of refcounting cockup. The fact that I can't reproduce with a > SATA unplug is what makes me think it might be USB specific ... of > course, that's not definitive. I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody give the exact sequence of steps you went through to trigger the bug? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 2:03 ` Alan Stern (?) @ 2011-07-02 6:08 ` Andi Kleen 2011-07-02 12:24 ` James Bottomley ` (2 more replies) -1 siblings, 3 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-02 6:08 UTC (permalink / raw) To: Alan Stern Cc: James Bottomley, Dave Jones, Andi Kleen, linux-scsi, linux-kernel, axboe, rjw, linux-usb > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > give the exact sequence of steps you went through to trigger the bug? Connect USB storage device with builtin fake CD rom. Wait for udisk to mount it. Pull cable. udisk does umount. Oops. I also got a log of the refcounting now if you want it. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 6:08 ` Andi Kleen @ 2011-07-02 12:24 ` James Bottomley 2011-07-02 17:05 ` Andi Kleen 2011-07-02 17:37 ` Alan Stern 2011-07-02 12:38 ` Alan Stern 2011-07-02 12:48 ` Rafael J. Wysocki 2 siblings, 2 replies; 63+ messages in thread From: James Bottomley @ 2011-07-02 12:24 UTC (permalink / raw) To: Andi Kleen Cc: Alan Stern, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, 2011-07-02 at 08:08 +0200, Andi Kleen wrote: > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > give the exact sequence of steps you went through to trigger the bug? > > Connect USB storage device with builtin fake CD rom. Wait for udisk > to mount it. Pull cable. udisk does umount. Oops. > > I also got a log of the refcounting now if you want it. So I've got the log, but this is the relevant section: --- usb 2-1.5: USB disconnect, device number 4 sr 5:0:0:1: scsi put_device 13 from device_del+0x177/0x1c0 sr 5:0:0:1: scsi put_device 12 from bsg_kref_release_function+0x28/0x30 sr 5:0:0:1: scsi put_device 10 from device_del+0x177/0x1c0 sr 5:0:0:1: scsi put_device 8 from device_del+0x177/0x1c0 sr 5:0:0:1: scsi put_device 7 from scsi_device_cls_release+0x15/0x20 sr 5:0:0:1: scsi put_device 6 from klist_children_put+0x12/0x20 sr 5:0:0:1: scsi put_device 5 from klist_devices_put+0x12/0x20 sr 5:0:0:1: scsi put_device 3 from device_del+0x177/0x1c0 scsi: killing requests for dead queue BUG: sleeping function called from invalid context at /home/ak/lsrc/git/linux-2.6/arch/x86/mm/fault.c:1103 in_atomic(): 0, irqs_disabled(): 1, pid: 2527, name: umount Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8 Call Trace: [<ffffffff8103af8c>] __might_sleep+0xcc/0xf0 [<ffffffff8155af42>] do_page_fault+0x142/0x4c0 [<ffffffffa01d5385>] ? write_msg+0x105/0x120 [netconsole] [<ffffffff810514f7>] ? __call_console_drivers+0x97/0xb0 [<ffffffff81079692>] ? up+0x32/0x50 [<ffffffff81557f5f>] page_fault+0x1f/0x30 [<ffffffff81389a70>] ? scsi_setup_blk_pc_cmnd+0x170/0x170 [<ffffffff81388e19>] ? scsi_prep_state_check+0x9/0x90 [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170 [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60 [<ffffffff812847ad>] blk_peek_request+0xbd/0x230 [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470 [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20 [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0 [<ffffffff81288676>] blk_execute_rq+0x86/0xf0 [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0 [<ffffffff81389c6c>] scsi_execute+0xfc/0x160 [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140 [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0 [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30 [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0 [<ffffffff8139b390>] sr_lock_door+0x20/0x30 [<ffffffff813c4d63>] cdrom_release+0xa3/0x260 [<ffffffff8118157e>] ? invalidate_bh_lru+0x2e/0x50 [<ffffffff81181550>] ? buffer_cpu_notify+0xa0/0xa0 [<ffffffff8139a088>] sr_block_release+0x38/0x60 [<ffffffff8118833c>] __blkdev_put+0x16c/0x1b0 [<ffffffff811883b2>] blkdev_put+0x32/0x130 [<ffffffff8115650e>] kill_block_super+0x4e/0x80 [<ffffffff81156865>] deactivate_locked_super+0x45/0x70 [<ffffffff8115724a>] deactivate_super+0x4a/0x70 [<ffffffff81171d54>] mntput_no_expire+0xc4/0x110 [<ffffffff81172a2c>] sys_umount+0x6c/0x360 [<ffffffff8155f52b>] system_call_fastpath+0x16/0x1b BUG: unable to handle kernel NULL pointer dereference at 0000000000000650 IP: [<ffffffff81388e19>] scsi_prep_state_check+0x9/0x90 PGD 0 Oops: 0000 [#1] SMP CPU 2 Modules linked in: nls_utf8 udf ses enclosure netconsole configfs fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 kvm_intel kvm uinput snd_hda_codec_hdmi snd_hda_codec_realtek snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore iTCO_wdt snd_page_alloc iTCO_vendor_support joydev i7core_edac edac_core broadcom tg3 e1000 dcdbas microcode serio_raw pcspkr i2c_i801 firewire_ohci firewire_core crc_itu_t usb_storage radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8 Dell Inc. Studio XPS 8000/0X231R RIP: 0010:[<ffffffff81388e19>] [<ffffffff81388e19>] scsi_prep_state_check+0x9/0x90 RSP: 0018:ffff88021b3859c8 EFLAGS: 00010086 RAX: ffffffff81389a70 RBX: ffff88022d2c85a0 RCX: 0000000000001fa7 RDX: 0000000000000001 RSI: ffff88022d2c85a0 RDI: 0000000000000000 RBP: ffff88021b3859c8 R08: 0000000000000004 R09: 0000000000000002 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88022e0a9428 R13: ffff88022d2c85a0 R14: 0000000000000000 R15: ffff88022e404d20 FS: 00007f2d10454760(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000650 CR3: 0000000214443000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process umount (pid: 2527, threadinfo ffff88021b384000, task ffff88022e02dc80) Stack: ffff88021b3859f8 ffffffff8138992b ffff88022d2c85a0 ffff88022e0a9428 ffff88022d2c85a0 ffff88021b385cf8 ffff88021b385a18 ffffffff81389abd ffff88022d2c85a0 ffff88022e0a9428 ffff88021b385a48 ffffffff812847ad Call Trace: [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170 [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60 [<ffffffff812847ad>] blk_peek_request+0xbd/0x230 [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470 [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20 [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0 [<ffffffff81288676>] blk_execute_rq+0x86/0xf0 [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0 [<ffffffff81389c6c>] scsi_execute+0xfc/0x160 [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140 [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0 [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30 [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0 [<ffffffff813c4d63>] cdrom_release+0xa3/0x260 [<ffffffff8118157e>] ? invalidate_bh_lru+0x2e/0x50 [<ffffffff81181550>] ? buffer_cpu_notify+0xa0/0xa0 [<ffffffff8139a088>] sr_block_release+0x38/0x60 [<ffffffff8118833c>] __blkdev_put+0x16c/0x1b0 [<ffffffff811883b2>] blkdev_put+0x32/0x130 [<ffffffff8115650e>] kill_block_super+0x4e/0x80 [<ffffffff81156865>] deactivate_locked_super+0x45/0x70 [<ffffffff8115724a>] deactivate_super+0x4a/0x70 [<ffffffff81171d54>] mntput_no_expire+0xc4/0x110 [<ffffffff81172a2c>] sys_umount+0x6c/0x360 [<ffffffff8155f52b>] system_call_fastpath+0x16/0x1b Code: 7b 58 e8 4b ea 1c 00 48 8b 4d a8 48 89 45 b8 48 89 cf e8 7b 88 ff ff eb a1 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 <8b> 87 50 06 00 00 83 f8 02 75 04 31 c0 c9 c3 83 e8 04 83 f8 04 RIP [<ffffffff81388e19>] scsi_prep_state_check+0x9/0x90 RSP <ffff88021b3859c8> CR2: 0000000000000650 ---[ end trace 06d5981e67b7b7c9 ]--- --- Which goes from the device unplug to the oops. However, there are puts missing from this; particularly the one where the reference goes to zero. James ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 12:24 ` James Bottomley @ 2011-07-02 17:05 ` Andi Kleen 2011-07-02 17:09 ` James Bottomley 2011-07-02 17:37 ` Alan Stern 1 sibling, 1 reply; 63+ messages in thread From: Andi Kleen @ 2011-07-02 17:05 UTC (permalink / raw) To: James Bottomley Cc: Andi Kleen, Alan Stern, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb > Which goes from the device unplug to the oops. However, there are puts > missing from this; particularly the one where the reference goes to > zero. The count was always one off because it dumps it before the actual put. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 17:09 ` James Bottomley 0 siblings, 0 replies; 63+ messages in thread From: James Bottomley @ 2011-07-02 17:09 UTC (permalink / raw) To: Andi Kleen Cc: Alan Stern, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, 2011-07-02 at 19:05 +0200, Andi Kleen wrote: > > Which goes from the device unplug to the oops. However, there are puts > > missing from this; particularly the one where the reference goes to > > zero. > > The count was always one off because it dumps it before the actual > put. I know this. What I mean is the last seen put count before the oops is 3 (not 1). And the transition after unplug goes 13->12->10->8->7->6->5->3 implying we're missing put events. James ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 17:09 ` James Bottomley 0 siblings, 0 replies; 63+ messages in thread From: James Bottomley @ 2011-07-02 17:09 UTC (permalink / raw) To: Andi Kleen Cc: Alan Stern, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0, linux-usb-u79uwXL29TY76Z2rM5mHXA On Sat, 2011-07-02 at 19:05 +0200, Andi Kleen wrote: > > Which goes from the device unplug to the oops. However, there are puts > > missing from this; particularly the one where the reference goes to > > zero. > > The count was always one off because it dumps it before the actual > put. I know this. What I mean is the last seen put count before the oops is 3 (not 1). And the transition after unplug goes 13->12->10->8->7->6->5->3 implying we're missing put events. James -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 17:09 ` James Bottomley @ 2011-07-02 18:15 ` Andi Kleen -1 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-02 18:15 UTC (permalink / raw) To: James Bottomley Cc: Andi Kleen, Alan Stern, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb > And the transition after unplug goes > > 13->12->10->8->7->6->5->3 > > implying we're missing put events. Don't know why. I used this check (with trace_scsi_dev containing the name) in put_device: if (trace_scsi && dev->type == &scsi_dev_type && trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) { dev_printk ... If you want instrumentation somewhere else please let me know. Maybe there simply isn't a put_device to 0 and the NULL pointer happens for some other reason? -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 18:15 ` Andi Kleen 0 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-02 18:15 UTC (permalink / raw) To: James Bottomley Cc: Andi Kleen, Alan Stern, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0, linux-usb-u79uwXL29TY76Z2rM5mHXA > And the transition after unplug goes > > 13->12->10->8->7->6->5->3 > > implying we're missing put events. Don't know why. I used this check (with trace_scsi_dev containing the name) in put_device: if (trace_scsi && dev->type == &scsi_dev_type && trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) { dev_printk ... If you want instrumentation somewhere else please let me know. Maybe there simply isn't a put_device to 0 and the NULL pointer happens for some other reason? -Andi -- ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 18:15 ` Andi Kleen @ 2011-07-02 20:05 ` Alan Stern -1 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 20:05 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, 2 Jul 2011, Andi Kleen wrote: > > And the transition after unplug goes > > > > 13->12->10->8->7->6->5->3 > > > > implying we're missing put events. > > Don't know why. I used this check (with trace_scsi_dev containing the name) > in put_device: > > if (trace_scsi && dev->type == &scsi_dev_type && > trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) { > dev_printk ... These extra puts could be coming from deep inside the device-model core, where they operate directly on the underlying kobject instead of going through put_device(). For example, removal of sysfs symlinks could have this effect. > If you want instrumentation somewhere else please let me know. > > Maybe there simply isn't a put_device to 0 and the NULL pointer happens > for some other reason? On my system, at least, the scsi_device's refcount dropped to 0 at the right time. That wasn't the problem. The NULL pointer occurs because the request_queue is used after the scsi_device has been removed from visibility; among other things, __scsi_remove_device() sets sdev->request_queue->queuedata to NULL. As the comment says, this causes the request function to reject all I/O requests -- but not before trying to dereference the NULL pointer! Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 20:05 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 20:05 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, 2 Jul 2011, Andi Kleen wrote: > > And the transition after unplug goes > > > > 13->12->10->8->7->6->5->3 > > > > implying we're missing put events. > > Don't know why. I used this check (with trace_scsi_dev containing the name) > in put_device: > > if (trace_scsi && dev->type == &scsi_dev_type && > trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) { > dev_printk ... These extra puts could be coming from deep inside the device-model core, where they operate directly on the underlying kobject instead of going through put_device(). For example, removal of sysfs symlinks could have this effect. > If you want instrumentation somewhere else please let me know. > > Maybe there simply isn't a put_device to 0 and the NULL pointer happens > for some other reason? On my system, at least, the scsi_device's refcount dropped to 0 at the right time. That wasn't the problem. The NULL pointer occurs because the request_queue is used after the scsi_device has been removed from visibility; among other things, __scsi_remove_device() sets sdev->request_queue->queuedata to NULL. As the comment says, this causes the request function to reject all I/O requests -- but not before trying to dereference the NULL pointer! Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-03 1:16 ` Andi Kleen 0 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-03 1:16 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb > > if (trace_scsi && dev->type == &scsi_dev_type && > > trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) { > > dev_printk ... > > These extra puts could be coming from deep inside the device-model > core, where they operate directly on the underlying kobject instead of > going through put_device(). For example, removal of sysfs symlinks > could have this effect. Ok I could move it to kobject with some complications (need a real back trace then) > On my system, at least, the scsi_device's refcount dropped to 0 at the > right time. That wasn't the problem. The NULL pointer occurs because > the request_queue is used after the scsi_device has been removed from > visibility; among other things, __scsi_remove_device() sets > sdev->request_queue->queuedata to NULL. > > As the comment says, this causes the request function to reject all I/O > requests -- but not before trying to dereference the NULL pointer! Your explanation completely contradicts what James wrote earlier. Maybe it's good if you guys come up with a common avenue of debugging before I try further. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-03 1:16 ` Andi Kleen 0 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-03 1:16 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, James Bottomley, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0, linux-usb-u79uwXL29TY76Z2rM5mHXA > > if (trace_scsi && dev->type == &scsi_dev_type && > > trace_scsi_dev[0] == 0 || !strcmp(dev_name(dev), trace_scsi_dev))) { > > dev_printk ... > > These extra puts could be coming from deep inside the device-model > core, where they operate directly on the underlying kobject instead of > going through put_device(). For example, removal of sysfs symlinks > could have this effect. Ok I could move it to kobject with some complications (need a real back trace then) > On my system, at least, the scsi_device's refcount dropped to 0 at the > right time. That wasn't the problem. The NULL pointer occurs because > the request_queue is used after the scsi_device has been removed from > visibility; among other things, __scsi_remove_device() sets > sdev->request_queue->queuedata to NULL. > > As the comment says, this causes the request function to reject all I/O > requests -- but not before trying to dereference the NULL pointer! Your explanation completely contradicts what James wrote earlier. Maybe it's good if you guys come up with a common avenue of debugging before I try further. -Andi -- ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-03 15:29 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-03 15:29 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sun, 3 Jul 2011, Andi Kleen wrote: > > On my system, at least, the scsi_device's refcount dropped to 0 at the > > right time. That wasn't the problem. The NULL pointer occurs because > > the request_queue is used after the scsi_device has been removed from > > visibility; among other things, __scsi_remove_device() sets > > sdev->request_queue->queuedata to NULL. > > > > As the comment says, this causes the request function to reject all I/O > > requests -- but not before trying to dereference the NULL pointer! > > Your explanation completely contradicts what James wrote earlier. Yes, it does. I'm claiming that James was on a wrong track or looking in the wrong direction. > Maybe it's good if you guys come up with a common avenue of debugging > before I try further. It will be interesting to hear what James has to say about all this... > > Would you like to try this patch to see if it fixes the problem? As I > > said before, I'm not certain it's the best thing to do, but it worked > > on my system. > > I can try next week. Okay. > Is this patch likely to fix the RCU stall? I have no idea. All I know about this RCU stall is what you wrote before, i.e., that it happened once. No debugging information or anything else. However if it happens again with the patch in place, we'll be in a good position to figure the cause and fix it. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-03 15:29 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-03 15:29 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0, linux-usb-u79uwXL29TY76Z2rM5mHXA On Sun, 3 Jul 2011, Andi Kleen wrote: > > On my system, at least, the scsi_device's refcount dropped to 0 at the > > right time. That wasn't the problem. The NULL pointer occurs because > > the request_queue is used after the scsi_device has been removed from > > visibility; among other things, __scsi_remove_device() sets > > sdev->request_queue->queuedata to NULL. > > > > As the comment says, this causes the request function to reject all I/O > > requests -- but not before trying to dereference the NULL pointer! > > Your explanation completely contradicts what James wrote earlier. Yes, it does. I'm claiming that James was on a wrong track or looking in the wrong direction. > Maybe it's good if you guys come up with a common avenue of debugging > before I try further. It will be interesting to hear what James has to say about all this... > > Would you like to try this patch to see if it fixes the problem? As I > > said before, I'm not certain it's the best thing to do, but it worked > > on my system. > > I can try next week. Okay. > Is this patch likely to fix the RCU stall? I have no idea. All I know about this RCU stall is what you wrote before, i.e., that it happened once. No debugging information or anything else. However if it happens again with the patch in place, we'll be in a good position to figure the cause and fix it. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-03 15:29 ` Alan Stern @ 2011-07-03 16:06 ` Alan Stern -1 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-03 16:06 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sun, 3 Jul 2011, Alan Stern wrote: > > Your explanation completely contradicts what James wrote earlier. > > Yes, it does. I'm claiming that James was on a wrong track or looking > in the wrong direction. To amplify: James wrote: > Right, that wouldn't work. The sdev in question comes from > request_queue->queuedata. That only goes to null when the last > reference to the sdev has been released. So the root cause is something > in sd holding a reference to sdev without actually getting an additional > refcount. The third sentence is clearly wrong, because queuedata is set to NULL when sdev is removed from visibility, not when the last reference to it has been released, i.e., in __scsi_remove_device() rather than scsi_device_dev_release_usercontext(). Maybe this means the assignment to queuedata should be moved. I don't know; unlike James, I'm not an expert on the details of the SCSI stack. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-03 16:06 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-03 16:06 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sun, 3 Jul 2011, Alan Stern wrote: > > Your explanation completely contradicts what James wrote earlier. > > Yes, it does. I'm claiming that James was on a wrong track or looking > in the wrong direction. To amplify: James wrote: > Right, that wouldn't work. The sdev in question comes from > request_queue->queuedata. That only goes to null when the last > reference to the sdev has been released. So the root cause is something > in sd holding a reference to sdev without actually getting an additional > refcount. The third sentence is clearly wrong, because queuedata is set to NULL when sdev is removed from visibility, not when the last reference to it has been released, i.e., in __scsi_remove_device() rather than scsi_device_dev_release_usercontext(). Maybe this means the assignment to queuedata should be moved. I don't know; unlike James, I'm not an expert on the details of the SCSI stack. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 12:24 ` James Bottomley @ 2011-07-02 17:37 ` Alan Stern 2011-07-02 17:37 ` Alan Stern 1 sibling, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 17:37 UTC (permalink / raw) To: axboe, James Bottomley Cc: Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list Okay, I found the source of the problem. Or more accurately, I found two separate bugs. The first bug is triggered in scsi_request_fn(). At the start of that routine we have: struct scsi_device *sdev = q->queuedata; ... if (!sdev) { printk("scsi: killing requests for dead queue\n"); while ((req = blk_peek_request(q)) != NULL) scsi_kill_request(req, q); return; } The problem is that blk_peek_request() calls scsi_prep_fn(), which does this: struct scsi_device *sdev = q->queuedata; int ret = BLKPREP_KILL; if (req->cmd_type == REQ_TYPE_BLOCK_PC) ret = scsi_setup_blk_pc_cmnd(sdev, req); return scsi_prep_return(q, req, ret); It doesn't check to see if sdev is NULL, nor does scsi_setup_blk_pc_cmnd(). That accounts for this error: On Sat, 2 Jul 2011, James Bottomley wrote: > On Sat, 2011-07-02 at 08:08 +0200, Andi Kleen wrote: > > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > > give the exact sequence of steps you went through to trigger the bug? > > > > Connect USB storage device with builtin fake CD rom. Wait for udisk > > to mount it. Pull cable. udisk does umount. Oops. > > > > I also got a log of the refcounting now if you want it. > > So I've got the log, but this is the relevant section: > > --- > usb 2-1.5: USB disconnect, device number 4 > sr 5:0:0:1: scsi put_device 13 from device_del+0x177/0x1c0 > sr 5:0:0:1: scsi put_device 12 from bsg_kref_release_function+0x28/0x30 > sr 5:0:0:1: scsi put_device 10 from device_del+0x177/0x1c0 > sr 5:0:0:1: scsi put_device 8 from device_del+0x177/0x1c0 > sr 5:0:0:1: scsi put_device 7 from scsi_device_cls_release+0x15/0x20 > sr 5:0:0:1: scsi put_device 6 from klist_children_put+0x12/0x20 > sr 5:0:0:1: scsi put_device 5 from klist_devices_put+0x12/0x20 > sr 5:0:0:1: scsi put_device 3 from device_del+0x177/0x1c0 > scsi: killing requests for dead queue > BUG: sleeping function called from invalid context > at /home/ak/lsrc/git/linux-2.6/arch/x86/mm/fault.c:1103 > in_atomic(): 0, irqs_disabled(): 1, pid: 2527, name: umount > Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8 > Call Trace: > [<ffffffff8103af8c>] __might_sleep+0xcc/0xf0 > [<ffffffff8155af42>] do_page_fault+0x142/0x4c0 > [<ffffffffa01d5385>] ? write_msg+0x105/0x120 [netconsole] > [<ffffffff810514f7>] ? __call_console_drivers+0x97/0xb0 > [<ffffffff81079692>] ? up+0x32/0x50 > [<ffffffff81557f5f>] page_fault+0x1f/0x30 > [<ffffffff81389a70>] ? scsi_setup_blk_pc_cmnd+0x170/0x170 > [<ffffffff81388e19>] ? scsi_prep_state_check+0x9/0x90 > [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170 > [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60 > [<ffffffff812847ad>] blk_peek_request+0xbd/0x230 > [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470 > [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20 > [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0 > [<ffffffff81288676>] blk_execute_rq+0x86/0xf0 > [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0 > [<ffffffff81389c6c>] scsi_execute+0xfc/0x160 > [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140 > [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0 > [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30 > [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0 > [<ffffffff8139b390>] sr_lock_door+0x20/0x30 > [<ffffffff813c4d63>] cdrom_release+0xa3/0x260 An easy fix is to have scsi_prep_fn() check if sdev is NULL and return BLKPREP_KILL if it is. The second bug, which hit me but apparently not any of you, is that the request_queue's elevator gets deallocated while it is still in use. That's because __scsi_remove_device() calls scsi_free_queue(), which does blk_cleanup_queue(), which calls elevator_exit(), even though the device file is still open and more requests will be submitted when the file is closed. I'm not sure of the right fix for this. One possibility is to move the scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or maybe the elevator_exit() call should be moved to blk_release_queue(). Also, I have no idea why this shows up with USB drives but not other SCSI transports. A fluke of timing? Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 17:37 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 17:37 UTC (permalink / raw) To: axboe-tSWWG44O7X1aa/9Udqfwiw, James Bottomley Cc: Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list Okay, I found the source of the problem. Or more accurately, I found two separate bugs. The first bug is triggered in scsi_request_fn(). At the start of that routine we have: struct scsi_device *sdev = q->queuedata; ... if (!sdev) { printk("scsi: killing requests for dead queue\n"); while ((req = blk_peek_request(q)) != NULL) scsi_kill_request(req, q); return; } The problem is that blk_peek_request() calls scsi_prep_fn(), which does this: struct scsi_device *sdev = q->queuedata; int ret = BLKPREP_KILL; if (req->cmd_type == REQ_TYPE_BLOCK_PC) ret = scsi_setup_blk_pc_cmnd(sdev, req); return scsi_prep_return(q, req, ret); It doesn't check to see if sdev is NULL, nor does scsi_setup_blk_pc_cmnd(). That accounts for this error: On Sat, 2 Jul 2011, James Bottomley wrote: > On Sat, 2011-07-02 at 08:08 +0200, Andi Kleen wrote: > > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > > give the exact sequence of steps you went through to trigger the bug? > > > > Connect USB storage device with builtin fake CD rom. Wait for udisk > > to mount it. Pull cable. udisk does umount. Oops. > > > > I also got a log of the refcounting now if you want it. > > So I've got the log, but this is the relevant section: > > --- > usb 2-1.5: USB disconnect, device number 4 > sr 5:0:0:1: scsi put_device 13 from device_del+0x177/0x1c0 > sr 5:0:0:1: scsi put_device 12 from bsg_kref_release_function+0x28/0x30 > sr 5:0:0:1: scsi put_device 10 from device_del+0x177/0x1c0 > sr 5:0:0:1: scsi put_device 8 from device_del+0x177/0x1c0 > sr 5:0:0:1: scsi put_device 7 from scsi_device_cls_release+0x15/0x20 > sr 5:0:0:1: scsi put_device 6 from klist_children_put+0x12/0x20 > sr 5:0:0:1: scsi put_device 5 from klist_devices_put+0x12/0x20 > sr 5:0:0:1: scsi put_device 3 from device_del+0x177/0x1c0 > scsi: killing requests for dead queue > BUG: sleeping function called from invalid context > at /home/ak/lsrc/git/linux-2.6/arch/x86/mm/fault.c:1103 > in_atomic(): 0, irqs_disabled(): 1, pid: 2527, name: umount > Pid: 2527, comm: umount Not tainted 3.0.0-rc5+ #8 > Call Trace: > [<ffffffff8103af8c>] __might_sleep+0xcc/0xf0 > [<ffffffff8155af42>] do_page_fault+0x142/0x4c0 > [<ffffffffa01d5385>] ? write_msg+0x105/0x120 [netconsole] > [<ffffffff810514f7>] ? __call_console_drivers+0x97/0xb0 > [<ffffffff81079692>] ? up+0x32/0x50 > [<ffffffff81557f5f>] page_fault+0x1f/0x30 > [<ffffffff81389a70>] ? scsi_setup_blk_pc_cmnd+0x170/0x170 > [<ffffffff81388e19>] ? scsi_prep_state_check+0x9/0x90 > [<ffffffff8138992b>] scsi_setup_blk_pc_cmnd+0x2b/0x170 > [<ffffffff81389abd>] scsi_prep_fn+0x4d/0x60 > [<ffffffff812847ad>] blk_peek_request+0xbd/0x230 > [<ffffffff8138a1ea>] scsi_request_fn+0x44a/0x470 > [<ffffffff8127e42b>] __blk_run_queue+0x1b/0x20 > [<ffffffff812885a3>] blk_execute_rq_nowait+0x63/0xb0 > [<ffffffff81288676>] blk_execute_rq+0x86/0xf0 > [<ffffffff8128430d>] ? blk_get_request+0x6d/0xa0 > [<ffffffff81389c6c>] scsi_execute+0xfc/0x160 > [<ffffffff8138a40a>] scsi_execute_req+0xca/0x140 > [<ffffffff81383ea8>] ioctl_internal_command.clone.4+0x68/0x1a0 > [<ffffffff81103f82>] ? pagevec_lookup+0x22/0x30 > [<ffffffff8138405e>] scsi_set_medium_removal+0x7e/0xb0 > [<ffffffff8139b390>] sr_lock_door+0x20/0x30 > [<ffffffff813c4d63>] cdrom_release+0xa3/0x260 An easy fix is to have scsi_prep_fn() check if sdev is NULL and return BLKPREP_KILL if it is. The second bug, which hit me but apparently not any of you, is that the request_queue's elevator gets deallocated while it is still in use. That's because __scsi_remove_device() calls scsi_free_queue(), which does blk_cleanup_queue(), which calls elevator_exit(), even though the device file is still open and more requests will be submitted when the file is closed. I'm not sure of the right fix for this. One possibility is to move the scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or maybe the elevator_exit() call should be moved to blk_release_queue(). Also, I have no idea why this shows up with USB drives but not other SCSI transports. A fluke of timing? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 17:37 ` Alan Stern (?) @ 2011-07-02 18:11 ` Andi Kleen 2011-07-02 19:59 ` Alan Stern -1 siblings, 1 reply; 63+ messages in thread From: Andi Kleen @ 2011-07-02 18:11 UTC (permalink / raw) To: Alan Stern Cc: axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list > The problem is that blk_peek_request() calls scsi_prep_fn(), which > does this: > > struct scsi_device *sdev = q->queuedata; > int ret = BLKPREP_KILL; > > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > ret = scsi_setup_blk_pc_cmnd(sdev, req); > return scsi_prep_return(q, req, ret); > > It doesn't check to see if sdev is NULL, nor does > scsi_setup_blk_pc_cmnd(). That accounts for this error: I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, but that just caused RCU CPU stalls afterwards and then eventually a hung system. -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 18:11 ` Andi Kleen @ 2011-07-02 19:59 ` Alan Stern 2011-07-03 1:17 ` Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 19:59 UTC (permalink / raw) To: Andi Kleen Cc: axboe, James Bottomley, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Sat, 2 Jul 2011, Andi Kleen wrote: > > The problem is that blk_peek_request() calls scsi_prep_fn(), which > > does this: > > > > struct scsi_device *sdev = q->queuedata; > > int ret = BLKPREP_KILL; > > > > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > > ret = scsi_setup_blk_pc_cmnd(sdev, req); > > return scsi_prep_return(q, req, ret); > > > > It doesn't check to see if sdev is NULL, nor does > > scsi_setup_blk_pc_cmnd(). That accounts for this error: > > I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, > but that just caused RCU CPU stalls afterwards and then eventually > a hung system. The RCU problem is likely to be a separate issue. It might even be a result of the use-after-free problem with the elevator. At any rate, it's clear that the crash in the refcounting log you posted occurred because scsi_setup_blk_pc_cmnd() called scsi_prep_state_check(), which tried to dereference the NULL pointer. Would you like to try this patch to see if it fixes the problem? As I said before, I'm not certain it's the best thing to do, but it worked on my system. Alan Stern Index: usb-3.0/drivers/scsi/scsi_lib.c =================================================================== --- usb-3.0.orig/drivers/scsi/scsi_lib.c +++ usb-3.0/drivers/scsi/scsi_lib.c @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q struct scsi_device *sdev = q->queuedata; int ret = BLKPREP_KILL; + if (!sdev) + return ret; if (req->cmd_type == REQ_TYPE_BLOCK_PC) ret = scsi_setup_blk_pc_cmnd(sdev, req); return scsi_prep_return(q, req, ret); Index: usb-3.0/drivers/scsi/scsi_sysfs.c =================================================================== --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c +++ usb-3.0/drivers/scsi/scsi_sysfs.c @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user kfree(evt); } + /* Freeing the queue signals to block that we're done */ + scsi_free_queue(sdev->request_queue); blk_put_queue(sdev->request_queue); /* NULL queue means the device can't be used */ sdev->request_queue = NULL; @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de /* cause the request function to reject all I/O requests */ sdev->request_queue->queuedata = NULL; - /* Freeing the queue signals to block that we're done */ - scsi_free_queue(sdev->request_queue); put_device(dev); } ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 19:59 ` Alan Stern @ 2011-07-03 1:17 ` Andi Kleen 2011-07-07 20:47 ` solved was " Andi Kleen 2011-07-18 16:59 ` Dan Williams 2 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-03 1:17 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, axboe, James Bottomley, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list > Would you like to try this patch to see if it fixes the problem? As I > said before, I'm not certain it's the best thing to do, but it worked > on my system. I can try next week. Is this patch likely to fix the RCU stall? -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* solved was Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 19:59 ` Alan Stern 2011-07-03 1:17 ` Andi Kleen @ 2011-07-07 20:47 ` Andi Kleen 2011-07-18 16:59 ` Dan Williams 2 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-07 20:47 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, axboe, James Bottomley, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list > Would you like to try this patch to see if it fixes the problem? As I > said before, I'm not certain it's the best thing to do, but it worked > on my system. Sorry for the delay. I can confirm this patch fixes the problem for me. I did a few pulls and all performed fine. Thanks! -Andi > > Alan Stern > > > > > Index: usb-3.0/drivers/scsi/scsi_lib.c > =================================================================== > --- usb-3.0.orig/drivers/scsi/scsi_lib.c > +++ usb-3.0/drivers/scsi/scsi_lib.c > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q > struct scsi_device *sdev = q->queuedata; > int ret = BLKPREP_KILL; > > + if (!sdev) > + return ret; > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > ret = scsi_setup_blk_pc_cmnd(sdev, req); > return scsi_prep_return(q, req, ret); > Index: usb-3.0/drivers/scsi/scsi_sysfs.c > =================================================================== > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c > +++ usb-3.0/drivers/scsi/scsi_sysfs.c > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user > kfree(evt); > } > > + /* Freeing the queue signals to block that we're done */ > + scsi_free_queue(sdev->request_queue); > blk_put_queue(sdev->request_queue); > /* NULL queue means the device can't be used */ > sdev->request_queue = NULL; > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de > /* cause the request function to reject all I/O requests */ > sdev->request_queue->queuedata = NULL; > > - /* Freeing the queue signals to block that we're done */ > - scsi_free_queue(sdev->request_queue); > put_device(dev); > } > > > -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 19:59 ` Alan Stern 2011-07-03 1:17 ` Andi Kleen 2011-07-07 20:47 ` solved was " Andi Kleen @ 2011-07-18 16:59 ` Dan Williams 2011-07-18 18:00 ` Andi Kleen 2011-07-20 9:58 ` Jack Wang 2 siblings, 2 replies; 63+ messages in thread From: Dan Williams @ 2011-07-18 16:59 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, axboe, James Bottomley, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Sat, 2 Jul 2011, Andi Kleen wrote: > >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which >> > does this: >> > >> > struct scsi_device *sdev = q->queuedata; >> > int ret = BLKPREP_KILL; >> > >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); >> > return scsi_prep_return(q, req, ret); >> > >> > It doesn't check to see if sdev is NULL, nor does >> > scsi_setup_blk_pc_cmnd(). That accounts for this error: >> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, >> but that just caused RCU CPU stalls afterwards and then eventually >> a hung system. > > The RCU problem is likely to be a separate issue. It might even be a > result of the use-after-free problem with the elevator. > > At any rate, it's clear that the crash in the refcounting log you > posted occurred because scsi_setup_blk_pc_cmnd() called > scsi_prep_state_check(), which tried to dereference the NULL pointer. > > Would you like to try this patch to see if it fixes the problem? As I > said before, I'm not certain it's the best thing to do, but it worked > on my system. > > Alan Stern > > > > > Index: usb-3.0/drivers/scsi/scsi_lib.c > =================================================================== > --- usb-3.0.orig/drivers/scsi/scsi_lib.c > +++ usb-3.0/drivers/scsi/scsi_lib.c > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q > struct scsi_device *sdev = q->queuedata; > int ret = BLKPREP_KILL; > > + if (!sdev) > + return ret; > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > ret = scsi_setup_blk_pc_cmnd(sdev, req); > return scsi_prep_return(q, req, ret); > Index: usb-3.0/drivers/scsi/scsi_sysfs.c > =================================================================== > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c > +++ usb-3.0/drivers/scsi/scsi_sysfs.c > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user > kfree(evt); > } > > + /* Freeing the queue signals to block that we're done */ > + scsi_free_queue(sdev->request_queue); > blk_put_queue(sdev->request_queue); > /* NULL queue means the device can't be used */ > sdev->request_queue = NULL; > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de > /* cause the request function to reject all I/O requests */ > sdev->request_queue->queuedata = NULL; > > - /* Freeing the queue signals to block that we're done */ > - scsi_free_queue(sdev->request_queue); > put_device(dev); > } This patch seems to resolve the block/scsi null-ptr de-references in our libsas/isci environment, we have yet to try James' alternative [1]. Do we potentially need both? Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it seems only the "sdev->request_queue->queuedata = NULL" needed to be moved? The conversation appeared to be awaiting test results... [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2 -- Dan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-18 16:59 ` Dan Williams @ 2011-07-18 18:00 ` Andi Kleen 2011-07-20 9:58 ` Jack Wang 1 sibling, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-18 18:00 UTC (permalink / raw) To: Dan Williams Cc: Alan Stern, Andi Kleen, axboe, James Bottomley, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list > This patch seems to resolve the block/scsi null-ptr de-references in > our libsas/isci environment, we have yet to try James' alternative > [1]. Do we potentially need both? > > Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it > seems only the "sdev->request_queue->queuedata = NULL" needed to be > moved? > > The conversation appeared to be awaiting test results... Alan's patch worked for me too. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: Linux 3.0 oopses when pulling a USB CDROM 2011-07-18 16:59 ` Dan Williams @ 2011-07-20 9:58 ` Jack Wang 2011-07-20 9:58 ` Jack Wang 1 sibling, 0 replies; 63+ messages in thread From: Jack Wang @ 2011-07-20 9:58 UTC (permalink / raw) To: 'Dan Williams', 'Alan Stern' Cc: 'Andi Kleen', axboe, 'James Bottomley', 'Dave Jones', 'SCSI development list', 'Kernel development list', 'Rafael J. Wysocki', 'USB list' > > On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu> wrote: > > On Sat, 2 Jul 2011, Andi Kleen wrote: > > > >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which > >> > does this: > >> > > >> > struct scsi_device *sdev = q->queuedata; > >> > int ret = BLKPREP_KILL; > >> > > >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); > >> > return scsi_prep_return(q, req, ret); > >> > > >> > It doesn't check to see if sdev is NULL, nor does > >> > scsi_setup_blk_pc_cmnd(). That accounts for this error: > >> > >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, > >> but that just caused RCU CPU stalls afterwards and then eventually > >> a hung system. > > > > The RCU problem is likely to be a separate issue. It might even be a > > result of the use-after-free problem with the elevator. > > > > At any rate, it's clear that the crash in the refcounting log you > > posted occurred because scsi_setup_blk_pc_cmnd() called > > scsi_prep_state_check(), which tried to dereference the NULL pointer. > > > > Would you like to try this patch to see if it fixes the problem? As I > > said before, I'm not certain it's the best thing to do, but it worked > > on my system. > > > > Alan Stern > > > > > > > > > > Index: usb-3.0/drivers/scsi/scsi_lib.c > > =================================================================== > > --- usb-3.0.orig/drivers/scsi/scsi_lib.c > > +++ usb-3.0/drivers/scsi/scsi_lib.c > > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q > > struct scsi_device *sdev = q->queuedata; > > int ret = BLKPREP_KILL; > > > > + if (!sdev) > > + return ret; > > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > > ret = scsi_setup_blk_pc_cmnd(sdev, req); > > return scsi_prep_return(q, req, ret); > > Index: usb-3.0/drivers/scsi/scsi_sysfs.c > > =================================================================== > > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c > > +++ usb-3.0/drivers/scsi/scsi_sysfs.c > > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user > > kfree(evt); > > } > > > > + /* Freeing the queue signals to block that we're done */ > > + scsi_free_queue(sdev->request_queue); > > blk_put_queue(sdev->request_queue); > > /* NULL queue means the device can't be used */ > > sdev->request_queue = NULL; > > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de > > /* cause the request function to reject all I/O requests */ > > sdev->request_queue->queuedata = NULL; > > > > - /* Freeing the queue signals to block that we're done */ > > - scsi_free_queue(sdev->request_queue); > > put_device(dev); > > } > > This patch seems to resolve the block/scsi null-ptr de-references in > our libsas/isci environment, we have yet to try James' alternative > [1]. Do we potentially need both? > > Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it > seems only the "sdev->request_queue->queuedata = NULL" needed to be > moved? > > The conversation appeared to be awaiting test results... > > [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2 > > -- > Dan [Jack Wang] This patch fix kernel panic issue when hot-plut disk during I/O, I test it using pm8001 with 3.0.0-rc6 with above patch. > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-20 9:58 ` Jack Wang 0 siblings, 0 replies; 63+ messages in thread From: Jack Wang @ 2011-07-20 9:58 UTC (permalink / raw) To: 'Dan Williams', 'Alan Stern' Cc: 'Andi Kleen', axboe, 'James Bottomley', 'Dave Jones', 'SCSI development list', 'Kernel development list', 'Rafael J. Wysocki', 'USB list' > > On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu> wrote: > > On Sat, 2 Jul 2011, Andi Kleen wrote: > > > >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which > >> > does this: > >> > > >> > struct scsi_device *sdev = q->queuedata; > >> > int ret = BLKPREP_KILL; > >> > > >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); > >> > return scsi_prep_return(q, req, ret); > >> > > >> > It doesn't check to see if sdev is NULL, nor does > >> > scsi_setup_blk_pc_cmnd(). That accounts for this error: > >> > >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, > >> but that just caused RCU CPU stalls afterwards and then eventually > >> a hung system. > > > > The RCU problem is likely to be a separate issue. It might even be a > > result of the use-after-free problem with the elevator. > > > > At any rate, it's clear that the crash in the refcounting log you > > posted occurred because scsi_setup_blk_pc_cmnd() called > > scsi_prep_state_check(), which tried to dereference the NULL pointer. > > > > Would you like to try this patch to see if it fixes the problem? As I > > said before, I'm not certain it's the best thing to do, but it worked > > on my system. > > > > Alan Stern > > > > > > > > > > Index: usb-3.0/drivers/scsi/scsi_lib.c > > =================================================================== > > --- usb-3.0.orig/drivers/scsi/scsi_lib.c > > +++ usb-3.0/drivers/scsi/scsi_lib.c > > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q > > struct scsi_device *sdev = q->queuedata; > > int ret = BLKPREP_KILL; > > > > + if (!sdev) > > + return ret; > > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > > ret = scsi_setup_blk_pc_cmnd(sdev, req); > > return scsi_prep_return(q, req, ret); > > Index: usb-3.0/drivers/scsi/scsi_sysfs.c > > =================================================================== > > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c > > +++ usb-3.0/drivers/scsi/scsi_sysfs.c > > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user > > kfree(evt); > > } > > > > + /* Freeing the queue signals to block that we're done */ > > + scsi_free_queue(sdev->request_queue); > > blk_put_queue(sdev->request_queue); > > /* NULL queue means the device can't be used */ > > sdev->request_queue = NULL; > > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de > > /* cause the request function to reject all I/O requests */ > > sdev->request_queue->queuedata = NULL; > > > > - /* Freeing the queue signals to block that we're done */ > > - scsi_free_queue(sdev->request_queue); > > put_device(dev); > > } > > This patch seems to resolve the block/scsi null-ptr de-references in > our libsas/isci environment, we have yet to try James' alternative > [1]. Do we potentially need both? > > Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it > seems only the "sdev->request_queue->queuedata = NULL" needed to be > moved? > > The conversation appeared to be awaiting test results... > > [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2 > > -- > Dan [Jack Wang] This patch fix kernel panic issue when hot-plut disk during I/O, I test it using pm8001 with 3.0.0-rc6 with above patch. > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-20 9:58 ` Jack Wang (?) @ 2011-10-18 21:16 ` Ankit Jain 2011-10-18 21:30 ` James Bottomley -1 siblings, 1 reply; 63+ messages in thread From: Ankit Jain @ 2011-10-18 21:16 UTC (permalink / raw) To: Jack Wang Cc: Dan Williams, Alan Stern, Andi Kleen, axboe, James Bottomley, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Wed, Jul 20, 2011 at 3:28 PM, Jack Wang <jack_wang@usish.com> wrote: >> <snip> >> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu> > wrote: >> > On Sat, 2 Jul 2011, Andi Kleen wrote: >> > >> >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which >> >> > does this: >> >> > >> >> > struct scsi_device *sdev = q->queuedata; >> >> > int ret = BLKPREP_KILL; >> >> > >> >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) >> >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); >> >> > return scsi_prep_return(q, req, ret); >> >> > >> >> > It doesn't check to see if sdev is NULL, nor does >> >> > scsi_setup_blk_pc_cmnd(). That accounts for this error: >> >> >> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, >> >> but that just caused RCU CPU stalls afterwards and then eventually >> >> a hung system. >> > >> > The RCU problem is likely to be a separate issue. It might even be a >> > result of the use-after-free problem with the elevator. >> > >> > At any rate, it's clear that the crash in the refcounting log you >> > posted occurred because scsi_setup_blk_pc_cmnd() called >> > scsi_prep_state_check(), which tried to dereference the NULL pointer. >> > >> > Would you like to try this patch to see if it fixes the problem? As I >> > said before, I'm not certain it's the best thing to do, but it worked >> > on my system. >> > >> > Alan Stern >> > >> > >> > >> > >> > Index: usb-3.0/drivers/scsi/scsi_lib.c >> > =================================================================== >> > --- usb-3.0.orig/drivers/scsi/scsi_lib.c >> > +++ usb-3.0/drivers/scsi/scsi_lib.c >> > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q >> > struct scsi_device *sdev = q->queuedata; >> > int ret = BLKPREP_KILL; >> > >> > + if (!sdev) >> > + return ret; >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); >> > return scsi_prep_return(q, req, ret); >> > Index: usb-3.0/drivers/scsi/scsi_sysfs.c >> > =================================================================== >> > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c >> > +++ usb-3.0/drivers/scsi/scsi_sysfs.c >> > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user >> > kfree(evt); >> > } >> > >> > + /* Freeing the queue signals to block that we're done */ >> > + scsi_free_queue(sdev->request_queue); >> > blk_put_queue(sdev->request_queue); >> > /* NULL queue means the device can't be used */ >> > sdev->request_queue = NULL; >> > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de >> > /* cause the request function to reject all I/O requests */ >> > sdev->request_queue->queuedata = NULL; >> > >> > - /* Freeing the queue signals to block that we're done */ >> > - scsi_free_queue(sdev->request_queue); >> > put_device(dev); >> > } >> >> This patch seems to resolve the block/scsi null-ptr de-references in >> our libsas/isci environment, we have yet to try James' alternative >> [1]. Do we potentially need both? >> >> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it >> seems only the "sdev->request_queue->queuedata = NULL" needed to be >> moved? >> >> The conversation appeared to be awaiting test results... >> >> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2 >> >> -- >> Dan > [Jack Wang] > This patch fix kernel panic issue when hot-plut disk during I/O, I test it > using pm8001 with 3.0.0-rc6 with above patch. I don't see this patch in scsi-misc-2.6 or linus' tree. Is there a different patch that fixes the issue? -Ankit ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-10-18 21:16 ` Ankit Jain @ 2011-10-18 21:30 ` James Bottomley 2011-10-21 13:26 ` Hannes Reinecke 0 siblings, 1 reply; 63+ messages in thread From: James Bottomley @ 2011-10-18 21:30 UTC (permalink / raw) To: Ankit Jain Cc: Jack Wang, Dan Williams, Alan Stern, Andi Kleen, axboe, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Wed, 2011-10-19 at 02:46 +0530, Ankit Jain wrote: > On Wed, Jul 20, 2011 at 3:28 PM, Jack Wang <jack_wang@usish.com> wrote: > >> > <snip> > >> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern <stern@rowland.harvard.edu> > > wrote: > >> > On Sat, 2 Jul 2011, Andi Kleen wrote: > >> > > >> >> > The problem is that blk_peek_request() calls scsi_prep_fn(), which > >> >> > does this: > >> >> > > >> >> > struct scsi_device *sdev = q->queuedata; > >> >> > int ret = BLKPREP_KILL; > >> >> > > >> >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > >> >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); > >> >> > return scsi_prep_return(q, req, ret); > >> >> > > >> >> > It doesn't check to see if sdev is NULL, nor does > >> >> > scsi_setup_blk_pc_cmnd(). That accounts for this error: > >> >> > >> >> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, > >> >> but that just caused RCU CPU stalls afterwards and then eventually > >> >> a hung system. > >> > > >> > The RCU problem is likely to be a separate issue. It might even be a > >> > result of the use-after-free problem with the elevator. > >> > > >> > At any rate, it's clear that the crash in the refcounting log you > >> > posted occurred because scsi_setup_blk_pc_cmnd() called > >> > scsi_prep_state_check(), which tried to dereference the NULL pointer. > >> > > >> > Would you like to try this patch to see if it fixes the problem? As I > >> > said before, I'm not certain it's the best thing to do, but it worked > >> > on my system. > >> > > >> > Alan Stern > >> > > >> > > >> > > >> > > >> > Index: usb-3.0/drivers/scsi/scsi_lib.c > >> > =================================================================== > >> > --- usb-3.0.orig/drivers/scsi/scsi_lib.c > >> > +++ usb-3.0/drivers/scsi/scsi_lib.c > >> > @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q > >> > struct scsi_device *sdev = q->queuedata; > >> > int ret = BLKPREP_KILL; > >> > > >> > + if (!sdev) > >> > + return ret; > >> > if (req->cmd_type == REQ_TYPE_BLOCK_PC) > >> > ret = scsi_setup_blk_pc_cmnd(sdev, req); > >> > return scsi_prep_return(q, req, ret); > >> > Index: usb-3.0/drivers/scsi/scsi_sysfs.c > >> > =================================================================== > >> > --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c > >> > +++ usb-3.0/drivers/scsi/scsi_sysfs.c > >> > @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user > >> > kfree(evt); > >> > } > >> > > >> > + /* Freeing the queue signals to block that we're done */ > >> > + scsi_free_queue(sdev->request_queue); > >> > blk_put_queue(sdev->request_queue); > >> > /* NULL queue means the device can't be used */ > >> > sdev->request_queue = NULL; > >> > @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de > >> > /* cause the request function to reject all I/O requests */ > >> > sdev->request_queue->queuedata = NULL; > >> > > >> > - /* Freeing the queue signals to block that we're done */ > >> > - scsi_free_queue(sdev->request_queue); > >> > put_device(dev); > >> > } > >> > >> This patch seems to resolve the block/scsi null-ptr de-references in > >> our libsas/isci environment, we have yet to try James' alternative > >> [1]. Do we potentially need both? > >> > >> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it > >> seems only the "sdev->request_queue->queuedata = NULL" needed to be > >> moved? > >> > >> The conversation appeared to be awaiting test results... > >> > >> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2 > >> > >> -- > >> Dan > > [Jack Wang] > > This patch fix kernel panic issue when hot-plut disk during I/O, I test it > > using pm8001 with 3.0.0-rc6 with above patch. > > I don't see this patch in scsi-misc-2.6 or linus' tree. Is there a > different patch that fixes the > issue? It should be fixed by commit 777eb1bf15b8532c396821774bf6451e563438f5 Author: Hannes Reinecke <hare@suse.de> Date: Wed Sep 28 08:07:01 2011 -0600 block: Free queue resources at blk_release_queue() James ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-10-18 21:30 ` James Bottomley @ 2011-10-21 13:26 ` Hannes Reinecke 0 siblings, 0 replies; 63+ messages in thread From: Hannes Reinecke @ 2011-10-21 13:26 UTC (permalink / raw) To: James Bottomley Cc: Ankit Jain, Jack Wang, Dan Williams, Alan Stern, Andi Kleen, axboe, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On 10/18/2011 11:30 PM, James Bottomley wrote: > On Wed, 2011-10-19 at 02:46 +0530, Ankit Jain wrote: >> On Wed, Jul 20, 2011 at 3:28 PM, Jack Wang<jack_wang@usish.com> wrote: >>>> >> <snip> >>>> On Sat, Jul 2, 2011 at 12:59 PM, Alan Stern<stern@rowland.harvard.edu> >>> wrote: >>>>> On Sat, 2 Jul 2011, Andi Kleen wrote: >>>>> >>>>>>> The problem is that blk_peek_request() calls scsi_prep_fn(), which >>>>>>> does this: >>>>>>> >>>>>>> struct scsi_device *sdev = q->queuedata; >>>>>>> int ret = BLKPREP_KILL; >>>>>>> >>>>>>> if (req->cmd_type == REQ_TYPE_BLOCK_PC) >>>>>>> ret = scsi_setup_blk_pc_cmnd(sdev, req); >>>>>>> return scsi_prep_return(q, req, ret); >>>>>>> >>>>>>> It doesn't check to see if sdev is NULL, nor does >>>>>>> scsi_setup_blk_pc_cmnd(). That accounts for this error: >>>>>> >>>>>> I actually added a NULL check in scsi_setup_blk_pc_cmnd early on, >>>>>> but that just caused RCU CPU stalls afterwards and then eventually >>>>>> a hung system. >>>>> >>>>> The RCU problem is likely to be a separate issue. It might even be a >>>>> result of the use-after-free problem with the elevator. >>>>> >>>>> At any rate, it's clear that the crash in the refcounting log you >>>>> posted occurred because scsi_setup_blk_pc_cmnd() called >>>>> scsi_prep_state_check(), which tried to dereference the NULL pointer. >>>>> >>>>> Would you like to try this patch to see if it fixes the problem? As I >>>>> said before, I'm not certain it's the best thing to do, but it worked >>>>> on my system. >>>>> >>>>> Alan Stern >>>>> >>>>> >>>>> >>>>> >>>>> Index: usb-3.0/drivers/scsi/scsi_lib.c >>>>> =================================================================== >>>>> --- usb-3.0.orig/drivers/scsi/scsi_lib.c >>>>> +++ usb-3.0/drivers/scsi/scsi_lib.c >>>>> @@ -1247,6 +1247,8 @@ int scsi_prep_fn(struct request_queue *q >>>>> struct scsi_device *sdev = q->queuedata; >>>>> int ret = BLKPREP_KILL; >>>>> >>>>> + if (!sdev) >>>>> + return ret; >>>>> if (req->cmd_type == REQ_TYPE_BLOCK_PC) >>>>> ret = scsi_setup_blk_pc_cmnd(sdev, req); >>>>> return scsi_prep_return(q, req, ret); >>>>> Index: usb-3.0/drivers/scsi/scsi_sysfs.c >>>>> =================================================================== >>>>> --- usb-3.0.orig/drivers/scsi/scsi_sysfs.c >>>>> +++ usb-3.0/drivers/scsi/scsi_sysfs.c >>>>> @@ -322,6 +322,8 @@ static void scsi_device_dev_release_user >>>>> kfree(evt); >>>>> } >>>>> >>>>> + /* Freeing the queue signals to block that we're done */ >>>>> + scsi_free_queue(sdev->request_queue); >>>>> blk_put_queue(sdev->request_queue); >>>>> /* NULL queue means the device can't be used */ >>>>> sdev->request_queue = NULL; >>>>> @@ -936,8 +938,6 @@ void __scsi_remove_device(struct scsi_de >>>>> /* cause the request function to reject all I/O requests */ >>>>> sdev->request_queue->queuedata = NULL; >>>>> >>>>> - /* Freeing the queue signals to block that we're done */ >>>>> - scsi_free_queue(sdev->request_queue); >>>>> put_device(dev); >>>>> } >>>> >>>> This patch seems to resolve the block/scsi null-ptr de-references in >>>> our libsas/isci environment, we have yet to try James' alternative >>>> [1]. Do we potentially need both? >>>> >>>> Commit 86cbfb56 moved scsi_free_queue to __scsi_remove_device() but it >>>> seems only the "sdev->request_queue->queuedata = NULL" needed to be >>>> moved? >>>> >>>> The conversation appeared to be awaiting test results... >>>> >>>> [1]: http://marc.info/?l=linux-scsi&m=131007155700831&w=2 >>>> >>>> -- >>>> Dan >>> [Jack Wang] >>> This patch fix kernel panic issue when hot-plut disk during I/O, I test it >>> using pm8001 with 3.0.0-rc6 with above patch. >> >> I don't see this patch in scsi-misc-2.6 or linus' tree. Is there a >> different patch that fixes the >> issue? > > It should be fixed by > > commit 777eb1bf15b8532c396821774bf6451e563438f5 > Author: Hannes Reinecke<hare@suse.de> > Date: Wed Sep 28 08:07:01 2011 -0600 > > block: Free queue resources at blk_release_queue() > As much as I've hate to admit it, but it looks as if this is only a fix for the second part of the original patch. I've got reports that we still see crashes, which are fixed by the patch to scsi_lib.c. So please include this part. Do you need a resend? Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 17:37 ` Alan Stern (?) (?) @ 2011-07-03 9:14 ` Dan Williams 2011-07-03 18:16 ` Andi Kleen -1 siblings, 1 reply; 63+ messages in thread From: Dan Williams @ 2011-07-03 9:14 UTC (permalink / raw) To: Alan Stern Cc: axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Sat, Jul 2, 2011 at 10:37 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > The second bug, which hit me but apparently not any of you, is that the > request_queue's elevator gets deallocated while it is still in use. > That's because __scsi_remove_device() calls scsi_free_queue(), which > does blk_cleanup_queue(), which calls elevator_exit(), even though the > device file is still open and more requests will be submitted when the > file is closed. > > I'm not sure of the right fix for this. One possibility is to move the > scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or > maybe the elevator_exit() call should be moved to blk_release_queue(). > > Also, I have no idea why this shows up with USB drives but not other > SCSI transports. A fluke of timing? > Not sure it's related, but it sounds like the hotplug failures being seen with libsas hotplug. Have not had a chance yet to track it down further with isci, but mvsas developer Xiangliang Yu is reporting: http://marc.info/?l=linux-scsi&m=130707166512002&w=2 He tracked it down to a suspected regression between .39-rc4 and .39-rc5 but did not finalize the bisect. The isci driver sees this and other signatures all seemingly related to early device tear down and only when pulling a drive with in-flight i/o. -- Dan ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-03 9:14 ` Dan Williams @ 2011-07-03 18:16 ` Andi Kleen 0 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-03 18:16 UTC (permalink / raw) To: Dan Williams Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list > Not sure it's related, but it sounds like the hotplug failures being > seen with libsas hotplug. Have not had a chance yet to track it down > further with isci, but mvsas developer Xiangliang Yu is reporting: It looks somewhat different: he has a bad pointer, not a NULL pointer, and it's in elv_completed_request() (but that may be related to the different request types) > > http://marc.info/?l=linux-scsi&m=130707166512002&w=2 > > He tracked it down to a suspected regression between .39-rc4 and > .39-rc5 but did not finalize the bisect. I can check if it disappears with .39-rc4. > The isci driver sees this and other signatures all seemingly related > to early device tear down and only when pulling a drive with in-flight > i/o. In my case it's not really in flight IO, but umount after pull with an idle drive. -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 17:37 ` Alan Stern ` (2 preceding siblings ...) (?) @ 2011-07-03 20:37 ` Stefan Richter 2011-07-08 13:37 ` Stefan Richter -1 siblings, 1 reply; 63+ messages in thread From: Stefan Richter @ 2011-07-03 20:37 UTC (permalink / raw) To: Alan Stern Cc: axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Jul 02 Alan Stern wrote: > Also, I have no idea why this shows up with USB drives but not other > SCSI transports. A fluke of timing? A while ago I frequently observed oopses at removal of a FireWire CompactFlash card reader (an sd device which exposes itself as device with removable medium). At that time I wasn't motivated to track down whether the bug resided in the block or SCSI or firewire subsystem. The bug was apparently triggered because hald was polling the device like crazy for media changes, and that polling coincided with device hot unplug. Furthermore, FireWire CD-ROM removal never has been an exactly glitch-free experience because the SCSI stack occasionally went on to issue command retries for many minutes after device removal a.k.a. DID_NO_CONNECT. I don't remember crashes during FireWire CD-ROM removal, but it has been a while that I last used CD-ROM drives. I might test the card reader and a CD-ROM later next week on 3.0-rc6 if I find the time and the issue hadn't been resolved by then. I don't have hald anymore, but there are certainly other ways to force accesses during device shutdown. -- Stefan Richter -=====-==-== -=== ---== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-03 20:37 ` Stefan Richter @ 2011-07-08 13:37 ` Stefan Richter 2011-07-08 13:41 ` Stefan Richter 0 siblings, 1 reply; 63+ messages in thread From: Stefan Richter @ 2011-07-08 13:37 UTC (permalink / raw) To: SCSI development list Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones, Kernel development list, Rafael J. Wysocki, USB list On Jul 03 Stefan Richter wrote: > On Jul 02 Alan Stern wrote: > > Also, I have no idea why this shows up with USB drives but not other > > SCSI transports. A fluke of timing? > > A while ago I frequently observed oopses at removal of a FireWire > CompactFlash card reader (an sd device which exposes itself as device with > removable medium). At that time I wasn't motivated to track down whether > the bug resided in the block or SCSI or firewire subsystem. The bug was > apparently triggered because hald was polling the device like crazy for > media changes, and that polling coincided with device hot unplug. > > Furthermore, FireWire CD-ROM removal never has been an exactly glitch-free > experience because the SCSI stack occasionally went on to issue command > retries for many minutes after device removal a.k.a. DID_NO_CONNECT. I > don't remember crashes during FireWire CD-ROM removal, but it has been a > while that I last used CD-ROM drives. > > I might test the card reader and a CD-ROM later next week on 3.0-rc6 if I > find the time and the issue hadn't been resolved by then. I don't have > hald anymore, but there are certainly other ways to force accesses during > device shutdown. I have not tested FireWire CD-ROMs or card reader yet. But today I did get the following crash on 3.0-rc6-71-g4dd1b49 x86-64 plus Alan's "USB: additional regression fix for device removal" when I powered down an USB hub with an empty USB card reader attached (or when I powered it up, impossible for me to say): general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PID: 8299, comm: kworker/0:0 Not tainted RIP: [...] elv_put_request+0xb/0x1b [...] Call Trace: __blk_put_request ? blk_put_request blk_put_request scsi_execute scsi_execute_req ? sd_check_events scsi_test_unit_ready ? kmem_cache_alloc ? sd_check_events sd_check_events disk_events_workfn [...] After this trace, the following messages were logged: generic-usb 0003:046D:C51.0059: can't reset device, 0000:00:12.2-3.1.3/input/0, status -71 usb 1-3.1: clear tt 1 (0530) error -19 BUG: unable to handle kernel paging request at fffffffffffffff8 IP: [...] kthread_data+0xb/0x11 Pid: 8299, comm: kworker/0:0 Tainted: G D RIP: 0010:[<ffffffff8104aa8f>] I can upload a screenshot of this crash if desired. The 2.6.39 and 3.0 block and/or SCSI layer are in a really sad state. -- Stefan Richter -=====-==-== -=== -=--- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-08 13:37 ` Stefan Richter @ 2011-07-08 13:41 ` Stefan Richter 0 siblings, 0 replies; 63+ messages in thread From: Stefan Richter @ 2011-07-08 13:41 UTC (permalink / raw) To: linux-scsi Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones, Kernel development list, Rafael J. Wysocki, USB list On Jul 08 Stefan Richter wrote: > But today I did get the following crash on 3.0-rc6-71-g4dd1b49 x86-64 plus > Alan's "USB: additional regression fix for device removal" when I powered > down an USB hub with an empty USB card reader attached (or when I powered > it up, impossible for me to say): Judging from the missing hours in an application log, it happened at power down of the hub = at removal of the card reader. -- Stefan Richter -=====-==-== -=== -=--- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-08 13:41 ` Stefan Richter 0 siblings, 0 replies; 63+ messages in thread From: Stefan Richter @ 2011-07-08 13:41 UTC (permalink / raw) To: linux-scsi Cc: Alan Stern, axboe, James Bottomley, Andi Kleen, Dave Jones, Kernel development list, Rafael J. Wysocki, USB list On Jul 08 Stefan Richter wrote: > But today I did get the following crash on 3.0-rc6-71-g4dd1b49 x86-64 plus > Alan's "USB: additional regression fix for device removal" when I powered > down an USB hub with an empty USB card reader attached (or when I powered > it up, impossible for me to say): Judging from the missing hours in an application log, it happened at power down of the hub = at removal of the card reader. -- Stefan Richter -=====-==-== -=== -=--- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-04 11:27 ` Heiko Carstens 0 siblings, 0 replies; 63+ messages in thread From: Heiko Carstens @ 2011-07-04 11:27 UTC (permalink / raw) To: Alan Stern Cc: axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote: > The second bug, which hit me but apparently not any of you, is that the > request_queue's elevator gets deallocated while it is still in use. > That's because __scsi_remove_device() calls scsi_free_queue(), which > does blk_cleanup_queue(), which calls elevator_exit(), even though the > device file is still open and more requests will be submitted when the > file is closed. > > I'm not sure of the right fix for this. One possibility is to move the > scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or > maybe the elevator_exit() call should be moved to blk_release_queue(). > > Also, I have no idea why this shows up with USB drives but not other > SCSI transports. A fluke of timing? FWIW, I reported a bug where the request_queue's elevator got deallocated while it was still in use (fc transport with device hotplug): http://www.spinics.net/lists/linux-scsi/msg52879.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-04 11:27 ` Heiko Carstens 0 siblings, 0 replies; 63+ messages in thread From: Heiko Carstens @ 2011-07-04 11:27 UTC (permalink / raw) To: Alan Stern Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote: > The second bug, which hit me but apparently not any of you, is that the > request_queue's elevator gets deallocated while it is still in use. > That's because __scsi_remove_device() calls scsi_free_queue(), which > does blk_cleanup_queue(), which calls elevator_exit(), even though the > device file is still open and more requests will be submitted when the > file is closed. > > I'm not sure of the right fix for this. One possibility is to move the > scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or > maybe the elevator_exit() call should be moved to blk_release_queue(). > > Also, I have no idea why this shows up with USB drives but not other > SCSI transports. A fluke of timing? FWIW, I reported a bug where the request_queue's elevator got deallocated while it was still in use (fc transport with device hotplug): http://www.spinics.net/lists/linux-scsi/msg52879.html -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-04 11:27 ` Heiko Carstens (?) @ 2011-07-04 16:04 ` Alan Stern 2011-07-06 6:50 ` Heiko Carstens 2011-07-12 18:49 ` Jonathan McDowell -1 siblings, 2 replies; 63+ messages in thread From: Alan Stern @ 2011-07-04 16:04 UTC (permalink / raw) To: Heiko Carstens Cc: axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Mon, 4 Jul 2011, Heiko Carstens wrote: > On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote: > > The second bug, which hit me but apparently not any of you, is that the > > request_queue's elevator gets deallocated while it is still in use. > > That's because __scsi_remove_device() calls scsi_free_queue(), which > > does blk_cleanup_queue(), which calls elevator_exit(), even though the > > device file is still open and more requests will be submitted when the > > file is closed. > > > > I'm not sure of the right fix for this. One possibility is to move the > > scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or > > maybe the elevator_exit() call should be moved to blk_release_queue(). > > > > Also, I have no idea why this shows up with USB drives but not other > > SCSI transports. A fluke of timing? > > FWIW, I reported a bug where the request_queue's elevator got deallocated > while it was still in use (fc transport with device hotplug): > > http://www.spinics.net/lists/linux-scsi/msg52879.html That does sound like the second bug I encountered. Can you reproduce it? Does the patch here: http://marc.info/?l=linux-kernel&m=130963676907731&w=2 fix the problem? Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-04 16:04 ` Alan Stern @ 2011-07-06 6:50 ` Heiko Carstens 2011-07-12 18:49 ` Jonathan McDowell 1 sibling, 0 replies; 63+ messages in thread From: Heiko Carstens @ 2011-07-06 6:50 UTC (permalink / raw) To: Alan Stern Cc: axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list, Manvanthara B. Puttashankar, Roland Dreier On Mon, Jul 04, 2011 at 12:04:54PM -0400, Alan Stern wrote: > On Mon, 4 Jul 2011, Heiko Carstens wrote: > > > On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote: > > > The second bug, which hit me but apparently not any of you, is that the > > > request_queue's elevator gets deallocated while it is still in use. > > > That's because __scsi_remove_device() calls scsi_free_queue(), which > > > does blk_cleanup_queue(), which calls elevator_exit(), even though the > > > device file is still open and more requests will be submitted when the > > > file is closed. > > > > > > I'm not sure of the right fix for this. One possibility is to move the > > > scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or > > > maybe the elevator_exit() call should be moved to blk_release_queue(). > > > > > > Also, I have no idea why this shows up with USB drives but not other > > > SCSI transports. A fluke of timing? > > > > FWIW, I reported a bug where the request_queue's elevator got deallocated > > while it was still in use (fc transport with device hotplug): > > > > http://www.spinics.net/lists/linux-scsi/msg52879.html > > That does sound like the second bug I encountered. Can you reproduce > it? Does the patch here: > > http://marc.info/?l=linux-kernel&m=130963676907731&w=2 > > fix the problem? I used to be reproducible, however I have to ask Manvathara to try your patch and see if it works. Thanks! ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-04 16:04 ` Alan Stern 2011-07-06 6:50 ` Heiko Carstens @ 2011-07-12 18:49 ` Jonathan McDowell 1 sibling, 0 replies; 63+ messages in thread From: Jonathan McDowell @ 2011-07-12 18:49 UTC (permalink / raw) To: Alan Stern Cc: Heiko Carstens, axboe, James Bottomley, Andi Kleen, Dave Jones, SCSI development list, Kernel development list, Rafael J. Wysocki, USB list On Mon, Jul 04, 2011 at 12:04:54PM -0400, Alan Stern wrote: > On Mon, 4 Jul 2011, Heiko Carstens wrote: > > > On Sat, Jul 02, 2011 at 01:37:59PM -0400, Alan Stern wrote: > > > The second bug, which hit me but apparently not any of you, is that the > > > request_queue's elevator gets deallocated while it is still in use. > > > That's because __scsi_remove_device() calls scsi_free_queue(), which > > > does blk_cleanup_queue(), which calls elevator_exit(), even though the > > > device file is still open and more requests will be submitted when the > > > file is closed. > > > > > > I'm not sure of the right fix for this. One possibility is to move the > > > scsi_free_queue() call to scsi_device_dev_release_usercontext(). Or > > > maybe the elevator_exit() call should be moved to blk_release_queue(). > > > > > > Also, I have no idea why this shows up with USB drives but not other > > > SCSI transports. A fluke of timing? > > > > FWIW, I reported a bug where the request_queue's elevator got deallocated > > while it was still in use (fc transport with device hotplug): > > > > http://www.spinics.net/lists/linux-scsi/msg52879.html > > That does sound like the second bug I encountered. Can you reproduce > it? Does the patch here: > > http://marc.info/?l=linux-kernel&m=130963676907731&w=2 > > fix the problem? FWIW I'm seeing crashes when FC devices go away while in use as well, under 2.6.39 and 3.0.0-rc6. I will try the patch linked to above, but the most recent Oops was: [71286.103409] end_request: I/O error, dev sdaw, sector 0 [71286.113710] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [71286.117681] IP: [<ffffffff81197828>] elv_completed_request+0x38/0x47 [71286.117681] PGD 2571c8067 PUD 253b81067 PMD 0 [71286.117681] Oops: 0000 [#1] SMP [71286.117681] CPU 0 [71286.117681] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables autofs4 ipv6 kvm_intel kvm nfsd nfs lockd auth_rpcgss nfs_acl sunrpc dm_round_robin dm_multipath scsi_dh ipmi_devintf ipmi_si ipmi_msghandler sg evdev processor button thermal_sys serio_raw i5k_amb i2c_i801 ioatdma i2c_core dca rng_core tpm_tis tpm tpm_bios ext3 jbd dm_mod ses enclosure ata_generic ata_piix lpfc scsi_transport_fc scsi_tgt [last unloaded: scsi_wait_scan] [71286.117681] [71286.117681] Pid: 0, comm: swapper Not tainted 3.0.0-rc6 #15 Intel S5000PAL./S5000PAL0 [71286.117681] RIP: 0010:[<ffffffff81197828>] [<ffffffff81197828>] elv_completed_request+0x38/0x47 [71286.117681] RSP: 0018:ffff88025fc03e10 EFLAGS: 00010002 [71286.117681] RAX: 0000000000000000 RBX: ffff880253cdc1c0 RCX: 00000000000003fe [71286.117681] RDX: ffff880253155840 RSI: ffff880255e37c70 RDI: ffff880253cdc1c0 [71286.117681] RBP: ffff880255e37c70 R08: 00000001010ec65f R09: 0000000000000000 [71286.117681] R10: ffff880255e37c70 R11: ffffffff817e3e98 R12: 00000000fffffffb [71286.117681] R13: 0000000000000246 R14: 0000000000000000 R15: 0000000000000000 [71286.117681] FS: 0000000000000000(0000) GS:ffff88025fc00000(0000) knlGS:0000000000000000 [71286.117681] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [71286.117681] CR2: 0000000000000048 CR3: 0000000257144000 CR4: 00000000000006f0 [71286.117681] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [71286.117681] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [71286.117681] Process swapper (pid: 0, threadinfo ffffffff81600000, task ffffffff8165b020) [71286.117681] Stack: [71286.117681] ffff880255e37c70 ffffffff8119c27e ffff880255e37c70 ffff880253cdc1c0 [71286.117681] 00000000fffffffb ffffffff8119d0c1 0000000000000000 ffff880255d733c0 [71286.117681] ffff880255e37c70 0000000000000000 00000000fffffffb ffffffff8122dfbb [71286.117681] Call Trace: [71286.117681] <IRQ> [71286.117681] [<ffffffff8119c27e>] ? __blk_put_request+0x2e/0xb0 [71286.117681] [<ffffffff8119d0c1>] ? blk_end_bidi_request+0x3b/0x55 [71286.117681] [<ffffffff8122dfbb>] ? scsi_io_completion+0x431/0x48e [71286.117681] [<ffffffff811a110f>] ? blk_done_softirq+0x5f/0x6c [71286.117681] [<ffffffff8103bc7d>] ? __do_softirq+0xbe/0x194 [71286.117681] [<ffffffff810569c6>] ? timekeeping_get_ns+0xd/0x2a [71286.117681] [<ffffffff8130dc0c>] ? call_softirq+0x1c/0x30 [71286.117681] [<ffffffff81003fc5>] ? do_softirq+0x31/0x63 [71286.117681] [<ffffffff8103ba69>] ? irq_exit+0x3f/0x9f [71286.117681] [<ffffffff8130d873>] ? call_function_single_interrupt+0x13/0x20 [71286.117681] <EOI> [71286.117681] [<ffffffffa012d0ca>] ? acpi_idle_enter_simple+0xb4/0xe2 [processor] [71286.117681] [<ffffffffa012d0c5>] ? acpi_idle_enter_simple+0xaf/0xe2 [processor] [71286.117681] [<ffffffff81277aba>] ? cpuidle_idle_call+0xe4/0x162 [71286.117681] [<ffffffff81001da4>] ? cpu_idle+0xa5/0xdb [71286.117681] [<ffffffff816c1ba8>] ? start_kernel+0x38e/0x399 [71286.117681] [<ffffffff816c138f>] ? x86_64_start_kernel+0xee/0xf2 [71286.117681] Code: 40 74 35 83 7e 44 01 74 04 a8 40 74 2b 83 e0 11 ff c8 0f 95 c0 83 e0 01 48 05 fc 00 00 00 ff 4c 87 04 f6 46 41 04 74 10 48 8b 02 [71286.117681] 8b 40 48 48 85 c0 74 04 41 58 ff e0 59 c3 48 83 ec 08 48 8d [71286.117681] RIP [<ffffffff81197828>] elv_completed_request+0x38/0x47 [71286.117681] RSP <ffff88025fc03e10> [71286.117681] CR2: 0000000000000048 [71286.117681] ---[ end trace 242b012d98a46112 ]--- [71286.117681] Kernel panic - not syncing: Fatal exception in interrupt J. -- Listen to the words, they tell you what to do... ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 6:08 ` Andi Kleen @ 2011-07-02 12:38 ` Alan Stern 2011-07-02 12:38 ` Alan Stern 2011-07-02 12:48 ` Rafael J. Wysocki 2 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 12:38 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, 2 Jul 2011, Andi Kleen wrote: > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > give the exact sequence of steps you went through to trigger the bug? > > Connect USB storage device with builtin fake CD rom. Wait for udisk > to mount it. Pull cable. udisk does umount. Oops. > > I also got a log of the refcounting now if you want it. Yes, please. The excerpt that James posted isn't enough. In particular, it doesn't show what happens during device creation. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 12:38 ` Alan Stern 0 siblings, 0 replies; 63+ messages in thread From: Alan Stern @ 2011-07-02 12:38 UTC (permalink / raw) To: Andi Kleen Cc: James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, 2 Jul 2011, Andi Kleen wrote: > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > give the exact sequence of steps you went through to trigger the bug? > > Connect USB storage device with builtin fake CD rom. Wait for udisk > to mount it. Pull cable. udisk does umount. Oops. > > I also got a log of the refcounting now if you want it. Yes, please. The excerpt that James posted isn't enough. In particular, it doesn't show what happens during device creation. Alan Stern ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 18:10 ` Andi Kleen 0 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-02 18:10 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, rjw, linux-usb On Sat, Jul 02, 2011 at 08:38:20AM -0400, Alan Stern wrote: > On Sat, 2 Jul 2011, Andi Kleen wrote: > > > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > > give the exact sequence of steps you went through to trigger the bug? > > > > Connect USB storage device with builtin fake CD rom. Wait for udisk > > to mount it. Pull cable. udisk does umount. Oops. > > > > I also got a log of the refcounting now if you want it. > > Yes, please. The excerpt that James posted isn't enough. In > particular, it doesn't show what happens during device creation. http://halobates.de/tmp/usb-pull-oops-ref -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 18:10 ` Andi Kleen 0 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-02 18:10 UTC (permalink / raw) To: Alan Stern Cc: Andi Kleen, James Bottomley, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, rjw-KKrjLPT3xs0, linux-usb-u79uwXL29TY76Z2rM5mHXA On Sat, Jul 02, 2011 at 08:38:20AM -0400, Alan Stern wrote: > On Sat, 2 Jul 2011, Andi Kleen wrote: > > > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > > give the exact sequence of steps you went through to trigger the bug? > > > > Connect USB storage device with builtin fake CD rom. Wait for udisk > > to mount it. Pull cable. udisk does umount. Oops. > > > > I also got a log of the refcounting now if you want it. > > Yes, please. The excerpt that James posted isn't enough. In > particular, it doesn't show what happens during device creation. http://halobates.de/tmp/usb-pull-oops-ref -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 12:48 ` Rafael J. Wysocki 0 siblings, 0 replies; 63+ messages in thread From: Rafael J. Wysocki @ 2011-07-02 12:48 UTC (permalink / raw) To: Andi Kleen Cc: Alan Stern, James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, linux-usb On Saturday, July 02, 2011, Andi Kleen wrote: > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > give the exact sequence of steps you went through to trigger the bug? > > Connect USB storage device with builtin fake CD rom. Wait for udisk > to mount it. Pull cable. udisk does umount. Oops. > > I also got a log of the refcounting now if you want it. Does reverting commit e1866b33b1e89f077b7132daae3dfd9a594e9a1a help? Rafael ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM @ 2011-07-02 12:48 ` Rafael J. Wysocki 0 siblings, 0 replies; 63+ messages in thread From: Rafael J. Wysocki @ 2011-07-02 12:48 UTC (permalink / raw) To: Andi Kleen Cc: Alan Stern, James Bottomley, Dave Jones, linux-scsi-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, axboe-tSWWG44O7X1aa/9Udqfwiw, linux-usb-u79uwXL29TY76Z2rM5mHXA On Saturday, July 02, 2011, Andi Kleen wrote: > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > give the exact sequence of steps you went through to trigger the bug? > > Connect USB storage device with builtin fake CD rom. Wait for udisk > to mount it. Pull cable. udisk does umount. Oops. > > I also got a log of the refcounting now if you want it. Does reverting commit e1866b33b1e89f077b7132daae3dfd9a594e9a1a help? Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-02 12:48 ` Rafael J. Wysocki (?) @ 2011-07-02 17:06 ` Andi Kleen -1 siblings, 0 replies; 63+ messages in thread From: Andi Kleen @ 2011-07-02 17:06 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Andi Kleen, Alan Stern, James Bottomley, Dave Jones, linux-scsi, linux-kernel, axboe, linux-usb On Sat, Jul 02, 2011 at 02:48:22PM +0200, Rafael J. Wysocki wrote: > On Saturday, July 02, 2011, Andi Kleen wrote: > > > I'm not able to reproduce it on a vanilla 3.0-rc5 system. Can anybody > > > give the exact sequence of steps you went through to trigger the bug? > > > > Connect USB storage device with builtin fake CD rom. Wait for udisk > > to mount it. Pull cable. udisk does umount. Oops. > > > > I also got a log of the refcounting now if you want it. > > Does reverting commit e1866b33b1e89f077b7132daae3dfd9a594e9a1a help? I'll try next week. -Andi ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 17:05 Linux 3.0 oopses when pulling a USB CDROM Andi Kleen 2011-07-01 18:14 ` Dave Jones @ 2011-07-01 19:20 ` James Bottomley 2011-07-01 19:33 ` James Bottomley 1 sibling, 1 reply; 63+ messages in thread From: James Bottomley @ 2011-07-01 19:20 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw On Fri, 2011-07-01 at 10:05 -0700, Andi Kleen wrote: > Hi, > > I found I can reliably crash a 3.0 system by pulling the > USB cable of a mounted USB cdrom (or rather a USB device which > has a builtin fake CD-ROM) > > I suspect it's a regression too. > > It ends with a NULL pointer reference on a NULL sdev in > scsi_prep_state_check. > > Here's a somewhat incomplete backtrace (written down by hand) > > scsi_prep_state_check > scsi_setup_blk_pc_cmnd > blk_peek_request > ... > scsi_request_fn > ... > ioctl_internal_command > ... > scsi_set_medium_removal > sr_lock_door > cdrom_release > ... > umount > > I tried adding a > > if (!sdev) > return BLKPREP_KILL; > > to scsi_prep_state_check, but that caused a RCU CPU stall > and a generally unhappy system instead. Right, that wouldn't work. The sdev in question comes from request_queue->queuedata. That only goes to null when the last reference to the sdev has been released. So the root cause is something in sd holding a reference to sdev without actually getting an additional refcount. > The sdev must be still there in scsi_set_medium_removal because it's > referenced, so it must get lost somewhere in SCSI or in the block layer. > > Any ideas how to fix this? I'll see if I can find the refcounting problem. Likely it's a longstanding bug which we didn't actually notice until now. James ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 19:20 ` James Bottomley @ 2011-07-01 19:33 ` James Bottomley 2011-07-01 19:45 ` James Bottomley 0 siblings, 1 reply; 63+ messages in thread From: James Bottomley @ 2011-07-01 19:33 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw On Fri, 2011-07-01 at 14:20 -0500, James Bottomley wrote: > I'll see if I can find the refcounting problem. > > Likely it's a longstanding bug which we didn't actually notice until > now. OK, so it looks like we have correct refcounting on the sr_block ops, but not on the actual cdrom ops, so the device is already gone by the time the cdrom release function gets called. This is my best guess at the fix (just add a get into the cdrom ops), does it work? James --- diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c index 4778e27..ee75983 100644 --- a/drivers/scsi/sr.c +++ b/drivers/scsi/sr.c @@ -575,7 +575,10 @@ static int sr_open(struct cdrom_device_info *cdi, int purpose) { struct scsi_cd *cd = cdi->handle; struct scsi_device *sdev = cd->device; - int retval; + int retval = scsi_device_get(sdev); + + if (!retval) + return retval; /* * If the device is in error recovery, wait until it is done. @@ -588,6 +591,7 @@ static int sr_open(struct cdrom_device_info *cdi, int purpose) return 0; error_out: + scsi_device_put(sdev); return retval; } @@ -598,6 +602,7 @@ static void sr_release(struct cdrom_device_info *cdi) if (cd->device->sector_size > 2048) sr_set_blocklength(cd, 2048); + scsi_device_put(cd->device); } static int sr_probe(struct device *dev) ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: Linux 3.0 oopses when pulling a USB CDROM 2011-07-01 19:33 ` James Bottomley @ 2011-07-01 19:45 ` James Bottomley 0 siblings, 0 replies; 63+ messages in thread From: James Bottomley @ 2011-07-01 19:45 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-scsi, linux-kernel, axboe, rjw On Fri, 2011-07-01 at 14:33 -0500, James Bottomley wrote: > On Fri, 2011-07-01 at 14:20 -0500, James Bottomley wrote: > > I'll see if I can find the refcounting problem. > > > > Likely it's a longstanding bug which we didn't actually notice until > > now. > > OK, so it looks like we have correct refcounting on the sr_block ops, > but not on the actual cdrom ops, so the device is already gone by the > time the cdrom release function gets called. > > This is my best guess at the fix (just add a get into the cdrom ops), > does it work? Actually, forget that. The cdrom ops open/release is entirely subordinate to the block one. unless we have some type of double put, the sdev should be held by refcount right up until the cdrom code calls cdo->release(). However, there is one ioctl operation after that, but it's tray_move() not lock_door(). This should be the fix for that, but if we're oopsing in lock_door(), it's almost certain we've run into some refcounting imbalance. James --- diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c index 75fb965..512bd31 100644 --- a/drivers/cdrom/cdrom.c +++ b/drivers/cdrom/cdrom.c @@ -1220,12 +1220,12 @@ void cdrom_release(struct cdrom_device_info *cdi, fmode_t mode) if (CDROM_CAN(CDC_RAM) && !cdi->use_count && cdi->for_data) cdrom_close_write(cdi); - cdo->release(cdi); if (cdi->use_count == 0) { /* last process that closes dev*/ if (opened_for_data && cdi->options & CDO_AUTO_EJECT && CDROM_CAN(CDC_OPEN_TRAY)) cdo->tray_move(cdi, 1); } + cdo->release(cdi); } static int cdrom_read_mech_status(struct cdrom_device_info *cdi, ^ permalink raw reply related [flat|nested] 63+ messages in thread
end of thread, other threads:[~2011-10-21 13:26 UTC | newest] Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-07-01 17:05 Linux 3.0 oopses when pulling a USB CDROM Andi Kleen 2011-07-01 18:14 ` Dave Jones 2011-07-01 18:32 ` Andi Kleen 2011-07-01 18:40 ` Dave Jones 2011-07-02 15:13 ` Christoph Fritz 2011-07-01 20:29 ` James Bottomley 2011-07-01 20:43 ` [PATCH] USB: fix regression occurring during device removal Alan Stern 2011-07-01 20:43 ` Alan Stern 2011-07-01 21:04 ` Andi Kleen 2011-07-01 21:04 ` Linux 3.0 oopses when pulling a USB CDROM Alan Stern 2011-07-01 21:04 ` Alan Stern 2011-07-01 21:13 ` James Bottomley 2011-07-02 2:03 ` Alan Stern 2011-07-02 2:03 ` Alan Stern 2011-07-02 6:08 ` Andi Kleen 2011-07-02 12:24 ` James Bottomley 2011-07-02 17:05 ` Andi Kleen 2011-07-02 17:09 ` James Bottomley 2011-07-02 17:09 ` James Bottomley 2011-07-02 18:15 ` Andi Kleen 2011-07-02 18:15 ` Andi Kleen 2011-07-02 20:05 ` Alan Stern 2011-07-02 20:05 ` Alan Stern 2011-07-03 1:16 ` Andi Kleen 2011-07-03 1:16 ` Andi Kleen 2011-07-03 15:29 ` Alan Stern 2011-07-03 15:29 ` Alan Stern 2011-07-03 16:06 ` Alan Stern 2011-07-03 16:06 ` Alan Stern 2011-07-02 17:37 ` Alan Stern 2011-07-02 17:37 ` Alan Stern 2011-07-02 18:11 ` Andi Kleen 2011-07-02 19:59 ` Alan Stern 2011-07-03 1:17 ` Andi Kleen 2011-07-07 20:47 ` solved was " Andi Kleen 2011-07-18 16:59 ` Dan Williams 2011-07-18 18:00 ` Andi Kleen 2011-07-20 9:58 ` Jack Wang 2011-07-20 9:58 ` Jack Wang 2011-10-18 21:16 ` Ankit Jain 2011-10-18 21:30 ` James Bottomley 2011-10-21 13:26 ` Hannes Reinecke 2011-07-03 9:14 ` Dan Williams 2011-07-03 18:16 ` Andi Kleen 2011-07-03 20:37 ` Stefan Richter 2011-07-08 13:37 ` Stefan Richter 2011-07-08 13:41 ` Stefan Richter 2011-07-08 13:41 ` Stefan Richter 2011-07-04 11:27 ` Heiko Carstens 2011-07-04 11:27 ` Heiko Carstens 2011-07-04 16:04 ` Alan Stern 2011-07-06 6:50 ` Heiko Carstens 2011-07-12 18:49 ` Jonathan McDowell 2011-07-02 12:38 ` Alan Stern 2011-07-02 12:38 ` Alan Stern 2011-07-02 18:10 ` Andi Kleen 2011-07-02 18:10 ` Andi Kleen 2011-07-02 12:48 ` Rafael J. Wysocki 2011-07-02 12:48 ` Rafael J. Wysocki 2011-07-02 17:06 ` Andi Kleen 2011-07-01 19:20 ` James Bottomley 2011-07-01 19:33 ` James Bottomley 2011-07-01 19:45 ` James Bottomley
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.