* [3.12-rc] sg_open: leaving the kernel with locks still held! @ 2013-10-22 20:56 Simon Kirby 2013-10-23 0:41 ` Douglas Gilbert 0 siblings, 1 reply; 8+ messages in thread From: Simon Kirby @ 2013-10-22 20:56 UTC (permalink / raw) To: linux-kernel, linux-scsi, Vaughan Cao Hello! While trying to figure out why the request queue to sda (ext4) was clogging up on one of our btrfs backup boxes, I noticed a megarc process in D state, so enabled locking debugging, and got this (on 3.12-rc6): [ 205.372823] ================================================ [ 205.372901] [ BUG: lock held when returning to user space! ] [ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted [ 205.373055] ------------------------------------------------ [ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! [ 205.373212] 1 lock held by megarc.bin/5283: [ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 Vaughan, it seems you touched this area last in 15b06f9a02406e, and git tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, though I haven't tried with lockdep. This is caused by some of our internal RAID monitoring scripts that run "megarc.bin -dispCfg -a0" (even though that controller isn't present on this server -- a PowerEdge 2950 w/Perc 5). strace output of the program execution that causes the above message is here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt Simon-
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-22 20:56 [3.12-rc] sg_open: leaving the kernel with locks still held! Simon Kirby @ 2013-10-23 0:41 ` Douglas Gilbert 2013-10-23 7:44 ` James Bottomley 0 siblings, 1 reply; 8+ messages in thread From: Douglas Gilbert @ 2013-10-23 0:41 UTC (permalink / raw) To: Simon Kirby, linux-kernel, linux-scsi, Vaughan Cao On 13-10-22 04:56 PM, Simon Kirby wrote: > Hello! > > While trying to figure out why the request queue to sda (ext4) was > clogging up on one of our btrfs backup boxes, I noticed a megarc process > in D state, so enabled locking debugging, and got this (on 3.12-rc6): > > [ 205.372823] ================================================ > [ 205.372901] [ BUG: lock held when returning to user space! ] > [ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted > [ 205.373055] ------------------------------------------------ > [ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! > [ 205.373212] 1 lock held by megarc.bin/5283: > [ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 > > Vaughan, it seems you touched this area last in 15b06f9a02406e, and git > tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, > though I haven't tried with lockdep. > > This is caused by some of our internal RAID monitoring scripts that run > "megarc.bin -dispCfg -a0" (even though that controller isn't present on > this server -- a PowerEdge 2950 w/Perc 5). > > strace output of the program execution that causes the above message is > here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt This has been reported. That patch will be reverted or, if there is enough time, a fix will (or at least should) go in before the release of lk 3.12 . See this thread: http://marc.info/?t=138228547300001&r=1&w=2 And you might test the patch and confirm that it does fix the problem (and report back). 
Doug Gilbert
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-23 0:41 ` Douglas Gilbert @ 2013-10-23 7:44 ` James Bottomley 2013-10-23 12:11 ` Josh Boyer 2013-10-23 14:10 ` Douglas Gilbert 0 siblings, 2 replies; 8+ messages in thread From: James Bottomley @ 2013-10-23 7:44 UTC (permalink / raw) To: dgilbert; +Cc: Simon Kirby, linux-kernel, linux-scsi, Vaughan Cao On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote: > On 13-10-22 04:56 PM, Simon Kirby wrote: > > Hello! > > > > While trying to figure out why the request queue to sda (ext4) was > > clogging up on one of our btrfs backup boxes, I noticed a megarc process > > in D state, so enabled locking debugging, and got this (on 3.12-rc6): > > > > [ 205.372823] ================================================ > > [ 205.372901] [ BUG: lock held when returning to user space! ] > > [ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted > > [ 205.373055] ------------------------------------------------ > > [ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! > > [ 205.373212] 1 lock held by megarc.bin/5283: > > [ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 > > > > Vaughan, it seems you touched this area last in 15b06f9a02406e, and git > > tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, > > though I haven't tried with lockdep. > > > > This is caused by some of our internal RAID monitoring scripts that run > > "megarc.bin -dispCfg -a0" (even though that controller isn't present on > > this server -- a PowerEdge 2950 w/Perc 5). > > > > strace output of the program execution that causes the above message is > > here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt > > This has been reported. That patch will be reverted or, > if there is enough time, a fix will (or at least should) > go in before the release of lk 3.12 . I think you've got about a week to prove you can fix it (before 3.12 goes final). 
I'll send my current set of fixes to Linus without doing anything about sg. James
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-23 7:44 ` James Bottomley @ 2013-10-23 12:11 ` Josh Boyer 2013-10-23 12:22 ` James Bottomley 2013-10-23 14:10 ` Douglas Gilbert 1 sibling, 1 reply; 8+ messages in thread From: Josh Boyer @ 2013-10-23 12:11 UTC (permalink / raw) To: James Bottomley Cc: dgilbert, Simon Kirby, Linux-Kernel@Vger. Kernel. Org, linux-scsi, Vaughan Cao On Wed, Oct 23, 2013 at 12:44 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote: >> On 13-10-22 04:56 PM, Simon Kirby wrote: >> > Hello! >> > >> > While trying to figure out why the request queue to sda (ext4) was >> > clogging up on one of our btrfs backup boxes, I noticed a megarc process >> > in D state, so enabled locking debugging, and got this (on 3.12-rc6): >> > >> > [ 205.372823] ================================================ >> > [ 205.372901] [ BUG: lock held when returning to user space! ] >> > [ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted >> > [ 205.373055] ------------------------------------------------ >> > [ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! >> > [ 205.373212] 1 lock held by megarc.bin/5283: >> > [ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 >> > >> > Vaughan, it seems you touched this area last in 15b06f9a02406e, and git >> > tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, >> > though I haven't tried with lockdep. >> > >> > This is caused by some of our internal RAID monitoring scripts that run >> > "megarc.bin -dispCfg -a0" (even though that controller isn't present on >> > this server -- a PowerEdge 2950 w/Perc 5). >> > >> > strace output of the program execution that causes the above message is >> > here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt >> >> This has been reported. 
That patch will be reverted or, >> if there is enough time, a fix will (or at least should) >> go in before the release of lk 3.12 . > > I think you've got about a week to prove you can fix it (before 3.12 > goes final). I'll send my current set of fixes to Linus without doing > anything about sg. In the event that a suitable fix isn't found, are you going to revert the commit(s) that caused the issue? josh
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-23 12:11 ` Josh Boyer @ 2013-10-23 12:22 ` James Bottomley 0 siblings, 0 replies; 8+ messages in thread From: James Bottomley @ 2013-10-23 12:22 UTC (permalink / raw) To: Josh Boyer Cc: dgilbert, Simon Kirby, Linux-Kernel@Vger. Kernel. Org, linux-scsi, Vaughan Cao On Wed, 2013-10-23 at 05:11 -0700, Josh Boyer wrote: > On Wed, Oct 23, 2013 at 12:44 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote: > >> On 13-10-22 04:56 PM, Simon Kirby wrote: > >> > Hello! > >> > > >> > While trying to figure out why the request queue to sda (ext4) was > >> > clogging up on one of our btrfs backup boxes, I noticed a megarc process > >> > in D state, so enabled locking debugging, and got this (on 3.12-rc6): > >> > > >> > [ 205.372823] ================================================ > >> > [ 205.372901] [ BUG: lock held when returning to user space! ] > >> > [ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted > >> > [ 205.373055] ------------------------------------------------ > >> > [ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! > >> > [ 205.373212] 1 lock held by megarc.bin/5283: > >> > [ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 > >> > > >> > Vaughan, it seems you touched this area last in 15b06f9a02406e, and git > >> > tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, > >> > though I haven't tried with lockdep. > >> > > >> > This is caused by some of our internal RAID monitoring scripts that run > >> > "megarc.bin -dispCfg -a0" (even though that controller isn't present on > >> > this server -- a PowerEdge 2950 w/Perc 5). > >> > > >> > strace output of the program execution that causes the above message is > >> > here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt > >> > >> This has been reported. 
That patch will be reverted or, > >> if there is enough time, a fix will (or at least should) > >> go in before the release of lk 3.12 . > > > > I think you've got about a week to prove you can fix it (before 3.12 > > goes final). I'll send my current set of fixes to Linus without doing > > anything about sg. > > In the event that a suitable fix isn't found, are you going to revert > the commit(s) that caused the issue? That's what I said I'd do previously, yes. James
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-23 7:44 ` James Bottomley 2013-10-23 12:11 ` Josh Boyer @ 2013-10-23 14:10 ` Douglas Gilbert 2013-10-25 0:37 ` Simon Kirby 1 sibling, 1 reply; 8+ messages in thread From: Douglas Gilbert @ 2013-10-23 14:10 UTC (permalink / raw) To: James Bottomley Cc: Simon Kirby, linux-kernel, linux-scsi, Vaughan Cao, Madper Xie On 13-10-23 03:44 AM, James Bottomley wrote: > On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote: >> On 13-10-22 04:56 PM, Simon Kirby wrote: >>> Hello! >>> >>> While trying to figure out why the request queue to sda (ext4) was >>> clogging up on one of our btrfs backup boxes, I noticed a megarc process >>> in D state, so enabled locking debugging, and got this (on 3.12-rc6): >>> >>> [ 205.372823] ================================================ >>> [ 205.372901] [ BUG: lock held when returning to user space! ] >>> [ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted >>> [ 205.373055] ------------------------------------------------ >>> [ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! >>> [ 205.373212] 1 lock held by megarc.bin/5283: >>> [ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 >>> >>> Vaughan, it seems you touched this area last in 15b06f9a02406e, and git >>> tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, >>> though I haven't tried with lockdep. >>> >>> This is caused by some of our internal RAID monitoring scripts that run >>> "megarc.bin -dispCfg -a0" (even though that controller isn't present on >>> this server -- a PowerEdge 2950 w/Perc 5). >>> >>> strace output of the program execution that causes the above message is >>> here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt >> >> This has been reported. That patch will be reverted or, >> if there is enough time, a fix will (or at least should) >> go in before the release of lk 3.12 . 
> > I think you've got about a week to prove you can fix it (before 3.12 > goes final). I'll send my current set of fixes to Linus without doing > anything about sg. "prove" is a big ask, especially coming from a mathematician. I consider it more hacking (in the golf sense) on my part to tweak well-meaning patches to the sg driver that cause collateral damage. Further, I suspect Vaughan's patch was an attempt to fix damage left by a previous sg_open() hacker. I have asked Simon Kirby to apply the patch: http://marc.info/?l=linux-scsi&m=138237283432010&w=2 and report if it fixes his problems. Further I have written three test programs to test O_EXCL handling on SCSI devices, two of which are in the examples directory of sg3_utils version 1.37 . The latest one (single exclusive writer, multiple readers) can be found in the News section of: http://sg.danny.cz/sg/ These tests don't check all possibilities (e.g. random signals, ml error processing and detached devices) but they are better than nothing. And, as a side issue, they break bsg (cause it ignores O_EXCL) and break the block layer (e.g. /dev/sdb) so perhaps it should be reverted :-) Perhaps the original bug reporter (Madper Xie) might also test the proposed patch and report if it fixes what he saw. Doug Gilbert
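Illustration only, not part of the message above: the idea of probing O_EXCL behaviour on a device node can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not one of the sg3_utils test programs; the helper names try_open and probe_exclusive are hypothetical, and the expectation that a second open fails (probably with EBUSY when O_NONBLOCK is set) while another descriptor holds the node with O_EXCL is an assumption about the sg driver, not something the thread states.

```python
import errno
import os


def try_open(path, exclusive=False):
    """Open path read-write and non-blocking.

    Returns (fd, None) on success or (None, errno_name) on failure.
    O_EXCL without O_CREAT is only meaningful on some device nodes.
    """
    flags = os.O_RDWR | os.O_NONBLOCK
    if exclusive:
        flags |= os.O_EXCL
    try:
        return os.open(path, flags), None
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 2 -> "ENOENT".
        return None, errno.errorcode.get(exc.errno, str(exc.errno))


def probe_exclusive(path):
    """Hold one O_EXCL open and report what a second, shared open does."""
    first_fd, first_err = try_open(path, exclusive=True)
    if first_fd is None:
        # Could not even get the exclusive open; nothing more to probe.
        return {"first": first_err, "second": None}
    try:
        second_fd, second_err = try_open(path, exclusive=False)
        if second_fd is not None:
            os.close(second_fd)
            second_err = "opened"
        return {"first": "opened", "second": second_err}
    finally:
        # Always drop the exclusive holder so the node is not left locked.
        os.close(first_fd)
```

On a machine with an sg node one might call probe_exclusive("/dev/sg0") (the path is an assumption, and opening it needs suitable permissions) and inspect the "second" field. The sketch covers only the single-process case; it exercises none of the signal, error-processing or detached-device paths mentioned above.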
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-23 14:10 ` Douglas Gilbert @ 2013-10-25 0:37 ` Simon Kirby 2013-10-25 7:20 ` James Bottomley 0 siblings, 1 reply; 8+ messages in thread From: Simon Kirby @ 2013-10-25 0:37 UTC (permalink / raw) To: Douglas Gilbert Cc: James Bottomley, linux-kernel, linux-scsi, Vaughan Cao, Madper Xie On Wed, Oct 23, 2013 at 10:10:47AM -0400, Douglas Gilbert wrote: > On 13-10-23 03:44 AM, James Bottomley wrote: > >On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote: > >>On 13-10-22 04:56 PM, Simon Kirby wrote: > >>>Hello! > >>> > >>>While trying to figure out why the request queue to sda (ext4) was > >>>clogging up on one of our btrfs backup boxes, I noticed a megarc process > >>>in D state, so enabled locking debugging, and got this (on 3.12-rc6): > >>> > >>>[ 205.372823] ================================================ > >>>[ 205.372901] [ BUG: lock held when returning to user space! ] > >>>[ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted > >>>[ 205.373055] ------------------------------------------------ > >>>[ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! > >>>[ 205.373212] 1 lock held by megarc.bin/5283: > >>>[ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 > >>> > >>>Vaughan, it seems you touched this area last in 15b06f9a02406e, and git > >>>tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, > >>>though I haven't tried with lockdep. > >>> > >>>This is caused by some of our internal RAID monitoring scripts that run > >>>"megarc.bin -dispCfg -a0" (even though that controller isn't present on > >>>this server -- a PowerEdge 2950 w/Perc 5). > >>> > >>>strace output of the program execution that causes the above message is > >>>here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt > >> > >>This has been reported. 
That patch will be reverted or, > >>if there is enough time, a fix will (or at least should) > >>go in before the release of lk 3.12 . > > > >I think you've got about a week to prove you can fix it (before 3.12 > >goes final). I'll send my current set of fixes to Linus without doing > >anything about sg. > > "prove" is a big ask, especially coming from a > mathematician. I consider it more hacking (in the > golf sense) on my part to tweak well-meaning patches > to the sg driver that cause collateral damage. Further, > I suspect Vaughan's patch was an attempt to fix > damage left by a previous sg_open() hacker. > > I have asked Simon Kirby to apply the patch: > http://marc.info/?l=linux-scsi&m=138237283432010&w=2 > and report if it fixes his problems. Further I have > written three test programs to test O_EXCL handling on > SCSI devices, two of which are in the examples directory > of sg3_utils version 1.37 . The latest one (single > exclusive writer, multiple readers) can be found in > the News section of: > http://sg.danny.cz/sg/ > These tests don't check all possibilities (e.g. random > signals, ml error processing and detached devices) but > they are better than nothing. And, as a side issue, they > break bsg (cause it ignores O_EXCL) and break the block > layer (e.g. /dev/sdb) so perhaps it should be reverted :-) Well, this patch works for me in that I see no more lockdep warnings or unintended consequences when running the same "megarc.bin -dispCfg -a0" command. Simon-
* Re: [3.12-rc] sg_open: leaving the kernel with locks still held! 2013-10-25 0:37 ` Simon Kirby @ 2013-10-25 7:20 ` James Bottomley 0 siblings, 0 replies; 8+ messages in thread From: James Bottomley @ 2013-10-25 7:20 UTC (permalink / raw) To: Simon Kirby Cc: Douglas Gilbert, linux-kernel, linux-scsi, Vaughan Cao, Madper Xie On Thu, 2013-10-24 at 17:37 -0700, Simon Kirby wrote: > On Wed, Oct 23, 2013 at 10:10:47AM -0400, Douglas Gilbert wrote: > > > On 13-10-23 03:44 AM, James Bottomley wrote: > > >On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote: > > >>On 13-10-22 04:56 PM, Simon Kirby wrote: > > >>>Hello! > > >>> > > >>>While trying to figure out why the request queue to sda (ext4) was > > >>>clogging up on one of our btrfs backup boxes, I noticed a megarc process > > >>>in D state, so enabled locking debugging, and got this (on 3.12-rc6): > > >>> > > >>>[ 205.372823] ================================================ > > >>>[ 205.372901] [ BUG: lock held when returning to user space! ] > > >>>[ 205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted > > >>>[ 205.373055] ------------------------------------------------ > > >>>[ 205.373132] megarc.bin/5283 is leaving the kernel with locks still held! > > >>>[ 205.373212] 1 lock held by megarc.bin/5283: > > >>>[ 205.373285] #0: (&sdp->o_sem){.+.+..}, at: [<ffffffff8161e650>] sg_open+0x3a0/0x4d0 > > >>> > > >>>Vaughan, it seems you touched this area last in 15b06f9a02406e, and git > > >>>tag --contains says this went in for 3.12-rc. We didn't see this on 3.11, > > >>>though I haven't tried with lockdep. > > >>> > > >>>This is caused by some of our internal RAID monitoring scripts that run > > >>>"megarc.bin -dispCfg -a0" (even though that controller isn't present on > > >>>this server -- a PowerEdge 2950 w/Perc 5). > > >>> > > >>>strace output of the program execution that causes the above message is > > >>>here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt > > >> > > >>This has been reported. 
That patch will be reverted or, > > >>if there is enough time, a fix will (or at least should) > > >>go in before the release of lk 3.12 . > > > > > >I think you've got about a week to prove you can fix it (before 3.12 > > >goes final). I'll send my current set of fixes to Linus without doing > > >anything about sg. > > > > "prove" is a big ask, especially coming from a > > mathematician. I consider it more hacking (in the > > golf sense) on my part to tweak well-meaning patches > > to the sg driver that cause collateral damage. Further, > > I suspect Vaughan's patch was an attempt to fix > > damage left by a previous sg_open() hacker. > > > > I have asked Simon Kirby to apply the patch: > > http://marc.info/?l=linux-scsi&m=138237283432010&w=2 > > and report if it fixes his problems. Further I have > > written three test programs to test O_EXCL handling on > > SCSI devices, two of which are in the examples directory > > of sg3_utils version 1.37 . The latest one (single > > exclusive writer, multiple readers) can be found in > > the News section of: > > http://sg.danny.cz/sg/ > > These tests don't check all possibilities (e.g. random > > signals, ml error processing and detached devices) but > > they are better than nothing. And, as a side issue, they > > break bsg (cause it ignores O_EXCL) and break the block > > layer (e.g. /dev/sdb) so perhaps it should be reverted :-) > > Well, this patch works for me in that I see no more lockdep warnings or > unintended consequences when running the same "megarc.bin -dispCfg -a0" > command. OK, I thought about this some more and I just don't see the problem as being so urgent that we do a fixup patch on the eve of the merge window. Let's just do the revert and then, Doug, do your patch from the revert and I'll put it in in the merge window. James