linux-kernel.vger.kernel.org archive mirror
* i686 hang on boot in userspace
       [not found] <20060714150418.120680@gmx.net>
@ 2006-07-14 19:43 ` john stultz
  2006-07-17 10:52 ` Roman Zippel
  1 sibling, 0 replies; 46+ messages in thread
From: john stultz @ 2006-07-14 19:43 UTC (permalink / raw)
  To: Uwe Bugla
  Cc: Roman Zippel, Valdis.Kletnieks, linux-kernel, akpm, Peter Zijlstra

On Fri, 2006-07-14 at 17:04 +0200, Uwe Bugla wrote:
> Hi everybody,
> first of all thanks to the explanatory hints how a magic Sysrq key works – I've learned a lot.
> 
> I first pressed ALT + PrintScreen + P, then ALT + PrintScreen + T.
> To avoid wordwrapping or other unwanted effects please see the resulting kern.log as outline attachment.
> 
> Could someone please explain to me what's behind that cryptic code?
> Hope I could help - still need a booting kernel, and I think I ain´t the only one.

Hmmm... I don't see anything that sticks out in the sysrq info (well,
not sure if the do_wp_page() is just trace junk or not - CC'ed Peter
just in case). Maybe this is related to the expand files OOM thing that
Martin saw?

If you boot w/ init=/bin/bash do you also see the hang? If you execute
"date" a few times, does it seem to keep proper track of time?

Also could you enable softlockup detection? (CONFIG_DETECT_SOFTLOCKUP)
Which you can find under Kernel debugging in the make menuconfig.
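A minimal sketch of both checks, assuming a POSIX shell is reachable (for example via init=/bin/bash), that `date +%s` is available (a common extension, not strict POSIX), and that the kernel .config is at hand. The helper names clock_delta, clock_ok and config_enabled are made up for illustration, and the one-second tolerance is arbitrary:

```shell
#!/bin/sh
# clock_delta SECONDS -> print how many seconds the wall clock advanced
# while sleeping for SECONDS.
clock_delta() {
    before=$(date +%s)
    sleep "$1"
    after=$(date +%s)
    echo $((after - before))
}

# clock_ok SECONDS -> succeed if the clock advanced by roughly SECONDS
# (the tolerance of one second either way is an arbitrary choice).
clock_ok() {
    delta=$(clock_delta "$1")
    [ "$delta" -ge $(( $1 - 1 )) ] && [ "$delta" -le $(( $1 + 1 )) ]
}

# config_enabled FILE OPTION -> succeed if OPTION is built in (=y) in FILE.
config_enabled() {
    grep -q "^$2=y\$" "$1"
}
```

For example, `clock_ok 5` fails if the clock stalls or jumps during a five-second sleep, and `config_enabled .config CONFIG_DETECT_SOFTLOCKUP` confirms that the option made it into the build.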

thanks
-john


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: i686 hang on boot in userspace
       [not found] <20060714150418.120680@gmx.net>
  2006-07-14 19:43 ` i686 hang on boot in userspace john stultz
@ 2006-07-17 10:52 ` Roman Zippel
  2006-07-17 11:09   ` Roman Zippel
  2006-07-17 13:38   ` Uwe Bugla
  1 sibling, 2 replies; 46+ messages in thread
From: Roman Zippel @ 2006-07-17 10:52 UTC (permalink / raw)
  To: Uwe Bugla; +Cc: Valdis.Kletnieks, linux-kernel, akpm, johnstul

[-- Attachment #1: Type: TEXT/PLAIN, Size: 992 bytes --]

Hi,

On Fri, 14 Jul 2006, Uwe Bugla wrote:

> Hi everybody,
> first of all thanks to the explanatory hints how a magic Sysrq key works – I've learned a lot.
> 
> I first pressed ALT + PrintScreen + P, then ALT + PrintScreen + T.
> To avoid wordwrapping or other unwanted effects please see the resulting kern.log as outline attachment.
> 
> Could someone please explain to me what's behind that cryptic code?

It shows what the kernel is currently doing and where it's spending the
time.
First, your kernel log buffer seems a little small, so not
everything is captured. Could you increase the number in the "Kernel log
buffer size" option (it's in the "Kernel debugging" part of the "Kernel
hacking" menu)?
Second, could you press ALT+PrintScreen+P a few more times (maybe around
10 at least) while the kernel hangs? This would show where the cpu is
spending its time and whether it's at a single place or at different
places.
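A rough sketch of the two requests, under some assumptions: the "Kernel log buffer size" option corresponds to CONFIG_LOG_BUF_SHIFT (the buffer holds 2^shift bytes), and /proc/sysrq-trigger is writable, which needs root, SysRq enabled, and a shell that still responds. On a box that hangs hard, only the keyboard combination will work. The helper names log_buf_bytes and sysrq_repeat are hypothetical:

```shell
#!/bin/sh
# log_buf_bytes SHIFT -> print the log buffer size in bytes for a given
# CONFIG_LOG_BUF_SHIFT value (the buffer is 2^SHIFT bytes).
log_buf_bytes() {
    echo $((1 << $1))
}

# sysrq_repeat KEY COUNT [DELAY] [TRIGGER] -> write KEY to the SysRq
# trigger file COUNT times, DELAY seconds apart (default: 1 second,
# /proc/sysrq-trigger). Each write behaves like one Alt+SysRq+KEY press.
sysrq_repeat() {
    key=$1
    count=$2
    delay=${3:-1}
    trigger=${4:-/proc/sysrq-trigger}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$key" > "$trigger" || return 1
        i=$((i + 1))
        sleep "$delay"
    done
}
```

Here `log_buf_bytes 17` gives the size for a shift of 17, and `sysrq_repeat p 10` would request ten register dumps one second apart.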
Thanks.

bye, Roman

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: i686 hang on boot in userspace
  2006-07-17 10:52 ` Roman Zippel
@ 2006-07-17 11:09   ` Roman Zippel
  2006-07-17 13:38   ` Uwe Bugla
  1 sibling, 0 replies; 46+ messages in thread
From: Roman Zippel @ 2006-07-17 11:09 UTC (permalink / raw)
  To: Uwe Bugla; +Cc: Valdis.Kletnieks, linux-kernel, akpm, johnstul

Hi,

On Mon, 17 Jul 2006, Roman Zippel wrote:

> > I first pressed ALT + PrintScreen + P, then ALT + PrintScreen + T.
> > To avoid wordwrapping or other unwanted effects please see the resulting kern.log as outline attachment.

I almost forgot, please compress it next time, the list has a size limit.
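One way to do that, sketched with gzip; the helper name compress_log is hypothetical and the input file is left untouched:

```shell
#!/bin/sh
# compress_log FILE -> write a maximally compressed copy to FILE.gz,
# keeping the original file in place.
compress_log() {
    gzip -9 -c "$1" > "$1.gz"
}
```

For example, `compress_log kern.log` produces kern.log.gz, which can then be attached instead of the raw log.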

bye, Roman

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 10:52 ` Roman Zippel
  2006-07-17 11:09   ` Roman Zippel
@ 2006-07-17 13:38   ` Uwe Bugla
  2006-07-17 14:17     ` Roman Zippel
  1 sibling, 1 reply; 46+ messages in thread
From: Uwe Bugla @ 2006-07-17 13:38 UTC (permalink / raw)
  To: Roman Zippel; +Cc: johnstul, akpm, linux-kernel, Valdis.Kletnieks


-------- Original Message --------
Date: Mon, 17 Jul 2006 12:52:28 +0200 (CEST)
From: Roman Zippel <zippel@linux-m68k.org>
To: Uwe Bugla <uwe.bugla@gmx.de>
Subject: Re: i686 hang on boot in userspace

> Hi,
> 
> On Fri, 14 Jul 2006, Uwe Bugla wrote:
> 
> > Hi everybody,
> > first of all thanks to the explanatory hints how a magic Sysrq key works
> – I've learned a lot.
> > 
> > I first pressed ALT + PrintScreen + P, then ALT + PrintScreen + T.
> > To avoid wordwrapping or other unwanted effects please see the resulting
> kern.log as outline attachment.
> > 
> > Could someone please explain to me what's behind that cryptic code?
> 
> It shows what the kernel is currently is doing and where it's spending the
> time.
> First, your kernel buffer log buffer seems a little small, so not 
> everything is captured. Could you increase the number in the "Kernel log 
> buffer size" option (it's in the "Kernel debugging" part of the "Kernel 
> hacking" menu).
> Second, could you press ALT+PrintScreen+P a few more times (maybe around 
> 10 at least) while the kernel hangs? This would should where the cpu is 
> spending its time and whether it's at a single place or at different 
> places.
> Thanks.
> 
> bye, Roman

Hi Roman, Hi everybody else,
my boot problem is solved within kernel 2.6.18-rc1-mm2!!!

A thousand thanks for all your efforts!

I have compared 18-rc1-mm1 and 18-rc1-mm2.
mm2 contains a patch for timer.c with almost twice as many hunks as the one in mm1.
So I was sure it was a timer.c issue.

Regards

Uwe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 13:38   ` Uwe Bugla
@ 2006-07-17 14:17     ` Roman Zippel
  2006-07-17 14:59       ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: Roman Zippel @ 2006-07-17 14:17 UTC (permalink / raw)
  To: Uwe Bugla; +Cc: johnstul, akpm, linux-kernel, Valdis.Kletnieks

Hi,

On Mon, 17 Jul 2006, Uwe Bugla wrote:

> I have compared 18-rc1-mm1 and 18-rc1-mm2.
> mm2 contains a patch for timer.c owning almost twice as many hunks than mm1.
> In so far I was sure it was a timer.c issue.

You're still guessing, a lot more things changed between 18-rc1-mm1 and 
18-rc1-mm2. It's rather unlikely that the timer changes fixed your 
problem. You might want to try to revert 
ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc1/2.6.18-rc1-mm2/broken-out/improve-timekeeping-resume-robustness.patch
to see whether the problem is back afterwards.
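Reverting a single broken-out patch could look roughly like this. The sketch assumes GNU patch (for --dry-run) and a patch file that was generated with one leading path component; the helper name revert_patch is hypothetical:

```shell
#!/bin/sh
# revert_patch PATCHFILE [TREE] -> reverse-apply PATCHFILE inside TREE
# (default: current directory). PATCHFILE must be an absolute path,
# because the function changes into TREE first. A dry run comes first so
# that a patch which does not reverse cleanly leaves the tree untouched.
revert_patch() {
    patchfile=$1
    tree=${2:-.}
    (
        cd "$tree" &&
        patch -p1 -R --dry-run < "$patchfile" > /dev/null &&
        patch -p1 -R < "$patchfile" > /dev/null
    )
}
```

After a successful revert the kernel has to be rebuilt and booted again to see whether the hang returns.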

bye, Roman

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 14:17     ` Roman Zippel
@ 2006-07-17 14:59       ` gmu 2k6
  2006-07-17 15:21         ` Roman Zippel
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-17 14:59 UTC (permalink / raw)
  To: linux-kernel

On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> Hi,
>
> On Mon, 17 Jul 2006, Uwe Bugla wrote:
>
> > I have compared 18-rc1-mm1 and 18-rc1-mm2.
> > mm2 contains a patch for timer.c owning almost twice as many hunks than mm1.
> > In so far I was sure it was a timer.c issue.
>
> You're still guessing, a lot more things changed between 18-rc1-mm1 and
> 18-rc1-mm2. It's rather unlikely that the timer changes fixed your
> problem. You might want to try to revert
> ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc1/2.6.18-rc1-mm2/broken-out/improve-timekeeping-resume-robustness.patch
> to see whether the problem is back afterwards.

I was preparing a post to lkml about a similar hang which happens
during init. I also saw an error while ntpdate tried to set the
time/get the time. this only happens after I've enabled the NX bit on
the dual 32bit Xeons installed in the HP Proliant Server. it works
flawlessly with 2.6.17.6 (CONFIG_X86_PAE and CONFIG_HIGHMEM_64) but
since 2.6.18-rc2-git4 (including 2.6.18-rc2) it hangs late in the init
process.

could this be related?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 14:59       ` gmu 2k6
@ 2006-07-17 15:21         ` Roman Zippel
  2006-07-17 15:58           ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: Roman Zippel @ 2006-07-17 15:21 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

Hi,

On Mon, 17 Jul 2006, gmu 2k6 wrote:

> I was preparing a post to lkml about a similar hang which happens
> during init. I also saw an error while ntpdate tried to set the
> time/get the time. this only happens after I've enabled the NX bit on
> the dual 32bit Xeons installed in the HP Proliant Server. it works
> flawlessly with 2.6.17.6 (CONFIG_X86_PAE and CONFIG_HIGHMEM_64) but
> since 2.6.18-rc2-git4 (including 2.6.18-rc2) it hangs late in the init
> process.
> 
> could this be related?

Well, it could, but without further information it's impossible to say.
What error did you see with ntpdate? Could you post the kernel messages 
and also insert a few stack traces as mentioned earlier?
Thanks.

bye, Roman

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 15:21         ` Roman Zippel
@ 2006-07-17 15:58           ` gmu 2k6
  2006-07-17 16:02             ` gmu 2k6
  2006-07-17 16:11             ` gmu 2k6
  0 siblings, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-17 15:58 UTC (permalink / raw)
  To: Roman Zippel; +Cc: linux-kernel

On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> Hi,
>
> On Mon, 17 Jul 2006, gmu 2k6 wrote:
>
> > I was preparing a post to lkml about a similar hang which happens
> > during init. I also saw an error while ntpdate tried to set the
> > time/get the time. this only happens after I've enabled the NX bit on
> > the dual 32bit Xeons installed in the HP Proliant Server. it works
> > flawlessly with 2.6.17.6 (CONFIG_X86_PAE and CONFIG_HIGHMEM_64) but
> > since 2.6.18-rc2-git4 (including 2.6.18-rc2) it hangs late in the init
> > process.
> >
> > could this be related?
>
> Well, it could, but without further information it's impossible to say.
> What error did you see with ntpdate? Could you post the kernel messages
> and also insert a few stack traces as mentioned earlier?
> Thanks.

ok, the error printed from ntpdate has to do with networking:
Running ntpdate to synchronize clockError : Temporary failure in name resolution

it then stops after printing the following tg3 initialization line:
tg3: eth0 Flow control is on for TX and on for RX.

right now I'm trying to get SysRq working (my first try with it) so
that I see where it's hanging.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 15:58           ` gmu 2k6
@ 2006-07-17 16:02             ` gmu 2k6
  2006-07-17 17:03               ` Roman Zippel
  2006-07-17 16:11             ` gmu 2k6
  1 sibling, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-17 16:02 UTC (permalink / raw)
  To: linux-kernel

On 7/17/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> > Hi,
> >
> > On Mon, 17 Jul 2006, gmu 2k6 wrote:
> >
> > > I was preparing a post to lkml about a similar hang which happens
> > > during init. I also saw an error while ntpdate tried to set the
> > > time/get the time. this only happens after I've enabled the NX bit on
> > > the dual 32bit Xeons installed in the HP Proliant Server. it works
> > > flawlessly with 2.6.17.6 (CONFIG_X86_PAE and CONFIG_HIGHMEM_64) but
> > > since 2.6.18-rc2-git4 (including 2.6.18-rc2) it hangs late in the init
> > > process.
> > >
> > > could this be related?
> >
> > Well, it could, but without further information it's impossible to say.
> > What error did you see with ntpdate? Could you post the kernel messages
> > and also insert a few stack traces as mentioned earlier?
> > Thanks.
>
> ok, the error printed from ntpdate has to do with networking:
> Running ntpdate to synchronize clockError : Temporary failure in name resolution
>
> it then stops after printing the following tg3 initialization line:
> tg3: eth0 Flow control is on for TX and on for RX.
>
> right now I'm trying to get SysRq working (my first try with it) so
> that I see where it's hanging.

either I'm too dumb or there is an undocumented way to enable SysRq on
bootup or the machine is really hanging hard. I'm not able to use
Alt+Print as nothing happens besides the console showing the typed-in
characters ^[t.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 15:58           ` gmu 2k6
  2006-07-17 16:02             ` gmu 2k6
@ 2006-07-17 16:11             ` gmu 2k6
  1 sibling, 0 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-17 16:11 UTC (permalink / raw)
  To: Roman Zippel; +Cc: linux-kernel

On 7/17/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> > Hi,
> >
> > On Mon, 17 Jul 2006, gmu 2k6 wrote:
> >
> > > I was preparing a post to lkml about a similar hang which happens
> > > during init. I also saw an error while ntpdate tried to set the
> > > time/get the time. this only happens after I've enabled the NX bit on
> > > the dual 32bit Xeons installed in the HP Proliant Server. it works
> > > flawlessly with 2.6.17.6 (CONFIG_X86_PAE and CONFIG_HIGHMEM_64) but
> > > since 2.6.18-rc2-git4 (including 2.6.18-rc2) it hangs late in the init
> > > process.
> > >
> > > could this be related?
> >
> > Well, it could, but without further information it's impossible to say.
> > What error did you see with ntpdate? Could you post the kernel messages
> > and also insert a few stack traces as mentioned earlier?
> > Thanks.
>
> ok, the error printed from ntpdate has to do with networking:
> Running ntpdate to synchronize clockError : Temporary failure in name resolution
>
> it then stops after printing the following tg3 initialization line:
> tg3: eth0 Flow control is on for TX and on for RX.
>
> right now I'm trying to get SysRq working (my first try with it) so
> that I see where it's hanging.

ignore the ntpdate message, it also appears when normally booting as
the tg3 interfaces are not up yet when ntpdate is run, but this is ok
as I've been doing ntpdate @hourly anyway.
therefore, there must be a different issue which I'm not able to debug yet.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 16:02             ` gmu 2k6
@ 2006-07-17 17:03               ` Roman Zippel
  2006-07-17 18:15                 ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: Roman Zippel @ 2006-07-17 17:03 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

Hi,

On Mon, 17 Jul 2006, gmu 2k6 wrote:

> either I'm too dumb or there is an undocumented way to enable SysRq on
> bootup or the machine is really hanging hard. I'm not able use
> Alt+Print as nothing happens besides console showing the typed in
> characters ^[t.

It might be a keyboard problem, try releasing Print, but keeping Alt 
pressed and then try another key.

bye, Roman

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 17:03               ` Roman Zippel
@ 2006-07-17 18:15                 ` gmu 2k6
  2006-07-17 18:17                   ` gmu 2k6
  2006-07-18  9:38                   ` gmu 2k6
  0 siblings, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-17 18:15 UTC (permalink / raw)
  To: Roman Zippel, linux-kernel

On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> Hi,
>
> On Mon, 17 Jul 2006, gmu 2k6 wrote:
>
> > either I'm too dumb or there is an undocumented way to enable SysRq on
> > bootup or the machine is really hanging hard. I'm not able use
> > Alt+Print as nothing happens besides console showing the typed in
> > characters ^[t.
>
> It might be a keyboard problem, try releasing Print, but keeping Alt
> pressed and then try another key.

maybe the problem is HP's Integrated Lights Out Java Applet. I will
try tomorrow morning in the server room.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 18:15                 ` gmu 2k6
@ 2006-07-17 18:17                   ` gmu 2k6
  2006-07-18  9:38                   ` gmu 2k6
  1 sibling, 0 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-17 18:17 UTC (permalink / raw)
  To: linux-kernel

enabling HIGHMEM_64 to get NX bit enabled on a Xeon supporting the bit
does not make the overall system slower with only 4GiB RAM installed,
does it?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-17 18:15                 ` gmu 2k6
  2006-07-17 18:17                   ` gmu 2k6
@ 2006-07-18  9:38                   ` gmu 2k6
  2006-07-19 10:26                     ` gmu 2k6
  1 sibling, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-18  9:38 UTC (permalink / raw)
  To: Roman Zippel, linux-kernel

On 7/17/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> > Hi,
> >
> > On Mon, 17 Jul 2006, gmu 2k6 wrote:
> >
> > > either I'm too dumb or there is an undocumented way to enable SysRq on
> > > bootup or the machine is really hanging hard. I'm not able use
> > > Alt+Print as nothing happens besides console showing the typed in
> > > characters ^[t.
> >
> > It might be a keyboard problem, try releasing Print, but keeping Alt
> > pressed and then try another key.
>
> maybe the problem is HP's Integrated Lights Out Java Applet. I will
> try tomorrow morning in the server room.

yep that Java Applet was the problem. it worked when I was physically
connected by keyboard. I got the following by pressing Alt+SysRq+p but
I'm not sure it helps as being in cpu_idle looks normal to me:
Pid: 0, comm: swapper
EIP: 0060 [<c0101a57>] CPU: 0
EIP is at mwait_idle+0x2a/0x34
EAX: 00000000 EBX: c0414008 ECX: 00000000 EDX: 00000000
ESI: c0414000 EDI: c33984e4 EPP: 00004864 DS: 007b ES: 007b
CR0: 8005003b CR2: b7f818cc CR3: 375e73c0 CR4: 000006f0
[<c0101a175>] cpu_idle+0x63/0x79
[<c041a6cf>] start_kernel+0x262/0x393
[<c041a1c3>] unknown_bootoption+0x0/0x25a

Alt+SysRq+s seems to sync and write the logs to kern.log/messages but
the logs vanish after reboot. Therefore for the time being I had to
write it down by hand but I'm sure there's an elegant way like saving
the logfiles before booting up again via a second system or livecd.
Maybe there's a better way than that?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-18  9:38                   ` gmu 2k6
@ 2006-07-19 10:26                     ` gmu 2k6
  2006-07-24 15:34                       ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-19 10:26 UTC (permalink / raw)
  To: Roman Zippel, linux-kernel

On 7/18/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/17/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> > On 7/17/06, Roman Zippel <zippel@linux-m68k.org> wrote:
> > > Hi,
> > >
> > > On Mon, 17 Jul 2006, gmu 2k6 wrote:
> > >
> > > > either I'm too dumb or there is an undocumented way to enable SysRq on
> > > > bootup or the machine is really hanging hard. I'm not able use
> > > > Alt+Print as nothing happens besides console showing the typed in
> > > > characters ^[t.
> > >
> > > It might be a keyboard problem, try releasing Print, but keeping Alt
> > > pressed and then try another key.
> >
> > maybe the problem is HP's Integrated Lights Out Java Applet. I will
> > try tomorrow morning in the server room.
>
> yep that Java Applet was the problem. it worked when I was physically
> connected by keyboard. I got the following by pressing Alt+SysRq+p but
> I'm not sure it helps as being in cpu_idle looks normal to me:
> Pid: 0, comm: swapper
> EIP: 0060 [<c0101a57>] CPU: 0
> EIP: isat mwait_idle+0x2a/0x34
> EAX: 00000000 EBX: c0414008 ECX: 00000000 EDX: 00000000
> ESI: c0414000 EDI: c33984e4 EPP: 00004864 DS: 007b ES: 007b
> CR0: 8005003b CR2: b7f818cc CR3: 375e73c0 CR4: 000006f0
> [<c0101a175>] cpu_idle+0x63/0x79
> [<c041a6cf>] start_kernel+0x262/0x393
> [<c041a1c3>] unknown_bootoption+0x0/0x25a
>
> Alt+SysRq+s seems to sync and write the logs to kern.log/messages but
> the logs vanish after reboot. Therefore for the time being I had to
> write it down by hand but I'm sure there's an elegant way like saving
> the logfiles before booting up again via a second system or livecd.
> Maybe there's a better way than that?

same boot problem with 2.6.18-rc1-mm2 btw, did not try SysRq for that though.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-19 10:26                     ` gmu 2k6
@ 2006-07-24 15:34                       ` gmu 2k6
  2006-07-25  7:32                         ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-24 15:34 UTC (permalink / raw)
  To: Roman Zippel, linux-kernel, axboe

the problem I have with hangs is related to changes in CFQ and that
CFQ is now the default. 2.6.17-git12 had the problem but booting
it with elevator=deadline fixes the hang.

symptoms encountered during git-bisecting between v2.6.17 and v2.6.18-rc1:
 A hang while starting network services
 B hang while trying to log in
   1 on remote console [not SSH] it hung after typing <uid><CR>
   2 via OpenSSH it hung after typing <pwd><CR> when doing slogin root@<IP>

A is the problem I got in the first place and this definitely seems to be
the case since 2.6.17-git11, although git-bisect pointed me at the following
changeset, which is included since 2.6.17-git12:

caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
by Jens Axboe
titled "[PATCH] cfq-iosched: many performance fixes"

strangely enough it also hangs with 2.6.17-git11, which did not include that
one changeset yet.

to sum it up "elevator=deadline" fixes the problem, so CFQ seems to be
buggy in the current Torvalds 2.6 tree.
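Besides the elevator= boot parameter, the active I/O scheduler can be inspected at runtime through sysfs, where the selected scheduler is shown in square brackets. A small sketch follows; the helper name active_scheduler is hypothetical, and the device name in the usage note is only a placeholder:

```shell
#!/bin/sh
# active_scheduler -> read the contents of a queue/scheduler sysfs file
# on stdin and print the name of the scheduler enclosed in [brackets].
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

For example, `active_scheduler < /sys/block/sda/queue/scheduler` prints the scheduler currently in use for sda; writing a scheduler name such as deadline into that same file switches it for that one device without a reboot.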

therefore, sorry Roman for bugging you, Jens seems to be the right person to
ping about the problem.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-24 15:34                       ` gmu 2k6
@ 2006-07-25  7:32                         ` Jens Axboe
  2006-07-25  8:00                           ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  7:32 UTC (permalink / raw)
  To: gmu 2k6; +Cc: Roman Zippel, linux-kernel

On Mon, Jul 24 2006, gmu 2k6 wrote:
> the problem I have with hangs is related to changes in CFQ and that
> CFQ is now the default. 2.6.17-git12 had the problem but booting
> it with elevator=deadline fixes the hang.
> 
> symptoms encountered during git-bisecting between v2.6.17 and v2.6.18-rc1:
> A hang while starting network services
> B hang while trying to login
>   1 on remote console [not SSH] it hang after typing <uid><CR>
>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin root@<IP>
> 
> A is the problem I got in the first place and this seems to be the
> case since 2.6.17-git11 definitely although git-bisect pointed me at
> the following
> changeset which is included since 2.6.17-git12:
> 
> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> by Jens Axboe
> titled "[PATCH] cfq-iosched: many performance fixes"
> 
> strange enough it also hangs with 2.6.17-git11 which did not include that
> one changeset yet.

So perhaps your bisect isn't 100% trustworthy? Can you do a manual
-gitX bisect to see which 2.6.17-gitX introduced the problem?
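A manual bisect over the numbered snapshots is just a binary search. A minimal sketch, assuming snapshot 0 stands for plain 2.6.17 and that each boot test cleanly answers good or bad; the helper name next_snapshot is hypothetical:

```shell
#!/bin/sh
# next_snapshot GOOD BAD -> print the -gitN number to test next, halfway
# between the newest known-good and the oldest known-bad snapshot.
# Fails (prints nothing) once GOOD and BAD are adjacent, meaning BAD is
# the first snapshot that shows the problem.
next_snapshot() {
    good=$1
    bad=$2
    if [ $((bad - good)) -le 1 ]; then
        return 1
    fi
    echo $(( (good + bad) / 2 ))
}
```

Starting from `next_snapshot 0 11`, each boot result replaces either the good or the bad bound, so about four boots are enough to cover eleven snapshots.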

Also please put a serial console or similar on the machine, so you can
log + store the sysrq+t output.
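A serial console is normally enabled with console= arguments on the kernel command line; when several are given, kernel messages go to all of them and the last one becomes /dev/console. The sketch below only builds such a command line. The port ttyS0 and the 115200 baud rate are assumptions that must match the actual hardware, and the helper name with_serial_console is hypothetical:

```shell
#!/bin/sh
# with_serial_console CMDLINE [PORT] [BAUD] -> print CMDLINE with
# console= arguments appended for the VGA console and a serial port
# (defaults: ttyS0, 115200 baud, no parity, 8 data bits).
with_serial_console() {
    cmdline=$1
    port=${2:-ttyS0}
    baud=${3:-115200}
    echo "$cmdline console=tty0 console=$port,${baud}n8"
}
```

The resulting string goes into the boot loader entry; the other end of the serial line can then log everything, including sysrq+t output, to a file.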

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  8:00                           ` gmu 2k6
@ 2006-07-25  7:41                             ` Jens Axboe
       [not found]                               ` <f96157c40607250120s2554cbc6qbd7c42972b70f6de@mail.gmail.com>
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  7:41 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> the problem I have with hangs is related to changes in CFQ and that
> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> >> it with elevator=deadline fixes the hang.
> >>
> >> symptoms encountered during git-bisecting between v2.6.17 and 
> >v2.6.18-rc1:
> >> A hang while starting network services
> >> B hang while trying to login
> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin 
> >root@<IP>
> >>
> >> A is the problem I got in the first place and this seems to be the
> >> case since 2.6.17-git11 definitely although git-bisect pointed me at
> >> the following
> >> changeset which is included since 2.6.17-git12:
> >>
> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> by Jens Axboe
> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >>
> >> strange enough it also hangs with 2.6.17-git11 which did not include that
> >> one changeset yet.
> >
> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >
> >Also please put a serial console or similar on the machine, so you can
> >log + store the sysrq+t output.
> 
> well I didn't say that caa....fd3 is the exact change which broke it,
> just that it's related to 1) CFQ changes and 2) CFQ being the default
> now.
> I have a Remote Serial Console via HP's integrated Lights-Out Java
> Applet but am not sure how to enable serial console via kernel boot
> params (will try to find out).
> I will first try to find the 2.6.17-git* revision working before
> bisecting it against -git11 or git12.

Thanks, it would be much appreciated if you could narrow it down to a
specific change.

Are you seeing the hang on cciss?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  7:32                         ` Jens Axboe
@ 2006-07-25  8:00                           ` gmu 2k6
  2006-07-25  7:41                             ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25  8:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Mon, Jul 24 2006, gmu 2k6 wrote:
> > the problem I have with hangs is related to changes in CFQ and that
> > CFQ is now the default. 2.6.17-git12 had the problem but booting
> > it with elevator=deadline fixes the hang.
> >
> > symptoms encountered during git-bisecting between v2.6.17 and v2.6.18-rc1:
> > A hang while starting network services
> > B hang while trying to login
> >   1 on remote console [not SSH] it hang after typing <uid><CR>
> >   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin root@<IP>
> >
> > A is the problem I got in the first place and this seems to be the
> > case since 2.6.17-git11 definitely although git-bisect pointed me at
> > the following
> > changeset which is included since 2.6.17-git12:
> >
> > caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > by Jens Axboe
> > titled "[PATCH] cfq-iosched: many performance fixes"
> >
> > strange enough it also hangs with 2.6.17-git11 which did not include that
> > one changeset yet.
>
> So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> -gitX bisect to see which 2.6.17-gitX introduced the problem?
>
> Also please put a serial console or similar on the machine, so you can
> log + store the sysrq+t output.

well I didn't say that caa....fd3 is the exact change which broke it,
just that it's related to 1) CFQ changes and 2) CFQ being the default
now.
I have a Remote Serial Console via HP's integrated Lights-Out Java
Applet but am not sure how to enable serial console via kernel boot
params (will try to find out).
I will first try to find a working 2.6.17-git* revision before
bisecting it against -git11 or -git12.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  8:28                                   ` gmu 2k6
@ 2006-07-25  8:08                                     ` Jens Axboe
  2006-07-25  9:17                                       ` gmu 2k6
  2006-07-25  9:51                                       ` gmu 2k6
  0 siblings, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  8:08 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> >> >> the problem I have with hangs is related to changes in CFQ and that
> >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> >> >> >> it with elevator=deadline fixes the hang.
> >> >> >>
> >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> >> >> >v2.6.18-rc1:
> >> >> >> A hang while starting network services
> >> >> >> B hang while trying to login
> >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> >> >> >root@<IP>
> >> >> >>
> >> >> >> A is the problem I got in the first place and this seems to be the
> >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed me 
> >at
> >> >> >> the following
> >> >> >> changeset which is included since 2.6.17-git12:
> >> >> >>
> >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> >> >> by Jens Axboe
> >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >> >> >>
> >> >> >> strange enough it also hangs with 2.6.17-git11 which did not 
> >include
> >> >that
> >> >> >> one changeset yet.
> >> >> >
> >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >> >> >
> >> >> >Also please put a serial console or similar on the machine, so you 
> >can
> >> >> >log + store the sysrq+t output.
> >> >>
> >> >> well I didn't say that caa....fd3 is the exact change which broke it,
> >> >> just that it's related to 1) CFQ changes and 2) CFQ being the default
> >> >> now.
> >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> >> >> Applet but am not sure how to enable serial console via kernel boot
> >> >> params (will try to find out).
> >> >> I will first try to find the 2.6.17-git* revision working before
> >> >> bisecting it against -git11 or git12.
> >> >
> >> >Thanks, would be much appreciated to try and narrow it down to a
> >> >specific fix.
> >> >
> >> >Are you seeing the hang on cciss?
> >>
> >> I'm not sure it is in the cciss driver, but the SmartArray is driven by
> >> cciss.
> >> starting git<11 boot tests in a minute now.
> >
> >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> >between cciss and cfq I think, so it would be very useful if you can pin
> >point which of the cfq patches make it stall.
> 
> is there anything special about cciss or did you just deduce that it
> must be cciss in that particular box and are suspecting interaction
> problems with that driver and your CFQ changes?

Nothing really special about cciss, but a few months ago I had a similar
discussion about cciss and a strange hang.

If possible, please also try a known bad kernel and apply the below
patch and see if it still reproduces:

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 1c4df22..2b36e7a 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
 	cmd->rq->completion_data = cmd;
 	cmd->rq->errors = status;
 	blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+#if 1
+	cciss_softirq_done(cmd->rq);
+#else
 	blk_complete_request(cmd->rq);
+#endif
 }
 
 /*

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
       [not found]                                 ` <20060725080002.GD4044@suse.de>
@ 2006-07-25  8:28                                   ` gmu 2k6
  2006-07-25  8:08                                     ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25  8:28 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > >> >> the problem I have with hangs is related to changes in CFQ and that
> > >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> > >> >> it with elevator=deadline fixes the hang.
> > >> >>
> > >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > >> >v2.6.18-rc1:
> > >> >> A hang while starting network services
> > >> >> B hang while trying to login
> > >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> > >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> > >> >root@<IP>
> > >> >>
> > >> >> A is the problem I got in the first place and this seems to be the
> > >> >> case since 2.6.17-git11 definitely although git-bisect pointed me at
> > >> >> the following
> > >> >> changeset which is included since 2.6.17-git12:
> > >> >>
> > >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > >> >> by Jens Axboe
> > >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > >> >>
> > >> >> strange enough it also hangs with 2.6.17-git11 which did not include
> > >that
> > >> >> one changeset yet.
> > >> >
> > >> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> > >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > >> >
> > >> >Also please put a serial console or similar on the machine, so you can
> > >> >log + store the sysrq+t output.
> > >>
> > >> well I didn't say that caa....fd3 is the exact change which broke it,
> > >> just that it's related to 1) CFQ changes and 2) CFQ being the default
> > >> now.
> > >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> > >> Applet but am not sure how to enable serial console via kernel boot
> > >> params (will try to find out).
> > >> I will first try to find the 2.6.17-git* revision working before
> > >> bisecting it against -git11 or git12.
> > >
> > >Thanks, would be much appreciated to try and narrow it down to a
> > >specific fix.
> > >
> > >Are you seeing the hang on cciss?
> >
> > I'm not sure it is in the cciss driver, but the SmartArray is driven by
> > cciss.
> > starting git<11 boot tests in a minute now.
>
> Ok, thanks for confirming it's cciss. The bug is likely an interaction
> between cciss and cfq I think, so it would be very useful if you can pin
> point which of the cfq patches make it stall.

is there anything special about cciss or did you just deduce that it
must be cciss in that particular box and are suspecting interaction
problems with that driver and your CFQ changes?

* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:17                                       ` gmu 2k6
@ 2006-07-25  8:57                                         ` Jens Axboe
  2006-07-25 10:09                                           ` gmu 2k6
  2006-07-25  9:20                                         ` gmu 2k6
  1 sibling, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  8:57 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> >> >> >> the problem I have with hangs is related to changes in CFQ and 
> >that
> >> >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> >> >> >> >> it with elevator=deadline fixes the hang.
> >> >> >> >>
> >> >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> >> >> >> >v2.6.18-rc1:
> >> >> >> >> A hang while starting network services
> >> >> >> >> B hang while trying to login
> >> >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> >> >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> >> >> >> >root@<IP>
> >> >> >> >>
> >> >> >> >> A is the problem I got in the first place and this seems to be 
> >the
> >> >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed 
> >me
> >> >at
> >> >> >> >> the following
> >> >> >> >> changeset which is included since 2.6.17-git12:
> >> >> >> >>
> >> >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> >> >> >> by Jens Axboe
> >> >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >> >> >> >>
> >> >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> >> >include
> >> >> >that
> >> >> >> >> one changeset yet.
> >> >> >> >
> >> >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a 
> >manual
> >> >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >> >> >> >
> >> >> >> >Also please put a serial console or similar on the machine, so you
> >> >can
> >> >> >> >log + store the sysrq+t output.
> >> >> >>
> >> >> >> well I didn't say that caa....fd3 is the exact change which broke 
> >it,
> >> >> >> just that it's related to 1) CFQ changes and 2) CFQ being the 
> >default
> >> >> >> now.
> >> >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> >> >> >> Applet but am not sure how to enable serial console via kernel boot
> >> >> >> params (will try to find out).
> >> >> >> I will first try to find the 2.6.17-git* revision working before
> >> >> >> bisecting it against -git11 or git12.
> >> >> >
> >> >> >Thanks, would be much appreciated to try and narrow it down to a
> >> >> >specific fix.
> >> >> >
> >> >> >Are you seeing the hang on cciss?
> >> >>
> >> >> I'm not sure it is in the cciss driver, but the SmartArray is driven 
> >by
> >> >> cciss.
> >> >> starting git<11 boot tests in a minute now.
> >> >
> >> >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> >> >between cciss and cfq I think, so it would be very useful if you can pin
> >> >point which of the cfq patches make it stall.
> >>
> >> is there anything special about cciss or did you just deduce that it
> >> must be cciss in that particular box and are suspecting interaction
> >> problems with that driver and your CFQ changes?
> >
> >Nothing really special about cciss, but a few months ago I had a similar
> >discussion about cciss and a strange hang.
> >
> >If possible, please also try a known bad kernel and apply the below
> >patch and see if it still reproduces:
> >
> >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >index 1c4df22..2b36e7a 100644
> >--- a/drivers/block/cciss.c
> >+++ b/drivers/block/cciss.c
> >@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> >        cmd->rq->completion_data = cmd;
> >        cmd->rq->errors = status;
> >        blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> >+#if 1
> >+       cciss_softirq_done(cmd->rq);
> >+#else
> >        blk_complete_request(cmd->rq);
> >+#endif
> > }
> >
> > /*
> 
> manually nailed it down to 2.6.17-git7 being the first broken revision.
> going to try whether Linus' git tree knows the -git revisions and do a 
> bisect
> otherwise interdiff and looking for CFQ or cciss changes as best I can.

Hmm, there are no cfq/cciss changes between git6 and git7. Some SCSI
changes, though. Are you using SCSI for anything?

We really need that sysrq-t dump.
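
For reference, the same task dump can be requested without the keyboard
combination by writing the command letter to /proc/sysrq-trigger. A minimal
Python sketch follows, assuming a root shell is still responsive and magic
SysRq support is built into the kernel; the function name and its path
parameter are illustrative only.

```python
def request_task_dump(trigger_path: str = "/proc/sysrq-trigger") -> None:
    """Ask the kernel to dump all task states (the SysRq 't' command).

    Needs root on a real system. The path is a parameter only so the
    function can be pointed at a scratch file when trying it out.
    """
    with open(trigger_path, "w") as trigger:
        trigger.write("t")
```

The dump goes to the kernel log and the console, which is why a serial
console helps here: with something like console=ttyS0,115200n8 on the kernel
command line (port and baud rate are assumptions about the machine), the
output can be captured on another box even if the disk is stalled.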

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:20                                         ` gmu 2k6
@ 2006-07-25  8:57                                           ` Jens Axboe
  2006-07-25  9:35                                           ` gmu 2k6
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  8:57 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> >On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> > >> >> >> the problem I have with hangs is related to changes in CFQ 
> >and that
> >> > >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but 
> >booting
> >> > >> >> >> it with elevator=deadline fixes the hang.
> >> > >> >> >>
> >> > >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> >> > >> >> >v2.6.18-rc1:
> >> > >> >> >> A hang while starting network services
> >> > >> >> >> B hang while trying to login
> >> > >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> >> > >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing 
> >slogin
> >> > >> >> >root@<IP>
> >> > >> >> >>
> >> > >> >> >> A is the problem I got in the first place and this seems to 
> >be the
> >> > >> >> >> case since 2.6.17-git11 definitely although git-bisect 
> >pointed me
> >> > >at
> >> > >> >> >> the following
> >> > >> >> >> changeset which is included since 2.6.17-git12:
> >> > >> >> >>
> >> > >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> > >> >> >> by Jens Axboe
> >> > >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >> > >> >> >>
> >> > >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> >> > >include
> >> > >> >that
> >> > >> >> >> one changeset yet.
> >> > >> >> >
> >> > >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a 
> >manual
> >> > >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >> > >> >> >
> >> > >> >> >Also please put a serial console or similar on the machine, so 
> >you
> >> > >can
> >> > >> >> >log + store the sysrq+t output.
> >> > >> >>
> >> > >> >> well I didn't say that caa....fd3 is the exact change which 
> >broke it,
> >> > >> >> just that it's related to 1) CFQ changes and 2) CFQ being the 
> >default
> >> > >> >> now.
> >> > >> >> I have a Remote Serial Console via HP's integrated Lights-Out 
> >Java
> >> > >> >> Applet but am not sure how to enable serial console via kernel 
> >boot
> >> > >> >> params (will try to find out).
> >> > >> >> I will first try to find the 2.6.17-git* revision working before
> >> > >> >> bisecting it against -git11 or git12.
> >> > >> >
> >> > >> >Thanks, would be much appreciated to try and narrow it down to a
> >> > >> >specific fix.
> >> > >> >
> >> > >> >Are you seeing the hang on cciss?
> >> > >>
> >> > >> I'm not sure it is in the cciss driver, but the SmartArray is 
> >driven by
> >> > >> cciss.
> >> > >> starting git<11 boot tests in a minute now.
> >> > >
> >> > >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> >> > >between cciss and cfq I think, so it would be very useful if you can 
> >pin
> >> > >point which of the cfq patches make it stall.
> >> >
> >> > is there anything special about cciss or did you just deduce that it
> >> > must be cciss in that particular box and are suspecting interaction
> >> > problems with that driver and your CFQ changes?
> >>
> >> Nothing really special about cciss, but a few months ago I had a similar
> >> discussion about cciss and a strange hang.
> >>
> >> If possible, please also try a known bad kernel and apply the below
> >> patch and see if it still reproduces:
> >>
> >> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >> index 1c4df22..2b36e7a 100644
> >> --- a/drivers/block/cciss.c
> >> +++ b/drivers/block/cciss.c
> >> @@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> >>         cmd->rq->completion_data = cmd;
> >>         cmd->rq->errors = status;
> >>         blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> >> +#if 1
> >> +       cciss_softirq_done(cmd->rq);
> >> +#else
> >>         blk_complete_request(cmd->rq);
> >> +#endif
> >>  }
> >>
> >>  /*
> >
> >manually nailed it down to 2.6.17-git7 being the first broken revision.
> >going to try whether Linus' git tree knows the -git revisions and do a 
> >bisect
> >otherwise interdiff and looking for CFQ or cciss changes as best I can.
> 
> oops, doing git-status while running 2.6.17-git6 seems to have locked the 
> box
> again :D, ping works though. *sigh*. Jens I will try your cciss.c change 
> now.

I guess that's a good thing: if it were git7 that introduced it, things
would be looking fishy.

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
  2006-07-25  8:08                                     ` Jens Axboe
@ 2006-07-25  9:17                                       ` gmu 2k6
  2006-07-25  8:57                                         ` Jens Axboe
  2006-07-25  9:20                                         ` gmu 2k6
  2006-07-25  9:51                                       ` gmu 2k6
  1 sibling, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25  9:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > >> >> >> the problem I have with hangs is related to changes in CFQ and that
> > >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> > >> >> >> it with elevator=deadline fixes the hang.
> > >> >> >>
> > >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > >> >> >v2.6.18-rc1:
> > >> >> >> A hang while starting network services
> > >> >> >> B hang while trying to login
> > >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> > >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> > >> >> >root@<IP>
> > >> >> >>
> > >> >> >> A is the problem I got in the first place and this seems to be the
> > >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed me
> > >at
> > >> >> >> the following
> > >> >> >> changeset which is included since 2.6.17-git12:
> > >> >> >>
> > >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > >> >> >> by Jens Axboe
> > >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > >> >> >>
> > >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> > >include
> > >> >that
> > >> >> >> one changeset yet.
> > >> >> >
> > >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> > >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > >> >> >
> > >> >> >Also please put a serial console or similar on the machine, so you
> > >can
> > >> >> >log + store the sysrq+t output.
> > >> >>
> > >> >> well I didn't say that caa....fd3 is the exact change which broke it,
> > >> >> just that it's related to 1) CFQ changes and 2) CFQ being the default
> > >> >> now.
> > >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> > >> >> Applet but am not sure how to enable serial console via kernel boot
> > >> >> params (will try to find out).
> > >> >> I will first try to find the 2.6.17-git* revision working before
> > >> >> bisecting it against -git11 or git12.
> > >> >
> > >> >Thanks, would be much appreciated to try and narrow it down to a
> > >> >specific fix.
> > >> >
> > >> >Are you seeing the hang on cciss?
> > >>
> > >> I'm not sure it is in the cciss driver, but the SmartArray is driven by
> > >> cciss.
> > >> starting git<11 boot tests in a minute now.
> > >
> > >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> > >between cciss and cfq I think, so it would be very useful if you can pin
> > >point which of the cfq patches make it stall.
> >
> > is there anything special about cciss or did you just deduce that it
> > must be cciss in that particular box and are suspecting interaction
> > problems with that driver and your CFQ changes?
>
> Nothing really special about cciss, but a few months ago I had a similar
> discussion about cciss and a strange hang.
>
> If possible, please also try a known bad kernel and apply the below
> patch and see if it still reproduces:
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 1c4df22..2b36e7a 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
>         cmd->rq->completion_data = cmd;
>         cmd->rq->errors = status;
>         blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> +#if 1
> +       cciss_softirq_done(cmd->rq);
> +#else
>         blk_complete_request(cmd->rq);
> +#endif
>  }
>
>  /*

Manually narrowed it down: 2.6.17-git7 is the first broken revision.
I'll check whether Linus' git tree knows the -git revisions and do a bisect;
otherwise I'll interdiff and look for CFQ or cciss changes as best I can.
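
A manual -gitX bisect of this kind is a binary search over the ordered list
of snapshots. A minimal Python sketch follows, assuming the snapshots are
sorted oldest to newest, the newest one is known bad, and the hang is
reproducible (once a snapshot is bad, every later one is too). The names
first_bad and is_bad are hypothetical, and is_bad stands in for "build it,
boot it, see whether it hangs".

```python
from typing import Callable, Sequence


def first_bad(snapshots: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the oldest snapshot for which is_bad() is true.

    Assumes snapshots[-1] is bad and that badness is monotonic.
    """
    lo, hi = 0, len(snapshots) - 1  # invariant: snapshots[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(snapshots[mid]):
            hi = mid        # first bad snapshot is mid or older
        else:
            lo = mid + 1    # first bad snapshot is newer than mid
    return snapshots[lo]


# Illustration only: pretend that -git7 and everything after it hangs.
snapshots = ["2.6.17"] + [f"2.6.17-git{n}" for n in range(1, 13)]
known_hanging = set(snapshots[7:])
print(first_bad(snapshots, lambda name: name in known_hanging))  # 2.6.17-git7
```

With 13 candidates this needs at most four test boots. The result is only as
good as the monotonic assumption: a snapshot that boots cleanly by luck will
send the search in the wrong direction.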

* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:17                                       ` gmu 2k6
  2006-07-25  8:57                                         ` Jens Axboe
@ 2006-07-25  9:20                                         ` gmu 2k6
  2006-07-25  8:57                                           ` Jens Axboe
  2006-07-25  9:35                                           ` gmu 2k6
  1 sibling, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25  9:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > > >> >> >> the problem I have with hangs is related to changes in CFQ and that
> > > >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> > > >> >> >> it with elevator=deadline fixes the hang.
> > > >> >> >>
> > > >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > > >> >> >v2.6.18-rc1:
> > > >> >> >> A hang while starting network services
> > > >> >> >> B hang while trying to login
> > > >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> > > >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> > > >> >> >root@<IP>
> > > >> >> >>
> > > >> >> >> A is the problem I got in the first place and this seems to be the
> > > >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed me
> > > >at
> > > >> >> >> the following
> > > >> >> >> changeset which is included since 2.6.17-git12:
> > > >> >> >>
> > > >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > > >> >> >> by Jens Axboe
> > > >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > > >> >> >>
> > > >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> > > >include
> > > >> >that
> > > >> >> >> one changeset yet.
> > > >> >> >
> > > >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> > > >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > > >> >> >
> > > >> >> >Also please put a serial console or similar on the machine, so you
> > > >can
> > > >> >> >log + store the sysrq+t output.
> > > >> >>
> > > >> >> well I didn't say that caa....fd3 is the exact change which broke it,
> > > >> >> just that it's related to 1) CFQ changes and 2) CFQ being the default
> > > >> >> now.
> > > >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> > > >> >> Applet but am not sure how to enable serial console via kernel boot
> > > >> >> params (will try to find out).
> > > >> >> I will first try to find the 2.6.17-git* revision working before
> > > >> >> bisecting it against -git11 or git12.
> > > >> >
> > > >> >Thanks, would be much appreciated to try and narrow it down to a
> > > >> >specific fix.
> > > >> >
> > > >> >Are you seeing the hang on cciss?
> > > >>
> > > >> I'm not sure it is in the cciss driver, but the SmartArray is driven by
> > > >> cciss.
> > > >> starting git<11 boot tests in a minute now.
> > > >
> > > >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> > > >between cciss and cfq I think, so it would be very useful if you can pin
> > > >point which of the cfq patches make it stall.
> > >
> > > is there anything special about cciss or did you just deduce that it
> > > must be cciss in that particular box and are suspecting interaction
> > > problems with that driver and your CFQ changes?
> >
> > Nothing really special about cciss, but a few months ago I had a similar
> > discussion about cciss and a strange hang.
> >
> > If possible, please also try a known bad kernel and apply the below
> > patch and see if it still reproduces:
> >
> > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > index 1c4df22..2b36e7a 100644
> > --- a/drivers/block/cciss.c
> > +++ b/drivers/block/cciss.c
> > @@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> >         cmd->rq->completion_data = cmd;
> >         cmd->rq->errors = status;
> >         blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> > +#if 1
> > +       cciss_softirq_done(cmd->rq);
> > +#else
> >         blk_complete_request(cmd->rq);
> > +#endif
> >  }
> >
> >  /*
>
> manually nailed it down to 2.6.17-git7 being the first broken revision.
> going to try whether Linus' git tree knows the -git revisions and do a bisect
> otherwise interdiff and looking for CFQ or cciss changes as best I can.

Oops, running git-status on 2.6.17-git6 seems to have locked the box
again :D, though ping still works. *sigh*. Jens, I will try your cciss.c change now.
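
A snapshot that boots once and then locks up under I/O suggests the hang is
intermittent, so a single clean boot is weak evidence that a kernel is good.
One way to account for this is to repeat the test and mark the kernel bad on
the first hang. A minimal Python sketch follows; the callback boot_once and
the default number of attempts are assumptions, not something measured on
this machine.

```python
from typing import Callable


def is_bad_with_retries(boot_once: Callable[[], bool], attempts: int = 5) -> bool:
    """Treat a kernel as bad if any of `attempts` test runs hangs.

    boot_once() is a hypothetical callback that returns True when a boot
    followed by some disk load (for example git status) hangs.
    """
    for _ in range(attempts):
        if boot_once():
            return True   # one observed hang is enough to call it bad
    return False          # no hang seen; "good" only with some confidence


# Illustration only: the first run is clean, the second one hangs.
outcomes = iter([False, True])
print(is_bad_with_retries(lambda: next(outcomes), attempts=3))  # True
```

A "bad" verdict from this check is reliable, while a "good" verdict only
becomes more likely as the number of attempts grows.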

* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:35                                           ` gmu 2k6
@ 2006-07-25  9:24                                             ` Jens Axboe
  2006-07-25 11:29                                             ` Jens Axboe
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  9:24 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> >On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > > >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> > > >> >> >> the problem I have with hangs is related to changes in CFQ 
> >and that
> >> > > >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but 
> >booting
> >> > > >> >> >> it with elevator=deadline fixes the hang.
> >> > > >> >> >>
> >> > > >> >> >> symptoms encountered during git-bisecting between v2.6.17 
> >and
> >> > > >> >> >v2.6.18-rc1:
> >> > > >> >> >> A hang while starting network services
> >> > > >> >> >> B hang while trying to login
> >> > > >> >> >>   1 on remote console [not SSH] it hang after typing 
> ><uid><CR>
> >> > > >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing 
> >slogin
> >> > > >> >> >root@<IP>
> >> > > >> >> >>
> >> > > >> >> >> A is the problem I got in the first place and this seems to 
> >be the
> >> > > >> >> >> case since 2.6.17-git11 definitely although git-bisect 
> >pointed me
> >> > > >at
> >> > > >> >> >> the following
> >> > > >> >> >> changeset which is included since 2.6.17-git12:
> >> > > >> >> >>
> >> > > >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> > > >> >> >> by Jens Axboe
> >> > > >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >> > > >> >> >>
> >> > > >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> >> > > >include
> >> > > >> >that
> >> > > >> >> >> one changeset yet.
> >> > > >> >> >
> >> > > >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a 
> >manual
> >> > > >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >> > > >> >> >
> >> > > >> >> >Also please put a serial console or similar on the machine, 
> >so you
> >> > > >can
> >> > > >> >> >log + store the sysrq+t output.
> >> > > >> >>
> >> > > >> >> well I didn't say that caa....fd3 is the exact change which 
> >broke it,
> >> > > >> >> just that it's related to 1) CFQ changes and 2) CFQ being the 
> >default
> >> > > >> >> now.
> >> > > >> >> I have a Remote Serial Console via HP's integrated Lights-Out 
> >Java
> >> > > >> >> Applet but am not sure how to enable serial console via kernel 
> >boot
> >> > > >> >> params (will try to find out).
> >> > > >> >> I will first try to find the 2.6.17-git* revision working 
> >before
> >> > > >> >> bisecting it against -git11 or git12.
> >> > > >> >
> >> > > >> >Thanks, would be much appreciated to try and narrow it down to a
> >> > > >> >specific fix.
> >> > > >> >
> >> > > >> >Are you seeing the hang on cciss?
> >> > > >>
> >> > > >> I'm not sure it is in the cciss driver, but the SmartArray is 
> >driven by
> >> > > >> cciss.
> >> > > >> starting git<11 boot tests in a minute now.
> >> > > >
> >> > > >Ok, thanks for confirming it's cciss. The bug is likely an 
> >interaction
> >> > > >between cciss and cfq I think, so it would be very useful if you 
> >can pin
> >> > > >point which of the cfq patches make it stall.
> >> > >
> >> > > is there anything special about cciss or did you just deduce that it
> >> > > must be cciss in that particular box and are suspecting interaction
> >> > > problems with that driver and your CFQ changes?
> >> >
> >> > Nothing really special about cciss, but a few months ago I had a 
> >similar
> >> > discussion about cciss and a strange hang.
> >> >
> >> > If possible, please also try a known bad kernel and apply the below
> >> > patch and see if it still reproduces:
> >> >
> >> > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >> > index 1c4df22..2b36e7a 100644
> >> > --- a/drivers/block/cciss.c
> >> > +++ b/drivers/block/cciss.c
> >> > @@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> >> >         cmd->rq->completion_data = cmd;
> >> >         cmd->rq->errors = status;
> >> >         blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> >> > +#if 1
> >> > +       cciss_softirq_done(cmd->rq);
> >> > +#else
> >> >         blk_complete_request(cmd->rq);
> >> > +#endif
> >> >  }
> >> >
> >> >  /*
> >>
> >> manually nailed it down to 2.6.17-git7 being the first broken revision.
> >> going to try whether Linus' git tree knows the -git revisions and do a 
> >bisect
> >> otherwise interdiff and looking for CFQ or cciss changes as best I can.
> >
> >oops, doing git-status while running 2.6.17-git6 seems to have locked the 
> >box
> >again :D, ping works though. *sigh*. Jens I will try your cciss.c change 
> >now.
> 
> ok, let's nail it to 2.6.17-git5 instead as it survived git status
> compared to -git6
> which seems to have correctly booted by accident the last time. timing issues
> I guess.

Then please also try the cciss patch, as suggested.

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:20                                         ` gmu 2k6
  2006-07-25  8:57                                           ` Jens Axboe
@ 2006-07-25  9:35                                           ` gmu 2k6
  2006-07-25  9:24                                             ` Jens Axboe
  2006-07-25 11:29                                             ` Jens Axboe
  1 sibling, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25  9:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > > >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > > > >> >> >> the problem I have with hangs is related to changes in CFQ and that
> > > > >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> > > > >> >> >> it with elevator=deadline fixes the hang.
> > > > >> >> >>
> > > > >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > > > >> >> >v2.6.18-rc1:
> > > > >> >> >> A hang while starting network services
> > > > >> >> >> B hang while trying to login
> > > > >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> > > > >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> > > > >> >> >root@<IP>
> > > > >> >> >>
> > > > >> >> >> A is the problem I got in the first place and this seems to be the
> > > > >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed me
> > > > >at
> > > > >> >> >> the following
> > > > >> >> >> changeset which is included since 2.6.17-git12:
> > > > >> >> >>
> > > > >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > > > >> >> >> by Jens Axboe
> > > > >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > > > >> >> >>
> > > > >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> > > > >include
> > > > >> >that
> > > > >> >> >> one changeset yet.
> > > > >> >> >
> > > > >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> > > > >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > > > >> >> >
> > > > >> >> >Also please put a serial console or similar on the machine, so you
> > > > >can
> > > > >> >> >log + store the sysrq+t output.
> > > > >> >>
> > > > >> >> well I didn't say that caa....fd3 is the exact change which broke it,
> > > > >> >> just that it's related to 1) CFQ changes and 2) CFQ being the default
> > > > >> >> now.
> > > > >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> > > > >> >> Applet but am not sure how to enable serial console via kernel boot
> > > > >> >> params (will try to find out).
> > > > >> >> I will first try to find the 2.6.17-git* revision working before
> > > > >> >> bisecting it against -git11 or git12.
> > > > >> >
> > > > >> >Thanks, would be much appreciated to try and narrow it down to a
> > > > >> >specific fix.
> > > > >> >
> > > > >> >Are you seeing the hang on cciss?
> > > > >>
> > > > >> I'm not sure it is in the cciss driver, but the SmartArray is driven by
> > > > >> cciss.
> > > > >> starting git<11 boot tests in a minute now.
> > > > >
> > > > >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> > > > >between cciss and cfq I think, so it would be very useful if you can pin
> > > > >point which of the cfq patches make it stall.
> > > >
> > > > is there anything special about cciss or did you just deduce that it
> > > > must be cciss in that particular box and are suspecting interaction
> > > > problems with that driver and your CFQ changes?
> > >
> > > Nothing really special about cciss, but a few months ago I had a similar
> > > discussion about cciss and a strange hang.
> > >
> > > If possible, please also try a known bad kernel and apply the below
> > > patch and see if it still reproduces:
> > >
> > > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > > index 1c4df22..2b36e7a 100644
> > > --- a/drivers/block/cciss.c
> > > +++ b/drivers/block/cciss.c
> > > @@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> > >         cmd->rq->completion_data = cmd;
> > >         cmd->rq->errors = status;
> > >         blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> > > +#if 1
> > > +       cciss_softirq_done(cmd->rq);
> > > +#else
> > >         blk_complete_request(cmd->rq);
> > > +#endif
> > >  }
> > >
> > >  /*
> >
> > manually nailed it down to 2.6.17-git7 being the first broken revision.
> > going to try whether Linus' git tree knows the -git revisions and do a bisect
> > otherwise interdiff and looking for CFQ or cciss changes as best I can.
>
> oops, doing git-status while running 2.6.17-git6 seems to have locked the box
> again :D, ping works though. *sigh*. Jens I will try your cciss.c change now.

ok, let's nail it to 2.6.17-git5 instead, as it survived git status,
unlike -git6, which seems to have booted correctly only by accident the
last time. timing issues, I guess.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:51                                       ` gmu 2k6
@ 2006-07-25  9:42                                         ` Jens Axboe
  0 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  9:42 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> >> >> >> the problem I have with hangs is related to changes in CFQ and 
> >that
> >> >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> >> >> >> >> it with elevator=deadline fixes the hang.
> >> >> >> >>
> >> >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> >> >> >> >v2.6.18-rc1:
> >> >> >> >> A hang while starting network services
> >> >> >> >> B hang while trying to login
> >> >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> >> >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> >> >> >> >root@<IP>
> >> >> >> >>
> >> >> >> >> A is the problem I got in the first place and this seems to be 
> >the
> >> >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed 
> >me
> >> >at
> >> >> >> >> the following
> >> >> >> >> changeset which is included since 2.6.17-git12:
> >> >> >> >>
> >> >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> >> >> >> by Jens Axboe
> >> >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >> >> >> >>
> >> >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> >> >include
> >> >> >that
> >> >> >> >> one changeset yet.
> >> >> >> >
> >> >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a 
> >manual
> >> >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >> >> >> >
> >> >> >> >Also please put a serial console or similar on the machine, so you
> >> >can
> >> >> >> >log + store the sysrq+t output.
> >> >> >>
> >> >> >> well I didn't say that caa....fd3 is the exact change which broke 
> >it,
> >> >> >> just that it's related to 1) CFQ changes and 2) CFQ being the 
> >default
> >> >> >> now.
> >> >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> >> >> >> Applet but am not sure how to enable serial console via kernel boot
> >> >> >> params (will try to find out).
> >> >> >> I will first try to find the 2.6.17-git* revision working before
> >> >> >> bisecting it against -git11 or git12.
> >> >> >
> >> >> >Thanks, would be much appreciated to try and narrow it down to a
> >> >> >specific fix.
> >> >> >
> >> >> >Are you seeing the hang on cciss?
> >> >>
> >> >> I'm not sure it is in the cciss driver, but the SmartArray is driven 
> >by
> >> >> cciss.
> >> >> starting git<11 boot tests in a minute now.
> >> >
> >> >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> >> >between cciss and cfq I think, so it would be very useful if you can pin
> >> >point which of the cfq patches make it stall.
> >>
> >> is there anything special about cciss or did you just deduce that it
> >> must be cciss in that particular box and are suspecting interaction
> >> problems with that driver and your CFQ changes?
> >
> >Nothing really special about cciss, but a few months ago I had a similar
> >discussion about cciss and a strange hang.
> >
> >If possible, please also try a known bad kernel and apply the below
> >patch and see if it still reproduces:
> >
> >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >index 1c4df22..2b36e7a 100644
> >--- a/drivers/block/cciss.c
> >+++ b/drivers/block/cciss.c
> >@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> >        cmd->rq->completion_data = cmd;
> >        cmd->rq->errors = status;
> >        blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> >+#if 1
> >+       cciss_softirq_done(cmd->rq);
> >+#else
> >        blk_complete_request(cmd->rq);
> >+#endif
> > }
> >
> > /*
> 
> I patched Linus' HEAD/trunk/master tree with the following and it
> stuck in cciss init.
> I hope I didn't get your diff wrong:

Doh, sorry about that - it needs an unlocking change as well. Try this
one.

> now I'm really going to try to get the remote serial console working.

Thanks!

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 1c4df22..f1ea0d6 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1261,10 +1261,8 @@ #ifdef CCISS_DEBUG
 #endif				/* CCISS_DEBUG */
 
 	add_disk_randomness(rq->rq_disk);
-	spin_lock_irqsave(&h->lock, flags);
 	end_that_request_last(rq, rq->errors);
 	cmd_free(h, cmd, 1);
-	spin_unlock_irqrestore(&h->lock, flags);
 }
 
 /* This function will check the usage_count of the drive to be updated/added.
@@ -2362,7 +2360,11 @@ static inline void complete_command(ctlr
 	cmd->rq->completion_data = cmd;
 	cmd->rq->errors = status;
 	blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+#if 1
+	cciss_softirq_done(cmd->rq);
+#else
 	blk_complete_request(cmd->rq);
+#endif
 }
 
 /*

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 10:09                                           ` gmu 2k6
@ 2006-07-25  9:46                                             ` Jens Axboe
  2006-07-25 10:19                                               ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25  9:46 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >> >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> >> >> >> >> >> the problem I have with hangs is related to changes in CFQ 
> >and
> >> >that
> >> >> >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but 
> >booting
> >> >> >> >> >> it with elevator=deadline fixes the hang.
> >> >> >> >> >>
> >> >> >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> >> >> >> >> >v2.6.18-rc1:
> >> >> >> >> >> A hang while starting network services
> >> >> >> >> >> B hang while trying to login
> >> >> >> >> >>   1 on remote console [not SSH] it hang after typing 
> ><uid><CR>
> >> >> >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing 
> >slogin
> >> >> >> >> >root@<IP>
> >> >> >> >> >>
> >> >> >> >> >> A is the problem I got in the first place and this seems to 
> >be
> >> >the
> >> >> >> >> >> case since 2.6.17-git11 definitely although git-bisect 
> >pointed
> >> >me
> >> >> >at
> >> >> >> >> >> the following
> >> >> >> >> >> changeset which is included since 2.6.17-git12:
> >> >> >> >> >>
> >> >> >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> >> >> >> >> >> by Jens Axboe
> >> >> >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> >> >> >> >> >>
> >> >> >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> >> >> >include
> >> >> >> >that
> >> >> >> >> >> one changeset yet.
> >> >> >> >> >
> >> >> >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a
> >> >manual
> >> >> >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> >> >> >> >> >
> >> >> >> >> >Also please put a serial console or similar on the machine, so 
> >you
> >> >> >can
> >> >> >> >> >log + store the sysrq+t output.
> >> >> >> >>
> >> >> >> >> well I didn't say that caa....fd3 is the exact change which 
> >broke
> >> >it,
> >> >> >> >> just that it's related to 1) CFQ changes and 2) CFQ being the
> >> >default
> >> >> >> >> now.
> >> >> >> >> I have a Remote Serial Console via HP's integrated Lights-Out 
> >Java
> >> >> >> >> Applet but am not sure how to enable serial console via kernel 
> >boot
> >> >> >> >> params (will try to find out).
> >> >> >> >> I will first try to find the 2.6.17-git* revision working before
> >> >> >> >> bisecting it against -git11 or git12.
> >> >> >> >
> >> >> >> >Thanks, would be much appreciated to try and narrow it down to a
> >> >> >> >specific fix.
> >> >> >> >
> >> >> >> >Are you seeing the hang on cciss?
> >> >> >>
> >> >> >> I'm not sure it is in the cciss driver, but the SmartArray is 
> >driven
> >> >by
> >> >> >> cciss.
> >> >> >> starting git<11 boot tests in a minute now.
> >> >> >
> >> >> >Ok, thanks for confirming it's cciss. The bug is likely an 
> >interaction
> >> >> >between cciss and cfq I think, so it would be very useful if you can 
> >pin
> >> >> >point which of the cfq patches make it stall.
> >> >>
> >> >> is there anything special about cciss or did you just deduce that it
> >> >> must be cciss in that particular box and are suspecting interaction
> >> >> problems with that driver and your CFQ changes?
> >> >
> >> >Nothing really special about cciss, but a few months ago I had a similar
> >> >discussion about cciss and a strange hang.
> >> >
> >> >If possible, please also try a known bad kernel and apply the below
> >> >patch and see if it still reproduces:
> >> >
> >> >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >> >index 1c4df22..2b36e7a 100644
> >> >--- a/drivers/block/cciss.c
> >> >+++ b/drivers/block/cciss.c
> >> >@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> >> >        cmd->rq->completion_data = cmd;
> >> >        cmd->rq->errors = status;
> >> >        blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> >> >+#if 1
> >> >+       cciss_softirq_done(cmd->rq);
> >> >+#else
> >> >        blk_complete_request(cmd->rq);
> >> >+#endif
> >> > }
> >> >
> >> > /*
> >>
> >> manually nailed it down to 2.6.17-git7 being the first broken revision.
> >> going to try whether Linus' git tree knows the -git revisions and do a
> >> bisect
> >> otherwise interdiff and looking for CFQ or cciss changes as best I can.
> >
> >Hmm, there are no cfq/cciss changes between git6 and git7. Some SCSI
> >changes, though. Are you using SCSI for anything?
> 
> I thought cciss (SmartArray) was SCSI, isn't it? I guess you mean "no
> James Bottomley changes in the SCSI layer".

Nope, cciss doesn't interact with the SCSI layer except for tapes.

> >We really need that sysrq-t dump.
> 
> I'm not able to get the virtual remote serial console working, so I
> will try to go down to the datacenter and do "1) SysRq-t 2) SysRq-S 3)
> reboot with livecd and get the content of the synced kern.log which
> should contain the SysRq-t output (hopefully).

Ok, hope it works out :)

You can also use netconsole, that might be a lot easier for you. That
just requires networking and netcat at the other end.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  8:08                                     ` Jens Axboe
  2006-07-25  9:17                                       ` gmu 2k6
@ 2006-07-25  9:51                                       ` gmu 2k6
  2006-07-25  9:42                                         ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25  9:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > >> >> >> the problem I have with hangs is related to changes in CFQ and that
> > >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> > >> >> >> it with elevator=deadline fixes the hang.
> > >> >> >>
> > >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > >> >> >v2.6.18-rc1:
> > >> >> >> A hang while starting network services
> > >> >> >> B hang while trying to login
> > >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> > >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> > >> >> >root@<IP>
> > >> >> >>
> > >> >> >> A is the problem I got in the first place and this seems to be the
> > >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed me
> > >at
> > >> >> >> the following
> > >> >> >> changeset which is included since 2.6.17-git12:
> > >> >> >>
> > >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > >> >> >> by Jens Axboe
> > >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > >> >> >>
> > >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> > >include
> > >> >that
> > >> >> >> one changeset yet.
> > >> >> >
> > >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a manual
> > >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > >> >> >
> > >> >> >Also please put a serial console or similar on the machine, so you
> > >can
> > >> >> >log + store the sysrq+t output.
> > >> >>
> > >> >> well I didn't say that caa....fd3 is the exact change which broke it,
> > >> >> just that it's related to 1) CFQ changes and 2) CFQ being the default
> > >> >> now.
> > >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> > >> >> Applet but am not sure how to enable serial console via kernel boot
> > >> >> params (will try to find out).
> > >> >> I will first try to find the 2.6.17-git* revision working before
> > >> >> bisecting it against -git11 or git12.
> > >> >
> > >> >Thanks, would be much appreciated to try and narrow it down to a
> > >> >specific fix.
> > >> >
> > >> >Are you seeing the hang on cciss?
> > >>
> > >> I'm not sure it is in the cciss driver, but the SmartArray is driven by
> > >> cciss.
> > >> starting git<11 boot tests in a minute now.
> > >
> > >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> > >between cciss and cfq I think, so it would be very useful if you can pin
> > >point which of the cfq patches make it stall.
> >
> > is there anything special about cciss or did you just deduce that it
> > must be cciss in that particular box and are suspecting interaction
> > problems with that driver and your CFQ changes?
>
> Nothing really special about cciss, but a few months ago I had a similar
> discussion about cciss and a strange hang.
>
> If possible, please also try a known bad kernel and apply the below
> patch and see if it still reproduces:
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 1c4df22..2b36e7a 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
>         cmd->rq->completion_data = cmd;
>         cmd->rq->errors = status;
>         blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> +#if 1
> +       cciss_softirq_done(cmd->rq);
> +#else
>         blk_complete_request(cmd->rq);
> +#endif
>  }
>
>  /*

I patched Linus' HEAD/trunk/master tree with the following and it got
stuck in cciss init.
I hope I didn't get your diff wrong:
diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 1c4df22..641dc2d 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
        cmd->rq->completion_data = cmd;
        cmd->rq->errors = status;
        blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
-       blk_complete_request(cmd->rq);
+       #if 1
+               cciss_softirq_done(cmd->rq);
+       #else
+               blk_complete_request(cmd->rq);
+       #endif
 }

 /*


now I'm really going to try to get the remote serial console working.

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  8:57                                         ` Jens Axboe
@ 2006-07-25 10:09                                           ` gmu 2k6
  2006-07-25  9:46                                             ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 10:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > >> >> >> >> the problem I have with hangs is related to changes in CFQ and
> > >that
> > >> >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but booting
> > >> >> >> >> it with elevator=deadline fixes the hang.
> > >> >> >> >>
> > >> >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > >> >> >> >v2.6.18-rc1:
> > >> >> >> >> A hang while starting network services
> > >> >> >> >> B hang while trying to login
> > >> >> >> >>   1 on remote console [not SSH] it hang after typing <uid><CR>
> > >> >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing slogin
> > >> >> >> >root@<IP>
> > >> >> >> >>
> > >> >> >> >> A is the problem I got in the first place and this seems to be
> > >the
> > >> >> >> >> case since 2.6.17-git11 definitely although git-bisect pointed
> > >me
> > >> >at
> > >> >> >> >> the following
> > >> >> >> >> changeset which is included since 2.6.17-git12:
> > >> >> >> >>
> > >> >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > >> >> >> >> by Jens Axboe
> > >> >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > >> >> >> >>
> > >> >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> > >> >include
> > >> >> >that
> > >> >> >> >> one changeset yet.
> > >> >> >> >
> > >> >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a
> > >manual
> > >> >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > >> >> >> >
> > >> >> >> >Also please put a serial console or similar on the machine, so you
> > >> >can
> > >> >> >> >log + store the sysrq+t output.
> > >> >> >>
> > >> >> >> well I didn't say that caa....fd3 is the exact change which broke
> > >it,
> > >> >> >> just that it's related to 1) CFQ changes and 2) CFQ being the
> > >default
> > >> >> >> now.
> > >> >> >> I have a Remote Serial Console via HP's integrated Lights-Out Java
> > >> >> >> Applet but am not sure how to enable serial console via kernel boot
> > >> >> >> params (will try to find out).
> > >> >> >> I will first try to find the 2.6.17-git* revision working before
> > >> >> >> bisecting it against -git11 or git12.
> > >> >> >
> > >> >> >Thanks, would be much appreciated to try and narrow it down to a
> > >> >> >specific fix.
> > >> >> >
> > >> >> >Are you seeing the hang on cciss?
> > >> >>
> > >> >> I'm not sure it is in the cciss driver, but the SmartArray is driven
> > >by
> > >> >> cciss.
> > >> >> starting git<11 boot tests in a minute now.
> > >> >
> > >> >Ok, thanks for confirming it's cciss. The bug is likely an interaction
> > >> >between cciss and cfq I think, so it would be very useful if you can pin
> > >> >point which of the cfq patches make it stall.
> > >>
> > >> is there anything special about cciss or did you just deduce that it
> > >> must be cciss in that particular box and are suspecting interaction
> > >> problems with that driver and your CFQ changes?
> > >
> > >Nothing really special about cciss, but a few months ago I had a similar
> > >discussion about cciss and a strange hang.
> > >
> > >If possible, please also try a known bad kernel and apply the below
> > >patch and see if it still reproduces:
> > >
> > >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > >index 1c4df22..2b36e7a 100644
> > >--- a/drivers/block/cciss.c
> > >+++ b/drivers/block/cciss.c
> > >@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> > >        cmd->rq->completion_data = cmd;
> > >        cmd->rq->errors = status;
> > >        blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> > >+#if 1
> > >+       cciss_softirq_done(cmd->rq);
> > >+#else
> > >        blk_complete_request(cmd->rq);
> > >+#endif
> > > }
> > >
> > > /*
> >
> > manually nailed it down to 2.6.17-git7 being the first broken revision.
> > going to try whether Linus' git tree knows the -git revisions and do a
> > bisect
> > otherwise interdiff and looking for CFQ or cciss changes as best I can.
>
> Hmm, there are no cfq/cciss changes between git6 and git7. Some SCSI
> changes, though. Are you using SCSI for anything?

I thought cciss (SmartArray) was SCSI, isn't it? I guess you mean "no
James Bottomley changes in the SCSI layer".

> We really need that sysrq-t dump.

I'm not able to get the virtual remote serial console working, so I
will try to go down to the datacenter and do: 1) SysRq-t, 2) SysRq-S, 3)
reboot with a livecd and get the content of the synced kern.log, which
should contain the SysRq-t output (hopefully).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:46                                             ` Jens Axboe
@ 2006-07-25 10:19                                               ` gmu 2k6
  2006-07-25 10:41                                                 ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 10:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> >> >> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >> >> >> >> >On Mon, Jul 24 2006, gmu 2k6 wrote:
> > >> >> >> >> >> the problem I have with hangs is related to changes in CFQ
> > >and
> > >> >that
> > >> >> >> >> >> CFQ is now the default. 2.6.17-git12 had the problem but
> > >booting
> > >> >> >> >> >> it with elevator=deadline fixes the hang.
> > >> >> >> >> >>
> > >> >> >> >> >> symptoms encountered during git-bisecting between v2.6.17 and
> > >> >> >> >> >v2.6.18-rc1:
> > >> >> >> >> >> A hang while starting network services
> > >> >> >> >> >> B hang while trying to login
> > >> >> >> >> >>   1 on remote console [not SSH] it hang after typing
> > ><uid><CR>
> > >> >> >> >> >>   1 via OpenSSH it hang after typing <pwd><CR> when doing
> > >slogin
> > >> >> >> >> >root@<IP>
> > >> >> >> >> >>
> > >> >> >> >> >> A is the problem I got in the first place and this seems to
> > >be
> > >> >the
> > >> >> >> >> >> case since 2.6.17-git11 definitely although git-bisect
> > >pointed
> > >> >me
> > >> >> >at
> > >> >> >> >> >> the following
> > >> >> >> >> >> changeset which is included since 2.6.17-git12:
> > >> >> >> >> >>
> > >> >> >> >> >> caaa5f9f0a75d1dc5e812e69afdbb8720e077fd3
> > >> >> >> >> >> by Jens Axboe
> > >> >> >> >> >> titled "[PATCH] cfq-iosched: many performance fixes"
> > >> >> >> >> >>
> > >> >> >> >> >> strange enough it also hangs with 2.6.17-git11 which did not
> > >> >> >include
> > >> >> >> >that
> > >> >> >> >> >> one changeset yet.
> > >> >> >> >> >
> > >> >> >> >> >So perhaps your bisect isn't 100% trust worthy? Can you do a
> > >> >manual
> > >> >> >> >> >-gitX bisect to see which 2.6.17-gitX introduced the problem?
> > >> >> >> >> >
> > >> >> >> >> >Also please put a serial console or similar on the machine, so
> > >you
> > >> >> >can
> > >> >> >> >> >log + store the sysrq+t output.
> > >> >> >> >>
> > >> >> >> >> well I didn't say that caa....fd3 is the exact change which
> > >broke
> > >> >it,
> > >> >> >> >> just that it's related to 1) CFQ changes and 2) CFQ being the
> > >> >default
> > >> >> >> >> now.
> > >> >> >> >> I have a Remote Serial Console via HP's integrated Lights-Out
> > >Java
> > >> >> >> >> Applet but am not sure how to enable serial console via kernel
> > >boot
> > >> >> >> >> params (will try to find out).
> > >> >> >> >> I will first try to find the 2.6.17-git* revision working before
> > >> >> >> >> bisecting it against -git11 or git12.
> > >> >> >> >
> > >> >> >> >Thanks, would be much appreciated to try and narrow it down to a
> > >> >> >> >specific fix.
> > >> >> >> >
> > >> >> >> >Are you seeing the hang on cciss?
> > >> >> >>
> > >> >> >> I'm not sure it is in the cciss driver, but the SmartArray is
> > >driven
> > >> >by
> > >> >> >> cciss.
> > >> >> >> starting git<11 boot tests in a minute now.
> > >> >> >
> > >> >> >Ok, thanks for confirming it's cciss. The bug is likely an
> > >interaction
> > >> >> >between cciss and cfq I think, so it would be very useful if you can
> > >pin
> > >> >> >point which of the cfq patches make it stall.
> > >> >>
> > >> >> is there anything special about cciss or did you just deduce that it
> > >> >> must be cciss in that particular box and are suspecting interaction
> > >> >> problems with that driver and your CFQ changes?
> > >> >
> > >> >Nothing really special about cciss, but a few months ago I had a similar
> > >> >discussion about cciss and a strange hang.
> > >> >
> > >> >If possible, please also try a known bad kernel and apply the below
> > >> >patch and see if it still reproduces:
> > >> >
> > >> >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > >> >index 1c4df22..2b36e7a 100644
> > >> >--- a/drivers/block/cciss.c
> > >> >+++ b/drivers/block/cciss.c
> > >> >@@ -2362,7 +2362,11 @@ static inline void complete_command(ctlr
> > >> >        cmd->rq->completion_data = cmd;
> > >> >        cmd->rq->errors = status;
> > >> >        blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
> > >> >+#if 1
> > >> >+       cciss_softirq_done(cmd->rq);
> > >> >+#else
> > >> >        blk_complete_request(cmd->rq);
> > >> >+#endif
> > >> > }
> > >> >
> > >> > /*
> > >>
> > >> manually nailed it down to 2.6.17-git7 being the first broken revision.
> > >> going to try whether Linus' git tree knows the -git revisions and do a
> > >> bisect
> > >> otherwise interdiff and looking for CFQ or cciss changes as best I can.
> > >
> > >Hmm, there are no cfq/cciss changes between git6 and git7. Some SCSI
> > >changes, though. Are you using SCSI for anything?
> >
> > I thought cciss (SmartArray) was SCSI, isn't it? I guess you mean "no
> > James Bottomley changes in the SCSI layer".
>
> Nope, cciss doesn't interact with the SCSI layer except for tapes.
>
> > >We really need that sysrq-t dump.
> >
> > I'm not able to get the virtual remote serial console working, so I
> > will try to go down to the datacenter and do "1) SysRq-t 2) SysRq-S 3)
> > reboot with livecd and get the content of the synced kern.log which
> > should contain the SysRq-t output (hopefully).
>
> Ok, hope it works out :)

I'm going downstairs now.

> You can also use netconsole, that might be a lot easier for you. That
> just requires networking and netcat at the other end.

any howto?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 10:19                                               ` gmu 2k6
@ 2006-07-25 10:41                                                 ` Jens Axboe
  0 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 10:41 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> >You can also use netconsole, that might be a lot easier for you. That
> >just requires networking and netcat at the other end.
> 
> any howto?

Documentation/networking/netconsole.txt

using the modular approach is the easiest.
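
A minimal sketch of the modular setup, assuming the parameter syntax given
in that file: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr].
The helper name netconsole_arg and all addresses are made up for
illustration; 6665 and 6666 are the documented default ports, and an empty
MAC address means broadcast.

```shell
#!/bin/sh
# Build the netconsole= module parameter.
# Arguments: src-port src-ip dev tgt-port tgt-ip tgt-mac
# (helper name and argument order are assumptions for this sketch)
netconsole_arg() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder addresses from the documentation range, not from this thread.
ARG=$(netconsole_arg 6665 192.0.2.10 eth0 6666 192.0.2.20 "")
echo "modprobe netconsole $ARG"

# On the box that hangs (as root, once networking is up):
#   modprobe netconsole "$ARG"
#   dmesg -n 8        # may be needed so the sysrq-t dump reaches the console
# On the receiving box, listen for UDP on the target port:
#   netcat -u -l -p 6666
```

Some netcat variants want "nc -u -l 6666" instead, so check which one is
installed on the listener.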

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25  9:35                                           ` gmu 2k6
  2006-07-25  9:24                                             ` Jens Axboe
@ 2006-07-25 11:29                                             ` Jens Axboe
  2006-07-25 12:47                                               ` gmu 2k6
  1 sibling, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 11:29 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> ok, let's nail it to 2.6.17-git5 instead, as it survived git status,
> unlike -git6, which seems to have booted correctly only by accident the
> last time. timing issues, I guess.

I will try and reproduce it here now. It seems to be in between commit
271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
be bad.

I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
good, so you can try and verify that
dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
two. It's only about 6 commits, so should be quick enough to do.
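
Roughly, and only as a sketch: the wrapper name bisect_between is
hypothetical, and it assumes a clean checkout of Linus' tree as the
current directory.

```shell
#!/bin/sh
# Start a bisect between a known-good and a known-bad commit.
# $1 = good commit, $2 = bad commit (function name is an assumption)
bisect_between() {
    git bisect start &&
    git bisect bad "$2" &&
    git bisect good "$1"
}

# Usage with the two commits named above:
#   bisect_between acf421755593f7d7bd9352d57eda796c6eb4fa43 \
#                  dd67d051529387f6e44d22d1d5540ef281965fdd
# git then checks out a commit in between. Build and boot it, then report:
#   git bisect good     # this kernel did not hang
#   git bisect bad      # this kernel hung
# Repeat until git names the first bad commit, then clean up with:
#   git bisect reset
```

Since the hang looks timing dependent, it is probably worth booting each
candidate more than once before marking it good.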

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 11:29                                             ` Jens Axboe
@ 2006-07-25 12:47                                               ` gmu 2k6
  2006-07-25 12:52                                                 ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 12:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > ok, let's nail it to 2.6.17-git5 instead as it survived git status
> > compared to -git6
> > which seems to have correctly booted by accident the last time. timing issues
> > I guess.
>
> I will try and reproduce it here now. It seems to be in between commit
> 271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> be bad.
>
> I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> good, so you can try and verify that
> dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> two. It's only about 6 commits, so should be quick enough to do.

1) no luck with remote serial console
2) netconsole does not work, although connecting to the listener with netcat
   and sending strings works
I'm gonna try via a physical 9-pin RS-232 port and see how that works.
Afterwards I will try to bisect the revisions you mentioned.

btw, the issue seems to come and go, or is timing-dependent, as I managed to
boot and log into a .17-git6 kernel.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 12:47                                               ` gmu 2k6
@ 2006-07-25 12:52                                                 ` Jens Axboe
  2006-07-25 12:58                                                   ` Jens Axboe
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 12:52 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> ok, let's nail it to 2.6.17-git5 instead as it survived git status
> >> compared to -git6
> >> which seems to have correctly booted by accident the last time. timing issues
> >> I guess.
> >
> >I will try and reproduce it here now. It seems to be in between commit
> >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> >be bad.
> >
> >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> >good, so you can try and verify that
> >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> >two. It's only about 6 commits, so should be quick enough to do.
> 
> 1) no luck with remote serial console
> 2) netconsole does not work, although connecting to the listener with netcat
>    and sending strings works
> I'm gonna try via a physical 9-pin RS-232 port and see how that works.
> Afterwards I will try to bisect the revisions you mentioned.
> 
> btw, the issue seems to come and go, or is timing-dependent, as I managed to
> boot and log into a .17-git6 kernel.

I can reproduce it, you don't have to spend more time on bisecting or
testing. This should fix it:

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 1c4df22..1eac041 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
 	CommandList_struct *cmd = rq->completion_data;
 	ctlr_info_t *h = hba[cmd->ctlr];
 	unsigned long flags;
+	request_queue_t *q;
 	u64bit temp64;
 	int i, ddir;
 
@@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
 	printk("Done with %p\n", rq);
 #endif				/* CCISS_DEBUG */
 
+	q = rq->q;
+
 	add_disk_randomness(rq->rq_disk);
 	spin_lock_irqsave(&h->lock, flags);
 	end_that_request_last(rq, rq->errors);
 	cmd_free(h, cmd, 1);
+	blk_start_queue(q);
 	spin_unlock_irqrestore(&h->lock, flags);
 }
 

A better fix would rework the start_queue logic entirely in the driver,
but the above should get you running for now. I'll take a further look.

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 12:52                                                 ` Jens Axboe
@ 2006-07-25 12:58                                                   ` Jens Axboe
  2006-07-25 14:27                                                     ` gmu 2k6
  2006-07-25 13:13                                                   ` gmu 2k6
  2006-07-25 14:50                                                   ` gmu 2k6
  2 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 12:58 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, Jens Axboe wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> ok, let's nail it to 2.6.17-git5 instead as it survived git status
> > >> compared to -git6
> > >> which seems to have correctly booted by accident the last time. timing issues
> > >> I guess.
> > >
> > >I will try and reproduce it here now. It seems to be in between commit
> > >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> > >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> > >be bad.
> > >
> > >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> > >good, so you can try and verify that
> > >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> > >two. It's only about 6 commits, so should be quick enough to do.
> > 
> > 1) no luck with remote serial console
> > 2) netconsole does not work, although connecting to the listener with netcat
> >    and sending strings works
> > I'm gonna try via a physical 9-pin RS-232 port and see how that works.
> > Afterwards I will try to bisect the revisions you mentioned.
> > 
> > btw, the issue seems to come and go, or is timing-dependent, as I managed to
> > boot and log into a .17-git6 kernel.
> 
> I can reproduce it, you don't have to spend more time on bisecting or
> testing. This should fix it:
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 1c4df22..1eac041 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
>  	CommandList_struct *cmd = rq->completion_data;
>  	ctlr_info_t *h = hba[cmd->ctlr];
>  	unsigned long flags;
> +	request_queue_t *q;
>  	u64bit temp64;
>  	int i, ddir;
>  
> @@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
>  	printk("Done with %p\n", rq);
>  #endif				/* CCISS_DEBUG */
>  
> +	q = rq->q;
> +
>  	add_disk_randomness(rq->rq_disk);
>  	spin_lock_irqsave(&h->lock, flags);
>  	end_that_request_last(rq, rq->errors);
>  	cmd_free(h, cmd, 1);
> +	blk_start_queue(q);
>  	spin_unlock_irqrestore(&h->lock, flags);
>  }
>  
> 
> A better fix would rework the start_queue logic entirely in the driver,
> but the above should get you running for now. I'll take a further look.

Something like this matches the current logic better. It's not very good
from a cpu efficiency point of view, but it's better than what is there
now since at least it's not in hard irq context.

Not tested yet, will do so right now.

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 1c4df22..a9e0510 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1233,6 +1233,50 @@ static inline void complete_buffers(stru
 	}
 }
 
+static void cciss_check_queues(ctlr_info_t *h)
+{
+	int start_queue = h->next_to_run;
+	int i;
+
+	/* check to see if we have maxed out the number of commands that can
+	 * be placed on the queue.  If so then exit.  We do this check here
+	 * in case the interrupt we serviced was from an ioctl and did not
+	 * free any new commands.
+	 */
+	if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
+		return;
+
+	/* We have room on the queue for more commands.  Now we need to queue
+	 * them up.  We will also keep track of the next queue to run so
+	 * that every queue gets a chance to be started first.
+	 */
+	for (i = 0; i < h->highest_lun + 1; i++) {
+		int curr_queue = (start_queue + i) % (h->highest_lun + 1);
+		/* make sure the disk has been added and the drive is real
+		 * because this can be called from the middle of init_one.
+		 */
+		if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
+			continue;
+		blk_start_queue(h->gendisk[curr_queue]->queue);
+
+		/* check to see if we have maxed out the number of commands
+		 * that can be placed on the queue.
+		 */
+		if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
+			if (curr_queue == start_queue) {
+				h->next_to_run =
+				    (start_queue + 1) % (h->highest_lun + 1);
+				break;
+			} else {
+				h->next_to_run = curr_queue;
+				break;
+			}
+		} else {
+			curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
+		}
+	}
+}
+
 static void cciss_softirq_done(struct request *rq)
 {
 	CommandList_struct *cmd = rq->completion_data;
@@ -1264,6 +1308,7 @@ #endif				/* CCISS_DEBUG */
 	spin_lock_irqsave(&h->lock, flags);
 	end_that_request_last(rq, rq->errors);
 	cmd_free(h, cmd, 1);
+	cciss_check_queues(h);
 	spin_unlock_irqrestore(&h->lock, flags);
 }
 
@@ -2528,8 +2573,6 @@ static irqreturn_t do_cciss_intr(int irq
 	CommandList_struct *c;
 	unsigned long flags;
 	__u32 a, a1, a2;
-	int j;
-	int start_queue = h->next_to_run;
 
 	if (interrupt_not_for_us(h))
 		return IRQ_NONE;
@@ -2588,45 +2631,6 @@ #				endif
 		}
 	}
 
-	/* check to see if we have maxed out the number of commands that can
-	 * be placed on the queue.  If so then exit.  We do this check here
-	 * in case the interrupt we serviced was from an ioctl and did not
-	 * free any new commands.
-	 */
-	if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
-		goto cleanup;
-
-	/* We have room on the queue for more commands.  Now we need to queue
-	 * them up.  We will also keep track of the next queue to run so
-	 * that every queue gets a chance to be started first.
-	 */
-	for (j = 0; j < h->highest_lun + 1; j++) {
-		int curr_queue = (start_queue + j) % (h->highest_lun + 1);
-		/* make sure the disk has been added and the drive is real
-		 * because this can be called from the middle of init_one.
-		 */
-		if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
-			continue;
-		blk_start_queue(h->gendisk[curr_queue]->queue);
-
-		/* check to see if we have maxed out the number of commands
-		 * that can be placed on the queue.
-		 */
-		if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
-			if (curr_queue == start_queue) {
-				h->next_to_run =
-				    (start_queue + 1) % (h->highest_lun + 1);
-				goto cleanup;
-			} else {
-				h->next_to_run = curr_queue;
-				goto cleanup;
-			}
-		} else {
-			curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
-		}
-	}
-
-      cleanup:
 	spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags);
 	return IRQ_HANDLED;
 }

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 12:52                                                 ` Jens Axboe
  2006-07-25 12:58                                                   ` Jens Axboe
@ 2006-07-25 13:13                                                   ` gmu 2k6
  2006-07-25 14:50                                                   ` gmu 2k6
  2 siblings, 0 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 13:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

thanks, so I can stop searching for a terminal app (HyperTerminal was not
installed on this Windows box) and escape the 20°C which feels like winter
when it is 35°C outside.

I will test the 2nd patch and let you know.

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> ok, let's nail it to 2.6.17-git5 instead as it survived git status
> > >> compared to -git6
> > >> which seems to have correctly booted by accident the last time. timing issues
> > >> I guess.
> > >
> > >I will try and reproduce it here now. It seems to be in between commit
> > >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> > >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> > >be bad.
> > >
> > >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> > >good, so you can try and verify that
> > >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> > >two. It's only about 6 commits, so should be quick enough to do.
> >
> > 1) no luck with remote serial console
> > 2) netconsole does not work, although connecting to the listener with netcat
> >    and sending strings works
> > I'm gonna try via a physical 9-pin RS-232 port and see how that works.
> > Afterwards I will try to bisect the revisions you mentioned.
> >
> > btw, the issue seems to come and go, or is timing-dependent, as I managed to
> > boot and log into a .17-git6 kernel.
>
> I can reproduce it, you don't have to spend more time on bisecting or
> testing. This should fix it:
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 1c4df22..1eac041 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
>        CommandList_struct *cmd = rq->completion_data;
>        ctlr_info_t *h = hba[cmd->ctlr];
>        unsigned long flags;
> +       request_queue_t *q;
>        u64bit temp64;
>        int i, ddir;
>
> @@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
>        printk("Done with %p\n", rq);
>  #endif                         /* CCISS_DEBUG */
>
> +       q = rq->q;
> +
>        add_disk_randomness(rq->rq_disk);
>        spin_lock_irqsave(&h->lock, flags);
>        end_that_request_last(rq, rq->errors);
>        cmd_free(h, cmd, 1);
> +       blk_start_queue(q);
>        spin_unlock_irqrestore(&h->lock, flags);
>  }
>
>
> A better fix would rework the start_queue logic entirely in the driver,
> but the above should get you running for now. I'll take a further look.
>
> --
> Jens Axboe
>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 12:58                                                   ` Jens Axboe
@ 2006-07-25 14:27                                                     ` gmu 2k6
  2006-07-25 14:29                                                       ` gmu 2k6
  2006-07-25 15:18                                                       ` Jens Axboe
  0 siblings, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 14:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, Jens Axboe wrote:
> > On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > >> ok, let's nail it to 2.6.17-git5 instead as it survived git status
> > > >> compared to -git6
> > > >> which seems to have correctly booted by accident the last time. timing issues
> > > >> I guess.
> > > >
> > > >I will try and reproduce it here now. It seems to be in between commit
> > > >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> > > >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> > > >be bad.
> > > >
> > > >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> > > >good, so you can try and verify that
> > > >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> > > >two. It's only about 6 commits, so should be quick enough to do.
> > >
> > > 1) no luck with remote serial console
> > > 2) netconsole does not work, although connecting to the listener with netcat
> > >    and sending strings works
> > > I'm gonna try via a physical 9-pin RS-232 port and see how that works.
> > > Afterwards I will try to bisect the revisions you mentioned.
> > >
> > > btw, the issue seems to come and go, or is timing-dependent, as I managed to
> > > boot and log into a .17-git6 kernel.
> >
> > I can reproduce it, you don't have to spend more time on bisecting or
> > testing. This should fix it:
> >
> > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > index 1c4df22..1eac041 100644
> > --- a/drivers/block/cciss.c
> > +++ b/drivers/block/cciss.c
> > @@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
> >       CommandList_struct *cmd = rq->completion_data;
> >       ctlr_info_t *h = hba[cmd->ctlr];
> >       unsigned long flags;
> > +     request_queue_t *q;
> >       u64bit temp64;
> >       int i, ddir;
> >
> > @@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
> >       printk("Done with %p\n", rq);
> >  #endif                               /* CCISS_DEBUG */
> >
> > +     q = rq->q;
> > +
> >       add_disk_randomness(rq->rq_disk);
> >       spin_lock_irqsave(&h->lock, flags);
> >       end_that_request_last(rq, rq->errors);
> >       cmd_free(h, cmd, 1);
> > +     blk_start_queue(q);
> >       spin_unlock_irqrestore(&h->lock, flags);
> >  }
> >
> >
> > A better fix would rework the start_queue logic entirely in the driver,
> > but the above should get you running for now. I'll take a further look.
>
> Something like this matches the current logic better. It's not very good
> from a cpu efficiency point of view, but it's better than what is there
> now since at least it's not in hard irq context.
>
> Not tested yet, will do so right now.
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 1c4df22..a9e0510 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -1233,6 +1233,50 @@ static inline void complete_buffers(stru
>         }
>  }
>
> +static void cciss_check_queues(ctlr_info_t *h)
> +{
> +       int start_queue = h->next_to_run;
> +       int i;
> +
> +       /* check to see if we have maxed out the number of commands that can
> +        * be placed on the queue.  If so then exit.  We do this check here
> +        * in case the interrupt we serviced was from an ioctl and did not
> +        * free any new commands.
> +        */
> +       if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
> +               return;
> +
> +       /* We have room on the queue for more commands.  Now we need to queue
> +        * them up.  We will also keep track of the next queue to run so
> +        * that every queue gets a chance to be started first.
> +        */
> +       for (i = 0; i < h->highest_lun + 1; i++) {
> +               int curr_queue = (start_queue + i) % (h->highest_lun + 1);
> +               /* make sure the disk has been added and the drive is real
> +                * because this can be called from the middle of init_one.
> +                */
> +               if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
> +                       continue;
> +               blk_start_queue(h->gendisk[curr_queue]->queue);
> +
> +               /* check to see if we have maxed out the number of commands
> +                * that can be placed on the queue.
> +                */
> +               if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
> +                       if (curr_queue == start_queue) {
> +                               h->next_to_run =
> +                                   (start_queue + 1) % (h->highest_lun + 1);
> +                               break;
> +                       } else {
> +                               h->next_to_run = curr_queue;
> +                               break;
> +                       }
> +               } else {
> +                       curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
> +               }
> +       }
> +}
> +
>  static void cciss_softirq_done(struct request *rq)
>  {
>         CommandList_struct *cmd = rq->completion_data;
> @@ -1264,6 +1308,7 @@ #endif                            /* CCISS_DEBUG */
>         spin_lock_irqsave(&h->lock, flags);
>         end_that_request_last(rq, rq->errors);
>         cmd_free(h, cmd, 1);
> +       cciss_check_queues(h);
>         spin_unlock_irqrestore(&h->lock, flags);
>  }
>
> @@ -2528,8 +2573,6 @@ static irqreturn_t do_cciss_intr(int irq
>         CommandList_struct *c;
>         unsigned long flags;
>         __u32 a, a1, a2;
> -       int j;
> -       int start_queue = h->next_to_run;
>
>         if (interrupt_not_for_us(h))
>                 return IRQ_NONE;
> @@ -2588,45 +2631,6 @@ #                                endif
>                 }
>         }
>
> -       /* check to see if we have maxed out the number of commands that can
> -        * be placed on the queue.  If so then exit.  We do this check here
> -        * in case the interrupt we serviced was from an ioctl and did not
> -        * free any new commands.
> -        */
> -       if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
> -               goto cleanup;
> -
> -       /* We have room on the queue for more commands.  Now we need to queue
> -        * them up.  We will also keep track of the next queue to run so
> -        * that every queue gets a chance to be started first.
> -        */
> -       for (j = 0; j < h->highest_lun + 1; j++) {
> -               int curr_queue = (start_queue + j) % (h->highest_lun + 1);
> -               /* make sure the disk has been added and the drive is real
> -                * because this can be called from the middle of init_one.
> -                */
> -               if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
> -                       continue;
> -               blk_start_queue(h->gendisk[curr_queue]->queue);
> -
> -               /* check to see if we have maxed out the number of commands
> -                * that can be placed on the queue.
> -                */
> -               if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
> -                       if (curr_queue == start_queue) {
> -                               h->next_to_run =
> -                                   (start_queue + 1) % (h->highest_lun + 1);
> -                               goto cleanup;
> -                       } else {
> -                               h->next_to_run = curr_queue;
> -                               goto cleanup;
> -                       }
> -               } else {
> -                       curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
> -               }
> -       }
> -
> -      cleanup:
>         spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags);
>         return IRQ_HANDLED;
>  }

This second patch makes the cciss init hang.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 14:27                                                     ` gmu 2k6
@ 2006-07-25 14:29                                                       ` gmu 2k6
  2006-07-25 15:18                                                       ` Jens Axboe
  1 sibling, 0 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 14:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, gmu 2k6 <gmu2006@gmail.com> wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > On Tue, Jul 25 2006, Jens Axboe wrote:
> > > On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > > > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > > > >> ok, let's nail it to 2.6.17-git5 instead as it survived git status
> > > > >> compared to -git6
> > > > >> which seems to have correctly booted by accident the last time. timing issues
> > > > >> I guess.
> > > > >
> > > > >I will try and reproduce it here now. It seems to be in between commit
> > > > >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> > > > >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> > > > >be bad.
> > > > >
> > > > >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> > > > >good, so you can try and verify that
> > > > >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> > > > >two. It's only about 6 commits, so should be quick enough to do.
> > > >
> > > > 1) no luck with remote serial console
> > > > 2) netconsole does not work, although connecting to the listener with netcat
> > > >    and sending strings works
> > > > I'm gonna try via a physical 9-pin RS-232 port and see how that works.
> > > > Afterwards I will try to bisect the revisions you mentioned.
> > > >
> > > > btw, the issue seems to come and go, or is timing-dependent, as I managed to
> > > > boot and log into a .17-git6 kernel.
> > >
> > > I can reproduce it, you don't have to spend more time on bisecting or
> > > testing. This should fix it:
> > >
> > > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > > index 1c4df22..1eac041 100644
> > > --- a/drivers/block/cciss.c
> > > +++ b/drivers/block/cciss.c
> > > @@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
> > >       CommandList_struct *cmd = rq->completion_data;
> > >       ctlr_info_t *h = hba[cmd->ctlr];
> > >       unsigned long flags;
> > > +     request_queue_t *q;
> > >       u64bit temp64;
> > >       int i, ddir;
> > >
> > > @@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
> > >       printk("Done with %p\n", rq);
> > >  #endif                               /* CCISS_DEBUG */
> > >
> > > +     q = rq->q;
> > > +
> > >       add_disk_randomness(rq->rq_disk);
> > >       spin_lock_irqsave(&h->lock, flags);
> > >       end_that_request_last(rq, rq->errors);
> > >       cmd_free(h, cmd, 1);
> > > +     blk_start_queue(q);
> > >       spin_unlock_irqrestore(&h->lock, flags);
> > >  }
> > >
> > >
> > > A better fix would rework the start_queue logic entirely in the driver,
> > > but the above should get you running for now. I'll take a further look.
> >
> > Something like this matches the current logic better. It's not very good
> > from a cpu efficiency point of view, but it's better than what is there
> > now since at least it's not in hard irq context.
> >
> > Not tested yet, will do so right now.
> >
> > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > index 1c4df22..a9e0510 100644
> > --- a/drivers/block/cciss.c
> > +++ b/drivers/block/cciss.c
> > @@ -1233,6 +1233,50 @@ static inline void complete_buffers(stru
> >         }
> >  }
> >
> > +static void cciss_check_queues(ctlr_info_t *h)
> > +{
> > +       int start_queue = h->next_to_run;
> > +       int i;
> > +
> > +       /* check to see if we have maxed out the number of commands that can
> > +        * be placed on the queue.  If so then exit.  We do this check here
> > +        * in case the interrupt we serviced was from an ioctl and did not
> > +        * free any new commands.
> > +        */
> > +       if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
> > +               return;
> > +
> > +       /* We have room on the queue for more commands.  Now we need to queue
> > +        * them up.  We will also keep track of the next queue to run so
> > +        * that every queue gets a chance to be started first.
> > +        */
> > +       for (i = 0; i < h->highest_lun + 1; i++) {
> > +               int curr_queue = (start_queue + i) % (h->highest_lun + 1);
> > +               /* make sure the disk has been added and the drive is real
> > +                * because this can be called from the middle of init_one.
> > +                */
> > +               if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
> > +                       continue;
> > +               blk_start_queue(h->gendisk[curr_queue]->queue);
> > +
> > +               /* check to see if we have maxed out the number of commands
> > +                * that can be placed on the queue.
> > +                */
> > +               if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
> > +                       if (curr_queue == start_queue) {
> > +                               h->next_to_run =
> > +                                   (start_queue + 1) % (h->highest_lun + 1);
> > +                               break;
> > +                       } else {
> > +                               h->next_to_run = curr_queue;
> > +                               break;
> > +                       }
> > +               } else {
> > +                       curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
> > +               }
> > +       }
> > +}
> > +
> >  static void cciss_softirq_done(struct request *rq)
> >  {
> >         CommandList_struct *cmd = rq->completion_data;
> > @@ -1264,6 +1308,7 @@ #endif                            /* CCISS_DEBUG */
> >         spin_lock_irqsave(&h->lock, flags);
> >         end_that_request_last(rq, rq->errors);
> >         cmd_free(h, cmd, 1);
> > +       cciss_check_queues(h);
> >         spin_unlock_irqrestore(&h->lock, flags);
> >  }
> >
> > @@ -2528,8 +2573,6 @@ static irqreturn_t do_cciss_intr(int irq
> >         CommandList_struct *c;
> >         unsigned long flags;
> >         __u32 a, a1, a2;
> > -       int j;
> > -       int start_queue = h->next_to_run;
> >
> >         if (interrupt_not_for_us(h))
> >                 return IRQ_NONE;
> > @@ -2588,45 +2631,6 @@ #                                endif
> >                 }
> >         }
> >
> > -       /* check to see if we have maxed out the number of commands that can
> > -        * be placed on the queue.  If so then exit.  We do this check here
> > -        * in case the interrupt we serviced was from an ioctl and did not
> > -        * free any new commands.
> > -        */
> > -       if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
> > -               goto cleanup;
> > -
> > -       /* We have room on the queue for more commands.  Now we need to queue
> > -        * them up.  We will also keep track of the next queue to run so
> > -        * that every queue gets a chance to be started first.
> > -        */
> > -       for (j = 0; j < h->highest_lun + 1; j++) {
> > -               int curr_queue = (start_queue + j) % (h->highest_lun + 1);
> > -               /* make sure the disk has been added and the drive is real
> > -                * because this can be called from the middle of init_one.
> > -                */
> > -               if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
> > -                       continue;
> > -               blk_start_queue(h->gendisk[curr_queue]->queue);
> > -
> > -               /* check to see if we have maxed out the number of commands
> > -                * that can be placed on the queue.
> > -                */
> > -               if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
> > -                       if (curr_queue == start_queue) {
> > -                               h->next_to_run =
> > -                                   (start_queue + 1) % (h->highest_lun + 1);
> > -                               goto cleanup;
> > -                       } else {
> > -                               h->next_to_run = curr_queue;
> > -                               goto cleanup;
> > -                       }
> > -               } else {
> > -                       curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
> > -               }
> > -       }
> > -
> > -      cleanup:
> >         spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags);
> >         return IRQ_HANDLED;
> >  }
>
> this makes the cciss init hang.

I've patched Linus' trunk/HEAD with it, btw.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Re: i686 hang on boot in userspace
  2006-07-25 12:52                                                 ` Jens Axboe
  2006-07-25 12:58                                                   ` Jens Axboe
  2006-07-25 13:13                                                   ` gmu 2k6
@ 2006-07-25 14:50                                                   ` gmu 2k6
  2006-07-25 15:19                                                     ` Jens Axboe
  2006-07-25 18:58                                                     ` gmu 2k6
  2 siblings, 2 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 14:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> > >> ok, let's nail it to 2.6.17-git5 instead, as it survived git status,
> > >> compared to -git6, which seems to have booted correctly by accident
> > >> the last time. timing issues, I guess.
> > >
> > >I will try and reproduce it here now. It seems to be in between commit
> > >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> > >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> > >be bad.
> > >
> > >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> > >good, so you can try and verify that
> > >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> > >two. It's only about 6 commits, so should be quick enough to do.
> >
> > 1) no luck with remote serial console
> > 2) netconsole does not work, although connecting to the listener with
> >    netcat and sending strings works
> > I'm gonna try a physical 9-pin RS-232 connection and see how that works.
> > afterwards I will try to bisect the revisions you mentioned.
> >
> > btw, the issue seems to come and go (or is timing-dependent), as I managed
> > to boot and log into a .17-git6 kernel.
>
> I can reproduce it, you don't have to spend more time on bisecting or
> testing. This should fix it:
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 1c4df22..1eac041 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
>         CommandList_struct *cmd = rq->completion_data;
>         ctlr_info_t *h = hba[cmd->ctlr];
>         unsigned long flags;
> +       request_queue_t *q;
>         u64bit temp64;
>         int i, ddir;
>
> @@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
>         printk("Done with %p\n", rq);
>  #endif                         /* CCISS_DEBUG */
>
> +       q = rq->q;
> +
>         add_disk_randomness(rq->rq_disk);
>         spin_lock_irqsave(&h->lock, flags);
>         end_that_request_last(rq, rq->errors);
>         cmd_free(h, cmd, 1);
> +       blk_start_queue(q);
>         spin_unlock_irqrestore(&h->lock, flags);
>  }
>
>
> A better fix would rework the start_queue logic entirely in the driver,
> but the above should get you running for now. I'll take a further look.

this four-liner seems to fix it:
- I can boot
- log in
- git-status works
- svn up works

as my last mail said, the 2nd patch (the one that introduces the new
function) hung cciss on driver init before printing any drive info.

btw, I assume you have systems with SmartArray 6* at your disposal to
test, right? I mean SuSE should have some as a distro vendor.

* Re: Re: i686 hang on boot in userspace
  2006-07-25 14:27                                                     ` gmu 2k6
  2006-07-25 14:29                                                       ` gmu 2k6
@ 2006-07-25 15:18                                                       ` Jens Axboe
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 15:18 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, Jens Axboe wrote:
> >> On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> > >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> > >> ok, let's nail it to 2.6.17-git5 instead, as it survived git status,
> >> > >> compared to -git6, which seems to have booted correctly by accident
> >> > >> the last time. timing issues, I guess.
> >> > >
> >> > >I will try and reproduce it here now. It seems to be in between commit
> >> > >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> >> > >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> >> > >be bad.
> >> > >
> >> > >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> >> > >good, so you can try and verify that
> >> > >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> >> > >two. It's only about 6 commits, so should be quick enough to do.
> >> >
> >> > 1) no luck with remote serial console
> >> > 2) netconsole does not work, although connecting to the listener with
> >> >    netcat and sending strings works
> >> > I'm gonna try a physical 9-pin RS-232 connection and see how that works.
> >> > afterwards I will try to bisect the revisions you mentioned.
> >> >
> >> > btw, the issue seems to come and go (or is timing-dependent), as I managed
> >> > to boot and log into a .17-git6 kernel.
> >>
> >> I can reproduce it, you don't have to spend more time on bisecting or
> >> testing. This should fix it:
> >>
> >> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >> index 1c4df22..1eac041 100644
> >> --- a/drivers/block/cciss.c
> >> +++ b/drivers/block/cciss.c
> >> @@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
> >>       CommandList_struct *cmd = rq->completion_data;
> >>       ctlr_info_t *h = hba[cmd->ctlr];
> >>       unsigned long flags;
> >> +     request_queue_t *q;
> >>       u64bit temp64;
> >>       int i, ddir;
> >>
> >> @@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
> >>       printk("Done with %p\n", rq);
> >>  #endif                               /* CCISS_DEBUG */
> >>
> >> +     q = rq->q;
> >> +
> >>       add_disk_randomness(rq->rq_disk);
> >>       spin_lock_irqsave(&h->lock, flags);
> >>       end_that_request_last(rq, rq->errors);
> >>       cmd_free(h, cmd, 1);
> >> +     blk_start_queue(q);
> >>       spin_unlock_irqrestore(&h->lock, flags);
> >>  }
> >>
> >>
> >> A better fix would rework the start_queue logic entirely in the driver,
> >> but the above should get you running for now. I'll take a further look.
> >
> >Something like this matches the current logic better. It's not very good
> >from a cpu efficiency point of view, but it's better than what is there
> >now since at least it's not in hard irq context.
> >
> >Not tested yet, will do so right now.
> >
> >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >index 1c4df22..a9e0510 100644
> >--- a/drivers/block/cciss.c
> >+++ b/drivers/block/cciss.c
> >@@ -1233,6 +1233,50 @@ static inline void complete_buffers(stru
> >        }
> > }
> >
> >+static void cciss_check_queues(ctlr_info_t *h)
> >+{
> >+       int start_queue = h->next_to_run;
> >+       int i;
> >+
> >+       /* check to see if we have maxed out the number of commands that can
> >+        * be placed on the queue.  If so then exit.  We do this check here
> >+        * in case the interrupt we serviced was from an ioctl and did not
> >+        * free any new commands.
> >+        */
> >+       if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
> >+               return;
> >+
> >+       /* We have room on the queue for more commands.  Now we need to queue
> >+        * them up.  We will also keep track of the next queue to run so
> >+        * that every queue gets a chance to be started first.
> >+        */
> >+       for (i = 0; i < h->highest_lun + 1; i++) {
> >+               int curr_queue = (start_queue + i) % (h->highest_lun + 1);
> >+               /* make sure the disk has been added and the drive is real
> >+                * because this can be called from the middle of init_one.
> >+                */
> >+               if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
> >+                       continue;
> >+               blk_start_queue(h->gendisk[curr_queue]->queue);
> >+
> >+               /* check to see if we have maxed out the number of commands
> >+                * that can be placed on the queue.
> >+                */
> >+               if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
> >+                       if (curr_queue == start_queue) {
> >+                               h->next_to_run =
> >+                                   (start_queue + 1) % (h->highest_lun + 1);
> >+                               break;
> >+                       } else {
> >+                               h->next_to_run = curr_queue;
> >+                               break;
> >+                       }
> >+               } else {
> >+                       curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
> >+               }
> >+       }
> >+}
> >+
> > static void cciss_softirq_done(struct request *rq)
> > {
> >        CommandList_struct *cmd = rq->completion_data;
> >@@ -1264,6 +1308,7 @@ #endif                            /* CCISS_DEBUG */
> >        spin_lock_irqsave(&h->lock, flags);
> >        end_that_request_last(rq, rq->errors);
> >        cmd_free(h, cmd, 1);
> >+       cciss_check_queues(h);
> >        spin_unlock_irqrestore(&h->lock, flags);
> > }
> >
> >@@ -2528,8 +2573,6 @@ static irqreturn_t do_cciss_intr(int irq
> >        CommandList_struct *c;
> >        unsigned long flags;
> >        __u32 a, a1, a2;
> >-       int j;
> >-       int start_queue = h->next_to_run;
> >
> >        if (interrupt_not_for_us(h))
> >                return IRQ_NONE;
> >@@ -2588,45 +2631,6 @@ #                                endif
> >                }
> >        }
> >
> >-       /* check to see if we have maxed out the number of commands that can
> >-        * be placed on the queue.  If so then exit.  We do this check here
> >-        * in case the interrupt we serviced was from an ioctl and did not
> >-        * free any new commands.
> >-        */
> >-       if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS)
> >-               goto cleanup;
> >-
> >-       /* We have room on the queue for more commands.  Now we need to queue
> >-        * them up.  We will also keep track of the next queue to run so
> >-        * that every queue gets a chance to be started first.
> >-        */
> >-       for (j = 0; j < h->highest_lun + 1; j++) {
> >-               int curr_queue = (start_queue + j) % (h->highest_lun + 1);
> >-               /* make sure the disk has been added and the drive is real
> >-                * because this can be called from the middle of init_one.
> >-                */
> >-               if (!(h->drv[curr_queue].queue) || !(h->drv[curr_queue].heads))
> >-                       continue;
> >-               blk_start_queue(h->gendisk[curr_queue]->queue);
> >-
> >-               /* check to see if we have maxed out the number of commands
> >-                * that can be placed on the queue.
> >-                */
> >-               if ((find_first_zero_bit(h->cmd_pool_bits, NR_CMDS)) == NR_CMDS) {
> >-                       if (curr_queue == start_queue) {
> >-                               h->next_to_run =
> >-                                   (start_queue + 1) % (h->highest_lun + 1);
> >-                               goto cleanup;
> >-                       } else {
> >-                               h->next_to_run = curr_queue;
> >-                               goto cleanup;
> >-                       }
> >-               } else {
> >-                       curr_queue = (curr_queue + 1) % (h->highest_lun + 1);
> >-               }
> >-       }
> >-
> >-      cleanup:
> >        spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags);
> >        return IRQ_HANDLED;
> > }
> 
> this makes the cciss init hang.

hmm strange, it works for me. sysrq-t for the hang, please. just note
the top few functions, should be easy enough to write down manually.

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
  2006-07-25 14:50                                                   ` gmu 2k6
@ 2006-07-25 15:19                                                     ` Jens Axboe
  2006-07-25 18:58                                                     ` gmu 2k6
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 15:19 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> >> >On Tue, Jul 25 2006, gmu 2k6 wrote:
> >> >> ok, let's nail it to 2.6.17-git5 instead, as it survived git status,
> >> >> compared to -git6, which seems to have booted correctly by accident
> >> >> the last time. timing issues, I guess.
> >> >
> >> >I will try and reproduce it here now. It seems to be in between commit
> >> >271f18f102c789f59644bb6c53a69da1df72b2f4 and commit
> >> >dd67d051529387f6e44d22d1d5540ef281965fdd where the first one could also
> >> >be bad.
> >> >
> >> >I'm assuming that acf421755593f7d7bd9352d57eda796c6eb4fa43 should be
> >> >good, so you can try and verify that
> >> >dd67d051529387f6e44d22d1d5540ef281965fdd is bad and bisect between the
> >> >two. It's only about 6 commits, so should be quick enough to do.
> >>
> >> 1) no luck with remote serial console
> >> 2) netconsole does not work, although connecting to the listener with
> >>    netcat and sending strings works
> >> I'm gonna try a physical 9-pin RS-232 connection and see how that works.
> >> afterwards I will try to bisect the revisions you mentioned.
> >>
> >> btw, the issue seems to come and go (or is timing-dependent), as I managed
> >> to boot and log into a .17-git6 kernel.
> >
> >I can reproduce it, you don't have to spend more time on bisecting or
> >testing. This should fix it:
> >
> >diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> >index 1c4df22..1eac041 100644
> >--- a/drivers/block/cciss.c
> >+++ b/drivers/block/cciss.c
> >@@ -1238,6 +1238,7 @@ static void cciss_softirq_done(struct re
> >        CommandList_struct *cmd = rq->completion_data;
> >        ctlr_info_t *h = hba[cmd->ctlr];
> >        unsigned long flags;
> >+       request_queue_t *q;
> >        u64bit temp64;
> >        int i, ddir;
> >
> >@@ -1260,10 +1261,13 @@ #ifdef CCISS_DEBUG
> >        printk("Done with %p\n", rq);
> > #endif                         /* CCISS_DEBUG */
> >
> >+       q = rq->q;
> >+
> >        add_disk_randomness(rq->rq_disk);
> >        spin_lock_irqsave(&h->lock, flags);
> >        end_that_request_last(rq, rq->errors);
> >        cmd_free(h, cmd, 1);
> >+       blk_start_queue(q);
> >        spin_unlock_irqrestore(&h->lock, flags);
> > }
> >
> >
> >A better fix would rework the start_queue logic entirely in the driver,
> >but the above should get you running for now. I'll take a further look.
> 
> this four-liner seems to fix it:
> - I can boot
> - log in
> - git-status works
> - svn up works
> 
> as my last mail said, the 2nd patch (the one that introduces the new
> function) hung cciss on driver init before printing any drive info.

The problem with the 4-liner is that it potentially starves some arrays
(in theory). I'll retry the full fix.

> btw, I assume you have systems with SmartArray 6* at your disposal to
> test, right? I mean SuSE should have some as a distro vendor.

I do, I have several right here next to me.

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
  2006-07-25 14:50                                                   ` gmu 2k6
  2006-07-25 15:19                                                     ` Jens Axboe
@ 2006-07-25 18:58                                                     ` gmu 2k6
  2006-07-25 19:21                                                       ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 18:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

thanks Jens,
7b30f09245d0e6868819b946b2f6879e5d3d106b
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b30f09245d0e6868819b946b2f6879e5d3d106b
has fixed the problem (maybe together with the other 3 changes in HEAD
as the 2nd patch in this thread did not work in the first place or maybe
it is a little bit different, no time to check right now).

* Re: Re: i686 hang on boot in userspace
  2006-07-25 18:58                                                     ` gmu 2k6
@ 2006-07-25 19:21                                                       ` Jens Axboe
  2006-07-25 19:28                                                         ` gmu 2k6
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2006-07-25 19:21 UTC (permalink / raw)
  To: gmu 2k6; +Cc: linux-kernel

On Tue, Jul 25 2006, gmu 2k6 wrote:
> thanks Jens,
> 7b30f09245d0e6868819b946b2f6879e5d3d106b
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b30f09245d0e6868819b946b2f6879e5d3d106b
> has fixed the problem (maybe together with the other 3 changes in HEAD
> as the 2nd patch in this thread did not work in the first place or maybe
> it is a little bit different, no time to check right now).

It's an identical change, so the one I sent you should work as well.
Perhaps you botched that one test? These things happen, it's happened to
me as well :-)

The change definitely fixed it for me.

-- 
Jens Axboe


* Re: Re: i686 hang on boot in userspace
  2006-07-25 19:21                                                       ` Jens Axboe
@ 2006-07-25 19:28                                                         ` gmu 2k6
  0 siblings, 0 replies; 46+ messages in thread
From: gmu 2k6 @ 2006-07-25 19:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On 7/25/06, Jens Axboe <axboe@suse.de> wrote:
> On Tue, Jul 25 2006, gmu 2k6 wrote:
> > thanks Jens,
> > 7b30f09245d0e6868819b946b2f6879e5d3d106b
> > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b30f09245d0e6868819b946b2f6879e5d3d106b
> > has fixed the problem (maybe together with the other 3 changes in HEAD
> > as the 2nd patch in this thread did not work in the first place or maybe
> > it is a little bit different, no time to check right now).
>
> It's an identical change, so the one I sent you should work as well.
> Perhaps you botched that one test? These things happen, it's happened to
> me as well :-)
>
> The change definitely fixed it for me.

here too. right now I'm busy with trying to find out why /dev/hwrng is
present but does not pass rngtest checks. I'll post about that soonish
but it's a different issue of course.

anyway, thanks a lot Jens.

end of thread, other threads:[~2006-07-26  4:58 UTC | newest]

Thread overview: 46+ messages
     [not found] <20060714150418.120680@gmx.net>
2006-07-14 19:43 ` i686 hang on boot in userspace john stultz
2006-07-17 10:52 ` Roman Zippel
2006-07-17 11:09   ` Roman Zippel
2006-07-17 13:38   ` Uwe Bugla
2006-07-17 14:17     ` Roman Zippel
2006-07-17 14:59       ` gmu 2k6
2006-07-17 15:21         ` Roman Zippel
2006-07-17 15:58           ` gmu 2k6
2006-07-17 16:02             ` gmu 2k6
2006-07-17 17:03               ` Roman Zippel
2006-07-17 18:15                 ` gmu 2k6
2006-07-17 18:17                   ` gmu 2k6
2006-07-18  9:38                   ` gmu 2k6
2006-07-19 10:26                     ` gmu 2k6
2006-07-24 15:34                       ` gmu 2k6
2006-07-25  7:32                         ` Jens Axboe
2006-07-25  8:00                           ` gmu 2k6
2006-07-25  7:41                             ` Jens Axboe
     [not found]                               ` <f96157c40607250120s2554cbc6qbd7c42972b70f6de@mail.gmail.com>
     [not found]                                 ` <20060725080002.GD4044@suse.de>
2006-07-25  8:28                                   ` gmu 2k6
2006-07-25  8:08                                     ` Jens Axboe
2006-07-25  9:17                                       ` gmu 2k6
2006-07-25  8:57                                         ` Jens Axboe
2006-07-25 10:09                                           ` gmu 2k6
2006-07-25  9:46                                             ` Jens Axboe
2006-07-25 10:19                                               ` gmu 2k6
2006-07-25 10:41                                                 ` Jens Axboe
2006-07-25  9:20                                         ` gmu 2k6
2006-07-25  8:57                                           ` Jens Axboe
2006-07-25  9:35                                           ` gmu 2k6
2006-07-25  9:24                                             ` Jens Axboe
2006-07-25 11:29                                             ` Jens Axboe
2006-07-25 12:47                                               ` gmu 2k6
2006-07-25 12:52                                                 ` Jens Axboe
2006-07-25 12:58                                                   ` Jens Axboe
2006-07-25 14:27                                                     ` gmu 2k6
2006-07-25 14:29                                                       ` gmu 2k6
2006-07-25 15:18                                                       ` Jens Axboe
2006-07-25 13:13                                                   ` gmu 2k6
2006-07-25 14:50                                                   ` gmu 2k6
2006-07-25 15:19                                                     ` Jens Axboe
2006-07-25 18:58                                                     ` gmu 2k6
2006-07-25 19:21                                                       ` Jens Axboe
2006-07-25 19:28                                                         ` gmu 2k6
2006-07-25  9:51                                       ` gmu 2k6
2006-07-25  9:42                                         ` Jens Axboe
2006-07-17 16:11             ` gmu 2k6
