linux-kernel.vger.kernel.org archive mirror
* kernel freeze on 2.4.0.prerelease (smp,raid5)
@ 2001-01-02 17:19 Otto Meier
  0 siblings, 0 replies; 6+ messages in thread
From: Otto Meier @ 2001-01-02 17:19 UTC (permalink / raw)
  To: linux-kernel

On all kernels newer than 2.4.0t13p3 I have the following problem.

Shortly after boot (some seconds) the system freezes. I can only switch consoles
but I am not able to log in. Over the net I only get responses to
pings, nothing else.

Up to kernel 2.4.0.t13p3 everything works fine.

Sorry for this simple description, but I am not able to get clearer information.
No oops, nothing in the logs after reboot with 240t13p3.

Perhaps someone has an idea where to dig?

ps: Here is my short system description:

Dual Celeron (SMP)
Raid5 (3 drives, actually 2 drives in degraded mode)







-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: kernel freeze on 2.4.0.prerelease (smp,raid5)
  2001-01-03 11:05 Otto Meier
@ 2001-01-04 21:53 ` Neil Brown
  0 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2001-01-04 21:53 UTC (permalink / raw)
  To: otto meier; +Cc: linux-kernel

On Wednesday January 3, gf435@gmx.net wrote:
> On Tue, 02 Jan 2001 18:19:41 +0100, Otto Meier wrote:
> 
> >>Dual Celeron (SMP,raid5)
> >> As stated in my first mail I actually run my raid5 devices in degraded mode
> >> and as I remember some raid5 stuff changed between
> >> test13p3 and newer kernels.
> 
> >So tell us, why do you run your raid5 devices in degraded mode?? It
> >cannot be good for performance, and certainly isn't good for
> >redundancy!!! But I'm not complaining as you found a bug...
> 
> I am actually in the middle of the conversion process to raid5, but it takes a while.
> I am too lazy :-) to free up the next drive to get raid5 into
> fully running mode.

If "necessity is the mother of invention", then I think laziness is
the father :-)

> 
> btw what does this message in boot.msg mean?
> 
> <4>raid5: switching cache buffer size, 4096 --> 1024
> <4>raid5: switching cache buffer size, 1024 --> 4096

The raid5 module maintains a stripe cache. The width of this cache
needs to be the same as the size of requests that are received.
The initial default size is 4096.
When you mkfs or fsck, the I/O requests that arrive are 1024 bytes
long, so the cache is flushed and rebuilt with a different size.
After you mount a filesystem, requests start coming at filesystem
blocksize, which is typically 4096 bytes.
If you happen to use LVM to partition a raid5 device, and have a
1K-block filesystem in one partition and a 4k-block filesystem in
another, then requests of different sizes will arrive mixed together
and the stripe cache will constantly be flushed and rebuilt, and you
will get lots of these messages together with a performance hit, as
lots of requests will get serialised by the cache flushing.
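
The behaviour described above can be illustrated with a toy model. This is
only a sketch, not the raid5 driver's code: the struct and function names
(stripe_cache, cache_handle_request, count_rebuilds) are made up for
illustration, and the only facts taken from the explanation are the 4096-byte
default and the flush-and-rebuild on a size change.

```c
#include <stdio.h>

/* Hypothetical, simplified model of a stripe cache built for one size. */
struct stripe_cache {
    int buffer_size;   /* request size in bytes the cache is built for */
    int rebuilds;      /* number of times the cache was flushed and rebuilt */
};

/* Handle one request. If its size differs from the cache's current size,
 * "flush and rebuild" the cache. Returns 1 on a rebuild, 0 otherwise. */
static int cache_handle_request(struct stripe_cache *c, int request_size)
{
    if (request_size == c->buffer_size)
        return 0;
    printf("raid5: switching cache buffer size, %d --> %d\n",
           c->buffer_size, request_size);
    c->buffer_size = request_size;
    c->rebuilds++;
    return 1;
}

/* Feed a sequence of request sizes and return the number of rebuilds. */
static int count_rebuilds(const int *sizes, int n)
{
    struct stripe_cache c = { 4096, 0 };   /* initial default size */
    for (int i = 0; i < n; i++)
        cache_handle_request(&c, sizes[i]);
    return c.rebuilds;
}
```

In this model an fsck followed by a mount (1024, 1024, 4096, 4096) costs two
rebuilds, while interleaved 1K and 4K requests cost one rebuild per request,
which is the constant flushing described above.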

NeilBrown

* Re: kernel freeze on 2.4.0.prerelease (smp,raid5)
@ 2001-01-03 11:05 Otto Meier
  2001-01-04 21:53 ` Neil Brown
  0 siblings, 1 reply; 6+ messages in thread
From: Otto Meier @ 2001-01-03 11:05 UTC (permalink / raw)
  To: linux-kernel

On Tue, 02 Jan 2001 18:19:41 +0100, Otto Meier wrote:

>>Dual Celeron (SMP,raid5)
>> As stated in my first mail I actually run my raid5 devices in degraded mode
>> and as I remember some raid5 stuff changed between
>> test13p3 and newer kernels.

>So tell us, why do you run your raid5 devices in degraded mode?? It
>cannot be good for performance, and certainly isn't good for
>redundancy!!! But I'm not complaining as you found a bug...

I am actually in the middle of the conversion process to raid5, but it takes a while.
I am too lazy :-) to free up the next drive to get raid5 into fully running mode.

>> Hope this gives someone an idea?

>Yep. This, combined with a related bug report from n0ymv@callsign.net
>strongly suggests the following patch.
>Writes to the failed drive are never completing, so you eventually
>run out of stripes in the stripe cache and you block waiting for a
>stripe to become free. 

>Please test this and confirm that it works.

It really did the trick, you are great.
The system has now been running for over an hour; otherwise it would have
crashed after some seconds (20 to 30).

btw what does this message in boot.msg mean?

<4>raid5: switching cache buffer size, 4096 --> 1024
<4>raid5: switching cache buffer size, 1024 --> 4096

You will find the log of the raid init below.

Thanks again

Otto

--- ./drivers/md/raid5.c 2001/01/03 09:04:05 1.1
+++ ./drivers/md/raid5.c 2001/01/03 09:04:13
@@ -1096,8 +1096,10 @@
 				bh->b_rdev = bh->b_dev;
 				bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
 				generic_make_request(action[i]-1, bh);
-			} else
+			} else {
 				PRINTK("skip op %d on disc %d for sector %ld\n", action[i]-1, i, sh->sector);
+				clear_bit(BH_Lock, &bh->b_state);
+			}
 		}
 }
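
For readers unfamiliar with the b_rsector line in the patch context: it
converts a block number into a 512-byte sector number, since size >> 9 is
size / 512. A minimal stand-alone sketch of that arithmetic follows; the
function name block_to_sector is made up here and is not a kernel function.

```c
/* Convert a block number to a 512-byte sector number, mirroring
 * b_rsector = b_blocknr * (b_size >> 9) from the patch context.
 * block_to_sector is a hypothetical helper, not a kernel function. */
static unsigned long block_to_sector(unsigned long blocknr, unsigned int size)
{
    return blocknr * (size >> 9);   /* size >> 9 is size / 512 */
}
```

With 4096-byte blocks each block spans 8 sectors, so block 10 starts at
sector 80; with 1024-byte blocks it starts at sector 20.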

>Raid5 (3 drives, actually 2 drives in degraded mode)
<6>raid5: device hdg7 operational as raid disk 1
<6>raid5: device hde7 operational as raid disk 0
<1>raid5: md1, not all disks are operational -- trying to recover array
<6>raid5: allocated 3264kB for md1
<1>raid5: raid level 5 set md1 active with 2 out of 3 devices, algorithm 2
<4>RAID5 conf printout:
<4> --- rd:3 wd:2 fd:1
<4> disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hde7
<4> disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg7
<4> disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev 00:00]
<4>RAID5 conf printout:
<4> --- rd:3 wd:2 fd:1
<4> disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hde7
<4> disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg7
<4> disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev 00:00]
<6>md: updating md1 RAID superblock on device
<4>hdg7 [events: 00000087](write) hdg7's sb offset: 24989696
<6>md: recovery thread got woken up ...
<3>md1: no spare disk to reconstruct array! -- continuing in degraded mode
<6>md: recovery thread finished ...
<4>hde7 [events: 00000087](write) hde7's sb offset: 24989696
<4>.
<4>... autorun DONE.






* Re: kernel freeze on 2.4.0.prerelease (smp,raid5)
  2001-01-02 20:18 Otto Meier
@ 2001-01-03  9:18 ` Neil Brown
  0 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2001-01-03  9:18 UTC (permalink / raw)
  To: otto meier; +Cc: linux-kernel

On Tuesday January 2, gf435@gmx.net wrote:
> 
> Perhaps a deadlock with a normal (not irq) spinlock.
> 
> Could you enable SysRQ and press <Alt>+<SysRq>+<P> ("showPc")
> Then write down the EIP values (including the [< >] brackets) and
> translate them with ksymoops.
> 
> Ksymoops repeats only the EIP values.
> 
> But searching through the System.map file shows only labels from
> the raid5 code around those addresses.
> 
> As stated in my first mail I actually run my raid5 devices in degraded mode
> and as I remember some raid5 stuff changed between
> test13p3 and newer kernels.

So tell us, why do you run your raid5 devices in degraded mode??  It
cannot be good for performance, and certainly isn't good for
redundancy!!!  But I'm not complaining as you found a bug...
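
As background on why degraded mode costs performance: in RAID5 a block on the
missing disk has to be recomputed by XOR-ing the corresponding blocks of all
surviving disks. The following is a minimal sketch of that general technique,
not code from the md driver; the function name raid5_reconstruct and its
parameters are made up for illustration.

```c
#include <stddef.h>

/* Rebuild the block of a missing disk by XOR-ing the corresponding blocks
 * of all surviving disks (data and parity alike).
 * Hypothetical helper for illustration, not part of drivers/md/raid5.c. */
static void raid5_reconstruct(unsigned char *missing,
                              const unsigned char *const *surviving,
                              size_t ndisks, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char x = 0;
        for (size_t d = 0; d < ndisks; d++)
            x ^= surviving[d][i];
        missing[i] = x;
    }
}
```

The same routine computes parity from the data blocks, so on a 3-disk array
every read of the missing disk's data costs two reads plus an XOR pass.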


> 
> Hope this gives someone an idea?

Yep.  This, combined with a related bug report from  n0ymv@callsign.net
strongly suggests the following patch.
Writes to the failed drive are never completing, so you eventually
run out of stripes in the stripe cache and you block waiting for a
stripe to become free.  

Please test this and confirm that it works.

NeilBrown


--- ./drivers/md/raid5.c	2001/01/03 09:04:05	1.1
+++ ./drivers/md/raid5.c	2001/01/03 09:04:13
@@ -1096,8 +1096,10 @@
 				bh->b_rdev = bh->b_dev;
 				bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
 				generic_make_request(action[i]-1, bh);
-			} else
+			} else {
 				PRINTK("skip op %d on disc %d for sector %ld\n", action[i]-1, i, sh->sector);
+				clear_bit(BH_Lock, &bh->b_state);
+			}
 		}
 }
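
The failure mode described above can be shown with a toy simulation. This is
a sketch under simplifying assumptions, not the driver's logic: the pool size
NR_STRIPES and the function simulate_writes are made up, and each stripe is
reduced to a single "locked" flag that stands in for the BH_Lock bit.

```c
#include <stdbool.h>

#define NR_STRIPES 4   /* hypothetical, tiny pool for illustration */

/* Simulate `writes` write requests against a degraded array in which every
 * request touches the failed disk. Returns the number of requests served
 * before no free stripe is left (or `writes` if all were served). */
static int simulate_writes(int writes, bool clear_lock_on_skip)
{
    bool locked[NR_STRIPES] = { false };

    for (int w = 0; w < writes; w++) {
        int s = -1;
        for (int i = 0; i < NR_STRIPES; i++) {
            if (!locked[i]) {
                s = i;
                break;
            }
        }
        if (s < 0)
            return w;            /* the real system would block here */

        locked[s] = true;        /* buffer locked for the pending write */

        /* The disk has failed, so the request is skipped, not submitted.
         * No I/O completion will ever arrive to unlock the buffer. */
        if (clear_lock_on_skip)
            locked[s] = false;   /* models the clear_bit() in the patch */
    }
    return writes;
}
```

Without the unlock on the skip path the pool is exhausted after NR_STRIPES
writes; with it, all writes are served.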
 

* Re: kernel freeze on 2.4.0.prerelease (smp,raid5)
@ 2001-01-02 20:18 Otto Meier
  2001-01-03  9:18 ` Neil Brown
  0 siblings, 1 reply; 6+ messages in thread
From: Otto Meier @ 2001-01-02 20:18 UTC (permalink / raw)
  To: linux-kernel


> Perhaps a deadlock with a normal (not irq) spinlock.
>
> Could you enable SysRQ and press <Alt>+<SysRq>+<P> ("showPc")
> Then write down the EIP values (including the [< >] brackets) and
> translate them with ksymoops.

Ksymoops repeats only the EIP values.

But searching through the System.map file shows only labels from
the raid5 code around those addresses.
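
For reference, looking up an EIP value by hand in System.map amounts to
finding the last symbol whose address is not greater than the EIP. A minimal
sketch of that lookup, assuming the map has already been parsed into an array
sorted by address; struct ksym and ksym_lookup are made-up names, not part of
ksymoops or the kernel.

```c
#include <stddef.h>

/* One parsed System.map entry (hypothetical representation). */
struct ksym {
    unsigned long addr;
    const char *name;
};

/* Return the name of the last symbol whose address is <= eip, or NULL if
 * eip lies below the first symbol. `map` must be sorted by address. */
static const char *ksym_lookup(const struct ksym *map, size_t n,
                               unsigned long eip)
{
    const char *best = NULL;
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].addr <= eip) {
            best = map[mid].name;   /* candidate; look for a later one */
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return best;
}
```

The offset into the function is then eip minus the address of the symbol
that was found.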

As stated in my first mail I actually run my raid5 devices in degraded mode
and as I remember some raid5 stuff changed between
test13p3 and newer kernels.

Hope this gives someone an idea?

Bye Otto
 



* Re: kernel freeze on 2.4.0.prerelease (smp,raid5)
@ 2001-01-02 17:34 Manfred
  0 siblings, 0 replies; 6+ messages in thread
From: Manfred @ 2001-01-02 17:34 UTC (permalink / raw)
  To: gf435, linux-kernel

> No oops, nothing in the logs after reboot with 240t13p3. 
>
>Perhaps someone has an idea where to dig? 
>
> ps: Here is my short system description: 
>
> Dual Celeron (SMP) 

Perhaps a deadlock with a normal (not irq) spinlock.

Could you enable SysRQ and press <Alt>+<SysRq>+<P> ("showPc")
Then write down the EIP values (including the [< >] brackets) and
translate them with ksymoops.

See Documentation/sysrq.txt and oops-tracing.txt.

--
	Manfred

Thread overview: 6+ messages
2001-01-02 17:19 kernel freeze on 2.4.0.prerelease (smp,raid5) Otto Meier
2001-01-02 17:34 Manfred
2001-01-02 20:18 Otto Meier
2001-01-03  9:18 ` Neil Brown
2001-01-03 11:05 Otto Meier
2001-01-04 21:53 ` Neil Brown
