linux-kernel.vger.kernel.org archive mirror
* Lockups with loop'ed sparse files on reiserfs?
@ 2003-06-13 13:38 Christian Jaeger
  2003-06-13 15:56 ` Oleg Drokin
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Jaeger @ 2003-06-13 13:38 UTC
  To: linux-kernel

Hello

I've experienced 3 lockups in the last few days, all while using 
sparse files. Could also be problems with UML, SKAS, raid5 over loop 
device, or loop devices with vfat files, but it looks like the only 
common thing is sparse files on reiserfs.

1.) kernel 2.4.20 from debian unstable (= kernel.org kernel with 
quite a few security and other patches), additionally patched with 
kernel-patch-skas 3-1 from debian. Started user-mode-linux using a 
sparse file with an ext2 filesystem on it, using tap0 networking, did 
apt-get upgrade inside this uml (which started to download (and 
already unpack?) quite a bit of stuff), halfway through the whole 
(host) system froze. Still responded to pings, but telnet $host 80 
would not show any activity from running apache. Went to the server 
room, I could change virtual terminals with Alt-<number>, but could 
not log in. Reset.

2.) same kernel:
- created 6 sparse files of 650MB each, on reiserfs filesystems (some 
of them on the same filesystem), and 2 files of 650MB on a vfat 
filesystem.
- Tied them to /dev/loop*,
- mdadm /dev/md0 -C -l 5 -n 7 -x 1 /dev/loop*
- then (while the array was building) mkreiserfs /dev/md0,
- mount /dev/md0 /mnt/md0
- cd /mnt/md0; netcat -l -p "$port" | multifeed '|' sh -c 'exec 
md5sum >&2' '&' cat | gpg | lzop -d | tar xf -
   (where multifeed is a C program by myself feeding the data to 
multiple processes)
   basically fetch data from tcp and untar it onto the filesystem.
After about 500MB of data has been written onto /mnt/md0, the box 
froze. Still responded to ping, but not to telnet $host 80. Could 
switch vt's, type root and enter password, but didn't get a login.
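
For anyone reproducing step 2, here is a minimal sketch of how one such sparse backing file can be created and checked before it is attached with losetup. The temporary directory and the file name raid5_demo are hypothetical, and the dd invocation assumes GNU dd. The losetup and mdadm steps from the list above are left out because they need root.

```shell
#!/bin/sh
# Sketch: create one 650 MB sparse file and compare its apparent size
# with the space actually allocated on disk.
# The path below is hypothetical; it is not one of the files from the report.
set -eu

workdir=$(mktemp -d)
img="$workdir/raid5_demo"

# count=0 with seek=650 extends the file to 650 MB without writing data blocks.
dd if=/dev/zero of="$img" bs=1M count=0 seek=650 2>/dev/null

# Apparent size in bytes (what ls -l shows).
apparent_bytes=$(( $(wc -c < "$img") ))

# Allocated size in KB (what the file really occupies right now).
allocated_kb=$(du -k "$img" | cut -f1)

echo "apparent size: $apparent_bytes bytes"
echo "allocated:     $allocated_kb KB"

rm -rf "$workdir"
```

On a filesystem that supports sparse files the allocated figure starts out far below the apparent size and only grows as the array writes into the file, which is why free space on the host partition matters later.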

3.) kernel 2.4.18 from kernel.org (the machine ran without any 
problem (except for sporadically switching off dma on /dev/hda) with 
this kernel for about a year):
Did same thing as mentioned under 2.) (rm -rf /mnt/md0/* before 
starting the write again). This time it happened already after 
filling the md partition with about 200MB. And this time, while still 
responding to pings and being able to switch vt's, it wouldn't react 
to hitting the keys 'root'.

I'd mainly like to know if all of what I did is supported or not.

The machine is an AMD Duron 1 GHz with 256MB RAM, 3 IDE harddisks (but 
only hda and hdd involved in the above), 2 ethernet cards using 
8139too.

Christian.


* Re: Lockups with loop'ed sparse files on reiserfs?
  2003-06-13 13:38 Lockups with loop'ed sparse files on reiserfs? Christian Jaeger
@ 2003-06-13 15:56 ` Oleg Drokin
  2003-06-13 15:59   ` Oleg Drokin
  0 siblings, 1 reply; 6+ messages in thread
From: Oleg Drokin @ 2003-06-13 15:56 UTC
  To: Christian Jaeger; +Cc: linux-kernel

Hello!

On Fri, Jun 13, 2003 at 03:38:44PM +0200, Christian Jaeger wrote:

> I've experienced 3 lockups in the last few days, all while using 
> sparse files. Could also be problems with UML, SKAS, raid5 over loop 
> device, or loop devices with vfat files, but it looks like the only 
> common thing is sparse files on reiserfs.
> 
> 1.) kernel 2.4.20 from debian unstable (= kernel.org kernel with 
> quite a few security and other patches), additionally patched with 
> kernel-patch-skas 3-1 from debian. Started user-mode-linux using a 
> sparse file with an ext2 filesystem on it, using tap0 networking, did 
> apt-get upgrade inside this uml (which started to download (and 
> already unpack?) quite a bit of stuff), halfway through the whole 
> (host) system froze. Still responded to pings, but telnet $host 80 
> would not show any activity from running apache. Went to the server 
> room, I could change virtual terminals with Alt-<number>, but could 
> not log in. Reset.

Was there anything interesting on the console where your kernel outputs
its messages (the host kernel)?
Any chance to hit, say, sysrq-T/sysrq-P to find out where the CPU spins?

> I'd mainly like to know if all of what I did is supported or not.

Yes, it is supported. I am doing this kind of stuff (with uml and skas3)
on reiserfs every day and everything works just fine with 2.4.19, 2.4.20 and 2.4.21.

Bye,
    Oleg


* Re: Lockups with loop'ed sparse files on reiserfs?
  2003-06-13 15:56 ` Oleg Drokin
@ 2003-06-13 15:59   ` Oleg Drokin
  2003-06-13 18:07     ` Christian Jaeger
  0 siblings, 1 reply; 6+ messages in thread
From: Oleg Drokin @ 2003-06-13 15:59 UTC
  To: Christian Jaeger; +Cc: linux-kernel

Hello!

On Fri, Jun 13, 2003 at 07:56:34PM +0400, Oleg Drokin wrote:
> > already unpack?) quite a bit of stuff), halfway through the whole 
> > (host) system froze. Still responded to pings, but telnet $host 80 
> > would not show any activity from running apache. Went to the server 
> > room, I could change virtual terminals with Alt-<number>, but could 
> > not log in. Reset.
> Was there anything interesting on the console where your kernel outputs
> its messages (the host kernel)?

BTW, while we are at it, was there enough space on the partition with the sparse
files to hold all the data you were writing there?

Bye,
    Oleg


* Re: Lockups with loop'ed sparse files on reiserfs?
  2003-06-13 15:59   ` Oleg Drokin
@ 2003-06-13 18:07     ` Christian Jaeger
  2003-06-13 20:22       ` Oleg Drokin
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Jaeger @ 2003-06-13 18:07 UTC
  To: Oleg Drokin; +Cc: linux-kernel

At 19:59 +0400 on 13.06.2003, Oleg Drokin wrote:
>  > Was there anything interesting on the console where your kernel outputs
>  > its messages (the host kernel)?

IIRC nothing was output, at least I don't remember anything that I 
thought was significant. But see below re kern.log entries.

>Any chance to hit, say, sysrq-T/sysrq-P to find out where the CPU spins?

I've never used those, I'll have to learn about those debugging 
options first. Where should I look?

>BTW, while we are at it, was there enough space on the partition with the sparse
>files to hold all the data you were writing there?

I did calculate all the space before I started a few days ago. I have now 
recalculated based on the current free space on the partitions, and in fact on 
one partition there's not enough space (anymore?):

losetup /dev/loop0 /root/raid5_1
losetup /dev/loop1 /root/raid5_2
   du /root -> 1675228 k free. 650*1024*2=1331200 k, => ok
losetup /dev/loop2 /mnt/hdd8/raid5_3
losetup /dev/loop3 /mnt/hdd8/raid5_4
losetup /dev/loop4 /mnt/hdd8/raid5_5
   du /mnt/hdd8/ -> 1973856 k free. 650*1024*3=1996800k => *not* ok.
   (pity that I already deleted those 3 files)
losetup /dev/loop5 /mnt/hda11/raid5_6
   du /mnt/hda11	-> 849044 free. => ok.
losetup /dev/loop6 /mnt/hdd6/.c/raid5_8
losetup /dev/loop7 /mnt/hdd6/.c/raid5_9
   this is a vfat partition so no sparse files (and 2.9GB free too)
(The files look like:
-rw-------    1 root     root     681574400  8. Jun 23:46 raid5_6
)
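
The arithmetic in the listing above can be rechecked with a small shell sketch. The function names needed_kb and check are made up for illustration; the free-space figures are the ones quoted in this mail, and the calculation assumes every 650 MB file eventually gets filled completely.

```shell
#!/bin/sh
# Sketch: compare free space (in KB) with the space needed once all
# 650 MB sparse files on a partition are fully written.
# needed_kb and check are hypothetical helper names.

# needed_kb COUNT -> KB required for COUNT files of 650 MB each
needed_kb() {
    echo $(( 650 * 1024 * $1 ))
}

# check LABEL FREE_KB COUNT -> prints whether the partition can hold the files
check() {
    need=$(needed_kb "$3")
    if [ "$2" -ge "$need" ]; then
        echo "$1: ok (need $need KB, free $2 KB)"
    else
        echo "$1: NOT ok (need $need KB, free $2 KB, short by $(( need - $2 )) KB)"
    fi
}

check /root      1675228 2
check /mnt/hdd8  1973856 3
check /mnt/hda11 849044  1
```

With these numbers only /mnt/hdd8 comes up short, by 22944 KB, which matches the "*not* ok" result above.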

Now the question is what happens if a partition is full.
In fact I've seen this in kern.log (full log at
http://pflanze.mine.nu/~chris/scratch/kern.log ):

Jun 13 11:34:57 pflanze kernel: raid5: md0, not all disks are 
operational -- trying to recover array
...
Jun 13 11:34:57 pflanze kernel: md0: resyncing spare disk [dev 07:07] 
to replace failed disk

Though I think that was before I started writing stuff onto the array.

What does happen if a raid array fails (i.e. 2 disks fail and there's 
no spare, or 1 spare and 3 disks fail etc.)? If it's not an important 
array (i.e. no swap or root filesystem on it), is there a reason for 
the system to go down? Isn't it possible to just mark the mounted 
filesystem  as erroneous and return EIO to applications accessing it?

There's also the case 1, using uml. In this case I'm sure there was 
no problem with space. The sparse filesystem image file I used is 
exactly 500'000'000 bytes, and there's 1675228 k free space on the 
partition where it is put on.

Christian.


* Re: Lockups with loop'ed sparse files on reiserfs?
  2003-06-13 18:07     ` Christian Jaeger
@ 2003-06-13 20:22       ` Oleg Drokin
  2003-06-14 23:10         ` Christian Jaeger
  0 siblings, 1 reply; 6+ messages in thread
From: Oleg Drokin @ 2003-06-13 20:22 UTC
  To: Christian Jaeger; +Cc: linux-kernel

Hello!

On Fri, Jun 13, 2003 at 08:07:55PM +0200, Christian Jaeger wrote:

> >Any chance to hit, say, sysrq-T/sysrq-P to find out where the CPU spins?
> I've never used those, I'll have to learn about those debugging 
> options first. Where should I go to?

Read /usr/src/linux/Documentation/sysrq.txt
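
As a rough sketch of the first step (not a substitute for sysrq.txt): the script below only reads the sysrq switch and reports its state. The function name sysrq_state is made up for illustration, and it assumes the kernel was built with CONFIG_MAGIC_SYSRQ so that /proc/sys/kernel/sysrq exists.

```shell
#!/bin/sh
# Sketch: report whether the magic SysRq key is enabled.
# sysrq_state is a hypothetical helper; the optional argument lets you
# point it at a different file than /proc/sys/kernel/sysrq.
sysrq_state() {
    f=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$f" ]; then
        # File is missing, e.g. kernel built without CONFIG_MAGIC_SYSRQ.
        echo "unavailable"
        return 0
    fi
    v=$(cat "$f")
    if [ "$v" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (value $v)"
    fi
}

sysrq_state

# To enable it, as root:
#   echo 1 > /proc/sys/kernel/sysrq
# Then, on the console of the hung machine:
#   Alt-SysRq-T dumps the task list, Alt-SysRq-P dumps the CPU registers.
```

The output of the key presses goes to the kernel log and console, so it helps to have the console visible (or a serial console attached) when the lockup happens.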

> Now the question is what happens if a partition is full.

There was a known problem with reiserfs where it might sometimes
deadlock in out-of-space situations.
This is fixed in 2.4.21.

> In fact I've seen this in kern.log (full log at
> http://pflanze.mine.nu/~chris/scratch/kern.log ):
> Jun 13 11:34:57 pflanze kernel: raid5: md0, not all disks are 
> operational -- trying to recover array
> ...
> Jun 13 11:34:57 pflanze kernel: md0: resyncing spare disk [dev 07:07] 
> to replace failed disk

This is raid5 stuff resyncing. It is probably normal if you have just
set up the raid5 array.

> What does happen if a raid array fails (i.e. 2 disks fail and there's 
> no spare, or 1 spare and 3 disks fail etc.)? If it's not an important 

Everything that will access this array will break, I presume ;)

> array (i.e. no swap or root filesystem on it), is there a reason for 
> the system to go down? Isn't it possible to just mark the mounted 
> filesystem  as erroneous and return EIO to applications accessing it?

Something like that will happen.

> There's also the case 1, using uml. In this case I'm sure there was 
> no problem with space. The sparse filesystem image file I used is 
> exactly 500'000'000 bytes, and there's 1675228 k free space on the 
> partition where it is put on.

Ok, that's where sysrq-T/sysrq-P traces would be most useful.
And if you'd try with 2.4.21 that would be even better.

Thank you.

Bye,
    Oleg


* Re: Lockups with loop'ed sparse files on reiserfs?
  2003-06-13 20:22       ` Oleg Drokin
@ 2003-06-14 23:10         ` Christian Jaeger
  0 siblings, 0 replies; 6+ messages in thread
From: Christian Jaeger @ 2003-06-14 23:10 UTC
  To: Oleg Drokin; +Cc: linux-kernel

At 0:22 +0400 on 14.06.2003, Oleg Drokin wrote:
>Read /usr/src/linux/Documentation/sysrq.txt

Done, new kernels now compiled with CONFIG_MAGIC_SYSRQ.

>There was a known problem with reiserfs where it might sometimes
>deadlock in out-of-space situations.
>This is fixed in 2.4.21.

Good to know.

>  > There's also the case 1, using uml. In this case I'm sure there was
>  > no problem with space. The sparse filesystem image file I used is
>  > exactly 500'000'000 bytes, and there's 1675228 k free space on the
>  > partition where it is put on.
>
>Ok, that's where sysrq-T/sysrq-P traces would be most useful.
>And if you'd try with 2.4.21 that would be even better.

I've now compiled 2.4.21 from kernel.org with skas3 from debian, as 
well as 2.4.21 with grsecurity (from grsecurity.net, with medium 
setting) and skas3, and tried uml again with the same sparse image 
multiple times under both. So far I haven't managed to lock the machine 
up, even while installing quite a lot of stuff, so maybe the problem 
is already solved. If not, I'll tell you when it happens again. (I 
think I'll run 2.4.21-grsec-skas3 for the near future now.)

Thanks for your help
Christian.
