linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: 2.6.test11 bug
@ 2003-12-08  3:46 Rafal Skoczylas
  2003-12-08  4:17 ` William Lee Irwin III
  2003-12-08  5:17 ` Linus Torvalds
  0 siblings, 2 replies; 11+ messages in thread
From: Rafal Skoczylas @ 2003-12-08  3:46 UTC (permalink / raw)
  To: linux-kernel

[ I am sorry if this message doesn't get threaded in the original
thread in your software, but I read lkml ocassionally through
usenet-gate so didn't have the original message in my mbox and
couldn't just hit "reply" ;) ]

On Mon, 08 Dec 2003 03:30:12 +0100 Gordon Cormack wrote:
> I have read the FAQ but I'm confused about how to report a 2.6
> kernel bug, or who to report it to.
> [...]
> Dec  6 13:16:01 flax20 kernel: Bad page state at free_hot_cold_page
> Dec  6 13:16:01 flax20 kernel: flags:0x02000114 mapping:00000000
> mapped:1 count:0
> [...]

Hello.
I am experiencing similiar behaviour as described below.
In my case it is mlnetd (of mldonkey package) which seems to be
responsible for driving kernel to a crash[1].
After a few hours of running, either the process gets killed or system
crashes (I am only able to reboot it with alt+prntscr+b, but it seems
like it is not able to [S]ync or [U]nmount filesystems - i have lost
a few files which were open at the time of crash[2]).

It may be worth to mention that I don't remember having such a crash
on 2.6.0-test9 which i used for a couple of weeks (since first day
it apeared on ftp.kernel.org untill test11 - i skipped test10).

Hardware:
---------
ALi M1647, Duron 1200, 512MB sdram.

Kernel:
-------
Linux poziomka 2.6.0-test11 #32 Fri Dec 5 21:10:40 CET 2003 i686
AMD_Duron(TM)Processor unknown Shameless Compilation

Compiled with gcc-3.3.2.

Logs:
-----

--- The last time, i got the following:

Bad page state at free_hot_cold_page
flags:0x01020008 mapping:d38afe68 mapped:0 count:0
Backtrace: 
Call Trace:
 [bad_page+93/144] bad_page+0x5d/0x90
 [free_hot_cold_page+82/256] free_hot_cold_page+0x52/0x100
 [zap_pte_range+358/416] zap_pte_range+0x166/0x1a0
 [zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
 [unmap_page_range+67/112] unmap_page_range+0x43/0x70 
 [unmap_vmas+225/528] unmap_vmas+0xe1/0x210
 [exit_mmap+123/400] exit_mmap+0x7b/0x190
 [mmput+100/192] mmput+0x64/0xc0
 [do_exit+282/976] do_exit+0x11a/0x3d0
 [do_group_exit+58/176] do_group_exit+0x3a/0xb0
 [get_signal_to_deliver+590/848] get_signal_to_deliver+0x24e/0x350
 [do_signal+149/288] do_signal+0x95/0x120
 [schedule+761/1392] schedule+0x2f9/0x570
 [pipe_write+0/800] pipe_write+0x0/0x320
 [sys_rt_sigsuspend+222/272] sys_rt_sigsuspend+0xde/0x110
 [syscall_call+7/11] syscall_call+0x7/0xb
Trying to fix it up, but a reboot is needed

--- But sometimes it get things like this:

Unable to handle kernel paging request at virtual address 5a85fb5c
 printing eip:
c011e6c4
*pde = 00000000
Oops: 0002 [#1]
CPU:    0
EIP:    0060:[remove_wait_queue+36/112]    Not tainted
EFLAGS: 00010002
EIP is at remove_wait_queue+0x24/0x70
eax: defb4000   ebx: da85fb58   ecx: 5a85fb58   edx: db0468b0
esi: db0468bc   edi: 00000292   ebp: defb5fa0   esp: defb5f58
ds: 007b   es: 007b   ss: 0068
Process mlnetd (pid: 1456, threadinfo=defb4000 task=dd72e100)
Stack: db0468ac db046008 db046000 c0167484 00000000 cad86c08 00000001 c01681f5
       defb5fa0 cad86c00 defb5fa0 00000041 defb4000 08376b48 cad86c08 00000000
       cad86c00 00000001 c01674b0 db046000 00000000 3fd32b9c 083767d8 00000001
Call Trace:
 [poll_freewait+36/80] poll_freewait+0x24/0x50
 [sys_poll+581/656] sys_poll+0x245/0x290
 [__pollwait+0/208] __pollwait+0x0/0xd0
 [syscall_call+7/11] syscall_call+0x7/0xb

Code: 89 59 04 89 0b c7 46 04 00 02 20 00 c7 42 0c 00 01 10 00 57
 <6>note: mlnetd[1456] exited with preempt_count 1
bad: scheduling while atomic!
Call Trace:
 [schedule+1373/1392] schedule+0x55d/0x570
 [unmap_page_range+67/112] unmap_page_range+0x43/0x70
 [unmap_vmas+436/528] unmap_vmas+0x1b4/0x210
 [exit_mmap+123/400] exit_mmap+0x7b/0x190
 [mmput+100/192] mmput+0x64/0xc0
 [do_exit+282/976] do_exit+0x11a/0x3d0
 [do_page_fault+0/1292] do_page_fault+0x0/0x50c
 [die+225/240] die+0xe1/0xf0
 [do_page_fault+474/1292] do_page_fault+0x1da/0x50c
 [do_IRQ+253/304] do_IRQ+0xfd/0x130
 [common_interrupt+24/32] common_interrupt+0x18/0x20
 [tcp_poll+18/352] tcp_poll+0x12/0x160
 [do_page_fault+0/1292] do_page_fault+0x0/0x50c
 [error_code+45/56] error_code+0x2d/0x38
 [remove_wait_queue+36/112] remove_wait_queue+0x24/0x70
 [poll_freewait+36/80] poll_freewait+0x24/0x50
 [sys_poll+581/656] sys_poll+0x245/0x290
 [__pollwait+0/208] __pollwait+0x0/0xd0
 [syscall_call+7/11] syscall_call+0x7/0xb

(this last Call Trace repeted 2 more times)

If there is any important information missing, feel free to ask.
Regards.

[1] Actually, I am not sure, but this is the only candidate because
it uses more and more memory over the time and the crash or kill occurs
more or less at the same level of memory usage (~10-12% of 512Meg).
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
nils      1781 11.1 10.5 41252 38796 pts/2   S    03:39   3:15 mlnetd
                    ^^^^
[2] The files loss is probably XFS-related problem.

nils.
-- 
"Blessed is the man, who having nothing to say, abstains from giving wordy
evidence of the fact."  -- http://secprog.org/who/rs/quote.php?id=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
  2003-12-08  3:46 2.6.test11 bug Rafal Skoczylas
@ 2003-12-08  4:17 ` William Lee Irwin III
  2003-12-08  5:17 ` Linus Torvalds
  1 sibling, 0 replies; 11+ messages in thread
From: William Lee Irwin III @ 2003-12-08  4:17 UTC (permalink / raw)
  To: Rafal Skoczylas; +Cc: linux-kernel

On Mon, Dec 08, 2003 at 04:46:31AM +0100, Rafal Skoczylas wrote:
> I am experiencing similiar behaviour as described below.
> In my case it is mlnetd (of mldonkey package) which seems to be
> responsible for driving kernel to a crash[1].
> After a few hours of running, either the process gets killed or system
> crashes (I am only able to reboot it with alt+prntscr+b, but it seems
> like it is not able to [S]ync or [U]nmount filesystems - i have lost
> a few files which were open at the time of crash[2]).
> It may be worth to mention that I don't remember having such a crash
> on 2.6.0-test9 which i used for a couple of weeks (since first day
> it apeared on ftp.kernel.org untill test11 - i skipped test10).

I'm grabbing mldonkey and taking it for a spin.


-- wli

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
  2003-12-08  3:46 2.6.test11 bug Rafal Skoczylas
  2003-12-08  4:17 ` William Lee Irwin III
@ 2003-12-08  5:17 ` Linus Torvalds
  2003-12-08  9:02   ` Xavier Bestel
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Linus Torvalds @ 2003-12-08  5:17 UTC (permalink / raw)
  To: Rafal Skoczylas; +Cc: Kernel Mailing List, William Lee Irwin III



On Mon, 8 Dec 2003, Rafal Skoczylas wrote:
>
> --- But sometimes it get things like this:
>
> Unable to handle kernel paging request at virtual address 5a85fb5c
>  printing eip:
> c011e6c4
> *pde = 00000000
> Oops: 0002 [#1]
> CPU:    0
> EIP:    0060:[remove_wait_queue+36/112]    Not tainted
> EFLAGS: 00010002
> EIP is at remove_wait_queue+0x24/0x70
> eax: defb4000   ebx: da85fb58   ecx: 5a85fb58   edx: db0468b0
> esi: db0468bc   edi: 00000292   ebp: defb5fa0   esp: defb5f58
> Trace:
>  [poll_freewait+36/80] poll_freewait+0x24/0x50
>  [sys_poll+581/656] sys_poll+0x245/0x290
>  [__pollwait+0/208] __pollwait+0x0/0xd0
>  [syscall_call+7/11] syscall_call+0x7/0xb

Ahh.. Btw, this looks like a single-bit error.

In particular, this is "list_del()" in remove_wait_queue(), and it's
list_del() on a list that usually only has a single entry (ie a wait
queue). In particular, it's the "prev->next = next" part, where "prev" is
5a85fb58 and next is da85fb58.

Now, with a single entry, prev and next _should_ be the same (they both
point to the head). And they are - except for a single bit error. The
difference is the high bit of the word. Interesting.

You get an oops because that 5a85fb58 _should_ be da85fb58. Notice?

> If there is any important information missing, feel free to ask.

It could be bad memory. We even know the address that is bad: it's
(%esi+4), ie bit 31 of the word at physical address 0x1b0468f0.

However, if you don't see random SIGSEGV's while compiling etc issues, it
doesnt' sound like flaky RAM.

I'm wondering if there is some bit operation out there somewhere with a
wild pointer?

Rafal - how consistent is the second form of the oops? Have you seen that
trace more than once? Might it actually be "poll()" itself that has some
bad behaviour?

			Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
  2003-12-08  5:17 ` Linus Torvalds
@ 2003-12-08  9:02   ` Xavier Bestel
  2003-12-08 16:27   ` Rafal Skoczylas
       [not found]   ` <20031208161742.GB9087@secprog.org>
  2 siblings, 0 replies; 11+ messages in thread
From: Xavier Bestel @ 2003-12-08  9:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafal Skoczylas, Linux Kernel Mailing List, William Lee Irwin III

Le lun 08/12/2003 à 06:17, Linus Torvalds a écrit :
> It could be bad memory. We even know the address that is bad: it's
> (%esi+4), ie bit 31 of the word at physical address 0x1b0468f0.
> 
> However, if you don't see random SIGSEGV's while compiling etc issues, it
> doesnt' sound like flaky RAM.

(FWIW) I have a server running 2.4.22 and pppoe (to talk to an ADSL
modem). It works really flawlessly *except* when I run mldonkey: then
pppoe regularly fails and drops the connection. As this thing generates
a lot of packets to/from different hosts, I suspects it's an excellent
test workload for some paths of the kernel.

	Xav


^ permalink raw reply	[flat|nested] 11+ messages in thread

* (no subject)
  2003-12-08  5:17 ` Linus Torvalds
  2003-12-08  9:02   ` Xavier Bestel
@ 2003-12-08 16:27   ` Rafal Skoczylas
       [not found]   ` <20031208161742.GB9087@secprog.org>
  2 siblings, 0 replies; 11+ messages in thread
From: Rafal Skoczylas @ 2003-12-08 16:27 UTC (permalink / raw)
  To: linux-kernel

On Sun, Dec 07, 2003 at 09:17:35PM -0800, Linus Torvalds wrote:
> On Mon, 8 Dec 2003, Rafal Skoczylas wrote:
> > Unable to handle kernel paging request at virtual address 5a85fb5c
> > [...]
> > EIP:    0060:[remove_wait_queue+36/112]    Not tainted
> > [...]
> > eax: defb4000   ebx: da85fb58   ecx: 5a85fb58   edx: db0468b0
> > esi: db0468bc   edi: 00000292   ebp: defb5fa0   esp: defb5f58
> > Trace:
> >  [poll_freewait+36/80] poll_freewait+0x24/0x50
> >  [sys_poll+581/656] sys_poll+0x245/0x290
> >  [__pollwait+0/208] __pollwait+0x0/0xd0
> >  [syscall_call+7/11] syscall_call+0x7/0xb
> 
> It could be bad memory. We even know the address that is bad: it's
> (%esi+4), ie bit 31 of the word at physical address 0x1b0468f0.
> However, if you don't see random SIGSEGV's while compiling etc issues, it
> doesnt' sound like flaky RAM.

Indeed, I do not have any random SIGSEGVs at any time.
Additionaly, as what you said sounded right to me I performed extensive
memory tests with x86-memtest v3.0 during the night and as I expected
memory seems to be OK.

> Rafal - how consistent is the second form of the oops?
> Have you seen that trace more than once?

Not exactly the same, but there are some similarities (If I understand
this log correctly). I ripped those oopses out of the logs so maybe you
could look yourself and see something I don't see:
http://secprog.org/who/rs/linux/2.6-test11-log.txt
These are oopses I have experienced on test11 (Unfortunately, I dont have
logs from test9 since I don't keep logs that long on workstation).

nils.
-- 
"Blessed is the man, who having nothing to say, abstains from giving wordy
evidence of the fact."  -- http://secprog.org/who/rs/quote.php?id=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
       [not found]     ` <Pine.LNX.4.58.0312080848560.13236@home.osdl.org>
@ 2003-12-08 17:12       ` Linus Torvalds
       [not found]         ` <20031209194827.GA22265@secprog.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2003-12-08 17:12 UTC (permalink / raw)
  To: Rafal Skoczylas; +Cc: Kernel Mailing List


Rafael - you had the wrong address for the kernel mailing list, and I
didn't notice when I replied.

Here is my comment again for the mailing list in case somebody else can
think of something. The more heads, the merrier.

		Linus

On Mon, 8 Dec 2003, Linus Torvalds wrote:
>
> On Mon, 8 Dec 2003, Rafal Skoczylas wrote:
> >
> > > Rafal - how consistent is the second form of the oops?
> > > Have you seen that trace more than once?
> >
> > Not exactly the same, but there are some similarities (If I understand
> > this log correctly). I ripped those oopses out of the logs so maybe you
> > could look yourself and see something I don't see:
> > http://secprog.org/who/rs/linux/2.6-test11-log.txt
>
> Oh wow.
>
> They all look to be (except for the odd last "bad page state" one, which
> is likely just a result of some _other_ earlier corruption) due to the
> high bit being cleared. And it's consistent across reboots too, so it's
> not just some corruption that stayed around in memory.
>
> In all cases it looks like a pointer that should have been of the form
> "0xDxxxxxxx", and was corrupted into the form "0x5xxxxxxx".
>
> And every time it's "mlnetd" - which may just be a coincidence (possibly
> brought on by that being the most commonly run thing on your box), but it
> certainly looks like it could also be an indication of the source of the
> corruption.
>
> I find it interesting that it seems to be bit 31 that is corrupt:
> admittedly that is also the only bit that is almost guaranteed to cause
> oopses when it corrupts (because it changes the pointer "the most"), but
> it's also to some degree one of the least used bits for bit operations. A
> lot of things use bit #0 for locking etc.
>
> I'll have to think about this, but quite frankly I'm also hoping to see
> more of a pattern about what this is all about. Can you keep your oopses
> up somewhere? Maybe opening a bug on bugme.osdl.org? Even though I don't
> use bugme personally, it's good to keep the record around when we don't
> immediately see the reason for something..
>
> 			Linus
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
       [not found]           ` <Pine.LNX.4.58.0312091221440.21456@home.osdl.org>
@ 2003-12-09 22:31             ` Rafal Skoczylas
  2003-12-09 23:26               ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Rafal Skoczylas @ 2003-12-09 22:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

[ I am leaving most Linus' quotations so everyone could see what we were
talking about as I haven't CC'ed previous email to lkml nor have Linus ]

On Tue, Dec 09, 2003 at 12:31:15PM -0800, Linus Torvalds wrote:
> > First I had an impression like my ram has to be broken as the address is
> > the same in all but one, but how would it pass all those memtests
> > I performed 2 nights ago? And why is it allways the same call trace?
> 
> It's interesting - it looks like it's the
> 
> 	dentry = file->f_dentry;
> 
> thing, with "file" being NULL.
> 
> Can you do a "gdb vmlinux" and then do "x/20i __fput" for me, to show what
> your compiler is doing?
> 
> Oh, and do a "disassemble fput" too.
> 
> Because "file" really shouldn't be NULL there - and even if it was, we
> should have gotten an oops _earlier_, when we did the
> "atomic_dec_and_test()" on "file->f_count".

2.6.0-test11 -- no frame pointers
gcc (GCC) 3.3.2 (Shameless Linux)

GNU gdb 6.0
[...]
(gdb) x/20i __fput
0xc0155850 <__fput>:    push   %ebp
0xc0155851 <__fput+1>:  push   %edi
0xc0155852 <__fput+2>:  push   %esi
0xc0155853 <__fput+3>:  push   %ebx
0xc0155854 <__fput+4>:  mov    %eax,%ebx
0xc0155856 <__fput+6>:  sub    $0xc,%esp
0xc0155859 <__fput+9>:  mov    0xc(%ebx),%edi
0xc015585c <__fput+12>: mov    0x8(%eax),%eax
0xc015585f <__fput+15>: mov    %eax,0x8(%esp,1)
0xc0155863 <__fput+19>: mov    0x8(%eax),%ebp
0xc0155866 <__fput+22>: lea    0x78(%ebx),%eax
0xc0155869 <__fput+25>: cmp    %eax,0x78(%ebx)
0xc015586c <__fput+28>: jne    0xc0155963 <__fput+275>
0xc0155872 <__fput+34>: mov    %ebx,(%esp,1)
0xc0155875 <__fput+37>: call   0xc016ae50 <locks_remove_flock>
0xc015587a <__fput+42>: mov    0x10(%ebx),%eax
0xc015587d <__fput+45>: test   %eax,%eax
0xc015587f <__fput+47>: je     0xc015588c <__fput+60>
0xc0155881 <__fput+49>: mov    0x30(%eax),%esi
0xc0155884 <__fput+52>: test   %esi,%esi
(gdb) disassemble fput
Dump of assembler code for function fput:
0xc0155830 <fput+0>:    mov    %eax,%edx
0xc0155832 <fput+2>:    decl   0x14(%eax)
0xc0155835 <fput+5>:    sete   %al
0xc0155838 <fput+8>:    test   %al,%al
0xc015583a <fput+10>:   jne    0xc0155840 <fput+16>
0xc015583c <fput+12>:   nop
0xc015583d <fput+13>:   ret
0xc015583e <fput+14>:   mov    %esi,%esi
0xc0155840 <fput+16>:   mov    %edx,%eax
0xc0155842 <fput+18>:   jmp    0xc0155850 <__fput>
0xc0155847 <fput+23>:   mov    %esi,%esi
0xc0155849 <fput+25>:   lea    0x0(%edi,1),%edi
End of assembler dump.
(gdb)


> Hmm.. Maybe that is actually the "dentry->d_inode" access instead -
> although for me the inode field is at offset 12. But if you have an UP
> kernel, offset 8 might be the right one.. Anyway, a NULL inode is _also_ a
> bug at that point, so it's not as if that solves anything.
> 
> I wonder if there is a problem with gcc-3.3.2. Have you tried other
> compilers?

Not yet. Now I'm downloading 2.9x and will try with that one.

> Also, it might be interesting to hear if there is a difference if you
> compile with frame pointers off (or on, if they are now off).

Now they are off. I'll try turning on and if it won't get any better I'll
try with another compiler.
Anyway, while compiling with frame pointers I got another oops ;/
This one is different those we have seen so far:
http://secprog.org/who/rs/linux/2003-12-09b-linux-2.6.0-test11.txt

For those who haven't seen previous ones from today:
http://secprog.org/who/rs/linux/2003-12-09-linux-2.6.0-test11.txt

nils.
-- 
"Blessed is the man, who having nothing to say, abstains from giving wordy
evidence of the fact."  -- http://secprog.org/who/rs/quote.php?id=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
  2003-12-09 22:31             ` Rafal Skoczylas
@ 2003-12-09 23:26               ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2003-12-09 23:26 UTC (permalink / raw)
  To: Rafal Skoczylas; +Cc: Kernel Mailing List



On Tue, 9 Dec 2003, Rafal Skoczylas wrote:
>
> Anyway, while compiling with frame pointers I got another oops ;/
> This one is different those we have seen so far:

Actually, it's very similar to the ones you had earlier - the backtrace is
different, but it's the same thing again:

	ecx: 5944f160   edx: d944f160

that's a doubly-linked list, and the two values _should_ be the same, I
bet.

Except one of them has the high bit clear, and when following the bogus
pointer, you get a page fault.

I think the second one (the "kernel bug in mmap.c" one) is just a result
of the first oops one leaving the mm lists in a bad state and map_count
being off as a result, or something like that.

			Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
       [not found]       ` <10xdJ-28r-21@gated-at.bofh.it>
@ 2003-12-08 22:28         ` Rafal Skoczylas
  0 siblings, 0 replies; 11+ messages in thread
From: Rafal Skoczylas @ 2003-12-08 22:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On 2003-12-08 17:20 Linus Torvalds wrote:
> On Mon, 8 Dec 2003, Linus Torvalds wrote:
>> They all look to be (except for the odd last "bad page state" one, which
>> is likely just a result of some _other_ earlier corruption) due to the
>> high bit being cleared. And it's consistent across reboots too, so it's
>> not just some corruption that stayed around in memory.

I have already stuffed kernel source with debugging printk's, etc.,
so hopefully we could more precisely say where the problem is. Anyway
it will probably take quite a long to determine because (as I previously
said in our "private" e-mail exchange) it takes long hours to crash.
(eg. today mlnetd was killed twice but no system crash yet)
Now, I am starting mlnet in a 'while [ 1 ]; do mlnet; done;' loop just
in case it gets killed and let's see what will happen during the night.

>> And every time it's "mlnetd" - which may just be a coincidence (possibly
>> brought on by that being the most commonly run thing on your box), but it
>> certainly looks like it could also be an indication of the source of the
>> corruption.

Well, most of the time I use only X11+ctwm+aterm+vim+gcc (and sometimes
mozilla or xpdf) so the fact is that mlnetd is the best candidate for
such things since it is the only one on my box which uses resources so
intensively (a few hundrets open sockets constantly being opened and
closed, ram usage ~10% of 512MB, cpu usage ~20% of D1200, etc.).

>> I'll have to think about this, but quite frankly I'm also hoping to see
>> more of a pattern about what this is all about. Can you keep your oopses
>> up somewhere? Maybe opening a bug on bugme.osdl.org? Even though I don't
>> use bugme personally, it's good to keep the record around when we don't
>> immediately see the reason for something..

Sure, I will keep all the oopses. For now I'm gonna collect them on
my webpage in http://secprog.org/who/rs/linux/ (I will report when there
is something new so there is no need to poll() ;>). And then if you are
interested in registering them in bugme, will do it.

Btw. Linus, there is no need CC'ing messages to my e-mail. I will follow
this thread on lkml.

nils.
-- 
"Blessed is the man, who having nothing to say, abstains from giving wordy
evidence of the fact."  -- http://secprog.org/who/rs/quote.php?id=1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.test11 bug
  2003-12-08  2:24 Gordon Cormack
@ 2003-12-08  2:37 ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2003-12-08  2:37 UTC (permalink / raw)
  To: Gordon Cormack; +Cc: linux-kernel



On Sun, 7 Dec 2003, Gordon Cormack wrote:
>
> I have read the FAQ but I'm confused about how to report a 2.6
> kernel bug, or who to report it to.

This is good.

> Here it is in a nutshell.

And we want some more information on what kind of load/machine/config this
is. However, I'm guessing from the report..

> Dec  6 13:16:01 flax20 kernel: Bad page state at free_hot_cold_page

..that this might be running XFS? We've had this report from XFS users
before.

But if it isn't using XFS, holler _loudly_, and please do include more
information about what configuration (both hardware and kernel config) the
machine has.

> As an aside, all versions of the 2.4 kernel are brought to their knees
> in this application ("kswapd problems" hit full force and none of the
> suggested patches worked).  Even with the occasional crash, 2.6.test11 is
> way better.

Hey, I don't like the occasional crash either.

		Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* 2.6.test11 bug
@ 2003-12-08  2:24 Gordon Cormack
  2003-12-08  2:37 ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Gordon Cormack @ 2003-12-08  2:24 UTC (permalink / raw)
  To: linux-kernel

Hi,

I have read the FAQ but I'm confused about how to report a 2.6
kernel bug, or who to report it to.

Here it is in a nutshell.

--- kernel ---

Linux xxx20.uwaterloo.ca 2.6.0-test11 #2 SMP Thu Nov 27 14:46:01 EST 2003 i686 athlon i386 GNU/Linux

---- log ----

Dec  6 13:16:01 flax20 kernel: Bad page state at free_hot_cold_page
Dec  6 13:16:01 flax20 kernel: flags:0x02000114 mapping:00000000 mapped:1 count:0
Dec  6 13:16:01 flax20 kernel: Backtrace:
Dec  6 13:16:01 flax20 kernel: Call Trace:
Dec  6 13:16:01 flax20 kernel:  [<c013f98d>] bad_page+0x5d/0x90
Dec  6 13:16:01 flax20 kernel:  [<c0140041>] free_hot_cold_page+0x61/0xf0
Dec  6 13:16:01 flax20 kernel:  [<c014063c>] __pagevec_free+0x1c/0x30
Dec  6 13:16:01 flax20 kernel:  [<c014527f>] release_pages+0x11f/0x140
Dec  6 13:16:01 flax20 kernel:  [<c01544a6>] free_pages_and_swap_cache+0x56/0x90
Dec  6 13:16:01 flax20 kernel:  [<c014cf60>] unmap_region+0x150/0x160
Dec  6 13:16:01 flax20 kernel:  [<c014d28f>] do_munmap+0x11f/0x170
Dec  6 13:16:01 flax20 kernel:  [<c014d325>] sys_munmap+0x45/0x70
Dec  6 13:16:01 flax20 kernel:  [<c010adcf>] syscall_call+0x7/0xb
Dec  6 13:16:01 flax20 kernel:
Dec  6 13:16:01 flax20 kernel: Trying to fix it up, but a reboot is needed
Dec  6 16:31:14 flax20 syslogd 1.4.1: restart.

--- comments ---

The problem has occurred only once on one of 13 dual-processor machines.
The one that crashed was one (of seven) with dual AMD 1900+, 2GB RAM,
and 4 ATA hard drives.  None of the other machines have crashed in the
9 days since I installed the kernel.

It appears to be related to high memory pressure but I can't reproduce
it with simple programs.  The machines run text search software that is
both CPU and IO intensive.

---

I'm not a developer but I am running a pretty significant load on these
machines and am prepared to do instrumentation/investigation in order to
diagnose the problem.

As an aside, all versions of the 2.4 kernel are brought to their knees
in this application ("kswapd problems" hit full force and none of the
suggested patches worked).  Even with the occasional crash, 2.6.test11 is
way better.

-- 
Gordon V. Cormack     CS Dept, University of Waterloo, Canada N2L 3G1
gvcormack@uwaterloo.ca            http://cormack.uwaterloo.ca/cormack

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2003-12-09 23:27 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-08  3:46 2.6.test11 bug Rafal Skoczylas
2003-12-08  4:17 ` William Lee Irwin III
2003-12-08  5:17 ` Linus Torvalds
2003-12-08  9:02   ` Xavier Bestel
2003-12-08 16:27   ` Rafal Skoczylas
     [not found]   ` <20031208161742.GB9087@secprog.org>
     [not found]     ` <Pine.LNX.4.58.0312080848560.13236@home.osdl.org>
2003-12-08 17:12       ` 2.6.test11 bug Linus Torvalds
     [not found]         ` <20031209194827.GA22265@secprog.org>
     [not found]           ` <Pine.LNX.4.58.0312091221440.21456@home.osdl.org>
2003-12-09 22:31             ` Rafal Skoczylas
2003-12-09 23:26               ` Linus Torvalds
     [not found] <10kzo-7mZ-11@gated-at.bofh.it>
     [not found] ` <10m88-2wd-1@gated-at.bofh.it>
     [not found]   ` <10xdJ-28r-23@gated-at.bofh.it>
     [not found]     ` <10xdJ-28r-25@gated-at.bofh.it>
     [not found]       ` <10xdJ-28r-21@gated-at.bofh.it>
2003-12-08 22:28         ` Rafal Skoczylas
  -- strict thread matches above, loose matches on Subject: below --
2003-12-08  2:24 Gordon Cormack
2003-12-08  2:37 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).