* Re: 2.6.test11 bug
@ 2003-12-08 3:46 Rafal Skoczylas
2003-12-08 4:17 ` William Lee Irwin III
2003-12-08 5:17 ` Linus Torvalds
0 siblings, 2 replies; 8+ messages in thread
From: Rafal Skoczylas @ 2003-12-08 3:46 UTC (permalink / raw)
To: linux-kernel
[ I am sorry if this message doesn't get threaded in the original
thread in your software, but I read lkml ocassionally through
usenet-gate so didn't have the original message in my mbox and
couldn't just hit "reply" ;) ]
On Mon, 08 Dec 2003 03:30:12 +0100 Gordon Cormack wrote:
> I have read the FAQ but I'm confused about how to report a 2.6
> kernel bug, or who to report it to.
> [...]
> Dec 6 13:16:01 flax20 kernel: Bad page state at free_hot_cold_page
> Dec 6 13:16:01 flax20 kernel: flags:0x02000114 mapping:00000000
> mapped:1 count:0
> [...]
Hello.
I am experiencing similiar behaviour as described below.
In my case it is mlnetd (of mldonkey package) which seems to be
responsible for driving kernel to a crash[1].
After a few hours of running, either the process gets killed or system
crashes (I am only able to reboot it with alt+prntscr+b, but it seems
like it is not able to [S]ync or [U]nmount filesystems - i have lost
a few files which were open at the time of crash[2]).
It may be worth to mention that I don't remember having such a crash
on 2.6.0-test9 which i used for a couple of weeks (since first day
it apeared on ftp.kernel.org untill test11 - i skipped test10).
Hardware:
---------
ALi M1647, Duron 1200, 512MB sdram.
Kernel:
-------
Linux poziomka 2.6.0-test11 #32 Fri Dec 5 21:10:40 CET 2003 i686
AMD_Duron(TM)Processor unknown Shameless Compilation
Compiled with gcc-3.3.2.
Logs:
-----
--- The last time, i got the following:
Bad page state at free_hot_cold_page
flags:0x01020008 mapping:d38afe68 mapped:0 count:0
Backtrace:
Call Trace:
[bad_page+93/144] bad_page+0x5d/0x90
[free_hot_cold_page+82/256] free_hot_cold_page+0x52/0x100
[zap_pte_range+358/416] zap_pte_range+0x166/0x1a0
[zap_pmd_range+75/112] zap_pmd_range+0x4b/0x70
[unmap_page_range+67/112] unmap_page_range+0x43/0x70
[unmap_vmas+225/528] unmap_vmas+0xe1/0x210
[exit_mmap+123/400] exit_mmap+0x7b/0x190
[mmput+100/192] mmput+0x64/0xc0
[do_exit+282/976] do_exit+0x11a/0x3d0
[do_group_exit+58/176] do_group_exit+0x3a/0xb0
[get_signal_to_deliver+590/848] get_signal_to_deliver+0x24e/0x350
[do_signal+149/288] do_signal+0x95/0x120
[schedule+761/1392] schedule+0x2f9/0x570
[pipe_write+0/800] pipe_write+0x0/0x320
[sys_rt_sigsuspend+222/272] sys_rt_sigsuspend+0xde/0x110
[syscall_call+7/11] syscall_call+0x7/0xb
Trying to fix it up, but a reboot is needed
--- But sometimes it get things like this:
Unable to handle kernel paging request at virtual address 5a85fb5c
printing eip:
c011e6c4
*pde = 00000000
Oops: 0002 [#1]
CPU: 0
EIP: 0060:[remove_wait_queue+36/112] Not tainted
EFLAGS: 00010002
EIP is at remove_wait_queue+0x24/0x70
eax: defb4000 ebx: da85fb58 ecx: 5a85fb58 edx: db0468b0
esi: db0468bc edi: 00000292 ebp: defb5fa0 esp: defb5f58
ds: 007b es: 007b ss: 0068
Process mlnetd (pid: 1456, threadinfo=defb4000 task=dd72e100)
Stack: db0468ac db046008 db046000 c0167484 00000000 cad86c08 00000001 c01681f5
defb5fa0 cad86c00 defb5fa0 00000041 defb4000 08376b48 cad86c08 00000000
cad86c00 00000001 c01674b0 db046000 00000000 3fd32b9c 083767d8 00000001
Call Trace:
[poll_freewait+36/80] poll_freewait+0x24/0x50
[sys_poll+581/656] sys_poll+0x245/0x290
[__pollwait+0/208] __pollwait+0x0/0xd0
[syscall_call+7/11] syscall_call+0x7/0xb
Code: 89 59 04 89 0b c7 46 04 00 02 20 00 c7 42 0c 00 01 10 00 57
<6>note: mlnetd[1456] exited with preempt_count 1
bad: scheduling while atomic!
Call Trace:
[schedule+1373/1392] schedule+0x55d/0x570
[unmap_page_range+67/112] unmap_page_range+0x43/0x70
[unmap_vmas+436/528] unmap_vmas+0x1b4/0x210
[exit_mmap+123/400] exit_mmap+0x7b/0x190
[mmput+100/192] mmput+0x64/0xc0
[do_exit+282/976] do_exit+0x11a/0x3d0
[do_page_fault+0/1292] do_page_fault+0x0/0x50c
[die+225/240] die+0xe1/0xf0
[do_page_fault+474/1292] do_page_fault+0x1da/0x50c
[do_IRQ+253/304] do_IRQ+0xfd/0x130
[common_interrupt+24/32] common_interrupt+0x18/0x20
[tcp_poll+18/352] tcp_poll+0x12/0x160
[do_page_fault+0/1292] do_page_fault+0x0/0x50c
[error_code+45/56] error_code+0x2d/0x38
[remove_wait_queue+36/112] remove_wait_queue+0x24/0x70
[poll_freewait+36/80] poll_freewait+0x24/0x50
[sys_poll+581/656] sys_poll+0x245/0x290
[__pollwait+0/208] __pollwait+0x0/0xd0
[syscall_call+7/11] syscall_call+0x7/0xb
(this last Call Trace repeted 2 more times)
If there is any important information missing, feel free to ask.
Regards.
[1] Actually, I am not sure, but this is the only candidate because
it uses more and more memory over the time and the crash or kill occurs
more or less at the same level of memory usage (~10-12% of 512Meg).
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nils 1781 11.1 10.5 41252 38796 pts/2 S 03:39 3:15 mlnetd
^^^^
[2] The files loss is probably XFS-related problem.
nils.
--
"Blessed is the man, who having nothing to say, abstains from giving wordy
evidence of the fact." -- http://secprog.org/who/rs/quote.php?id=1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.6.test11 bug 2003-12-08 3:46 2.6.test11 bug Rafal Skoczylas @ 2003-12-08 4:17 ` William Lee Irwin III 2003-12-08 5:17 ` Linus Torvalds 1 sibling, 0 replies; 8+ messages in thread From: William Lee Irwin III @ 2003-12-08 4:17 UTC (permalink / raw) To: Rafal Skoczylas; +Cc: linux-kernel On Mon, Dec 08, 2003 at 04:46:31AM +0100, Rafal Skoczylas wrote: > I am experiencing similiar behaviour as described below. > In my case it is mlnetd (of mldonkey package) which seems to be > responsible for driving kernel to a crash[1]. > After a few hours of running, either the process gets killed or system > crashes (I am only able to reboot it with alt+prntscr+b, but it seems > like it is not able to [S]ync or [U]nmount filesystems - i have lost > a few files which were open at the time of crash[2]). > It may be worth to mention that I don't remember having such a crash > on 2.6.0-test9 which i used for a couple of weeks (since first day > it apeared on ftp.kernel.org untill test11 - i skipped test10). I'm grabbing mldonkey and taking it for a spin. -- wli ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.6.test11 bug 2003-12-08 3:46 2.6.test11 bug Rafal Skoczylas 2003-12-08 4:17 ` William Lee Irwin III @ 2003-12-08 5:17 ` Linus Torvalds 2003-12-08 9:02 ` Xavier Bestel ` (2 more replies) 1 sibling, 3 replies; 8+ messages in thread From: Linus Torvalds @ 2003-12-08 5:17 UTC (permalink / raw) To: Rafal Skoczylas; +Cc: Kernel Mailing List, William Lee Irwin III On Mon, 8 Dec 2003, Rafal Skoczylas wrote: > > --- But sometimes it get things like this: > > Unable to handle kernel paging request at virtual address 5a85fb5c > printing eip: > c011e6c4 > *pde = 00000000 > Oops: 0002 [#1] > CPU: 0 > EIP: 0060:[remove_wait_queue+36/112] Not tainted > EFLAGS: 00010002 > EIP is at remove_wait_queue+0x24/0x70 > eax: defb4000 ebx: da85fb58 ecx: 5a85fb58 edx: db0468b0 > esi: db0468bc edi: 00000292 ebp: defb5fa0 esp: defb5f58 > Trace: > [poll_freewait+36/80] poll_freewait+0x24/0x50 > [sys_poll+581/656] sys_poll+0x245/0x290 > [__pollwait+0/208] __pollwait+0x0/0xd0 > [syscall_call+7/11] syscall_call+0x7/0xb Ahh.. Btw, this looks like a single-bit error. In particular, this is "list_del()" in remove_wait_queue(), and it's list_del() on a list that usually only has a single entry (ie a wait queue). In particular, it's the "prev->next = next" part, where "prev" is 5a85fb58 and next is da85fb58. Now, with a single entry, prev and next _should_ be the same (they both point to the head). And they are - except for a single bit error. The difference is the high bit of the word. Interesting. You get an oops because that 5a85fb58 _should_ be da85fb58. Notice? > If there is any important information missing, feel free to ask. It could be bad memory. We even know the address that is bad: it's (%esi+4), ie bit 31 of the word at physical address 0x1b0468f0. However, if you don't see random SIGSEGV's while compiling etc issues, it doesnt' sound like flaky RAM. I'm wondering if there is some bit operation out there somewhere with a wild pointer? Rafal - how consistent is the second form of the oops? Have you seen that trace more than once? Might it actually be "poll()" itself that has some bad behaviour? Linus ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.6.test11 bug 2003-12-08 5:17 ` Linus Torvalds @ 2003-12-08 9:02 ` Xavier Bestel 2003-12-08 16:27 ` Rafal Skoczylas [not found] ` <20031208161742.GB9087@secprog.org> 2 siblings, 0 replies; 8+ messages in thread From: Xavier Bestel @ 2003-12-08 9:02 UTC (permalink / raw) To: Linus Torvalds Cc: Rafal Skoczylas, Linux Kernel Mailing List, William Lee Irwin III Le lun 08/12/2003 à 06:17, Linus Torvalds a écrit : > It could be bad memory. We even know the address that is bad: it's > (%esi+4), ie bit 31 of the word at physical address 0x1b0468f0. > > However, if you don't see random SIGSEGV's while compiling etc issues, it > doesnt' sound like flaky RAM. (FWIW) I have a server running 2.4.22 and pppoe (to talk to an ADSL modem). It works really flawlessly *except* when I run mldonkey: then pppoe regularly fails and drops the connection. As this thing generates a lot of packets to/from different hosts, I suspects it's an excellent test workload for some paths of the kernel. Xav ^ permalink raw reply [flat|nested] 8+ messages in thread
* (no subject) 2003-12-08 5:17 ` Linus Torvalds 2003-12-08 9:02 ` Xavier Bestel @ 2003-12-08 16:27 ` Rafal Skoczylas [not found] ` <20031208161742.GB9087@secprog.org> 2 siblings, 0 replies; 8+ messages in thread From: Rafal Skoczylas @ 2003-12-08 16:27 UTC (permalink / raw) To: linux-kernel On Sun, Dec 07, 2003 at 09:17:35PM -0800, Linus Torvalds wrote: > On Mon, 8 Dec 2003, Rafal Skoczylas wrote: > > Unable to handle kernel paging request at virtual address 5a85fb5c > > [...] > > EIP: 0060:[remove_wait_queue+36/112] Not tainted > > [...] > > eax: defb4000 ebx: da85fb58 ecx: 5a85fb58 edx: db0468b0 > > esi: db0468bc edi: 00000292 ebp: defb5fa0 esp: defb5f58 > > Trace: > > [poll_freewait+36/80] poll_freewait+0x24/0x50 > > [sys_poll+581/656] sys_poll+0x245/0x290 > > [__pollwait+0/208] __pollwait+0x0/0xd0 > > [syscall_call+7/11] syscall_call+0x7/0xb > > It could be bad memory. We even know the address that is bad: it's > (%esi+4), ie bit 31 of the word at physical address 0x1b0468f0. > However, if you don't see random SIGSEGV's while compiling etc issues, it > doesnt' sound like flaky RAM. Indeed, I do not have any random SIGSEGVs at any time. Additionaly, as what you said sounded right to me I performed extensive memory tests with x86-memtest v3.0 during the night and as I expected memory seems to be OK. > Rafal - how consistent is the second form of the oops? > Have you seen that trace more than once? Not exactly the same, but there are some similarities (If I understand this log correctly). I ripped those oopses out of the logs so maybe you could look yourself and see something I don't see: http://secprog.org/who/rs/linux/2.6-test11-log.txt These are oopses I have experienced on test11 (Unfortunately, I dont have logs from test9 since I don't keep logs that long on workstation). nils. -- "Blessed is the man, who having nothing to say, abstains from giving wordy evidence of the fact." -- http://secprog.org/who/rs/quote.php?id=1 ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20031208161742.GB9087@secprog.org>]
[parent not found: <Pine.LNX.4.58.0312080848560.13236@home.osdl.org>]
* Re: 2.6.test11 bug [not found] ` <Pine.LNX.4.58.0312080848560.13236@home.osdl.org> @ 2003-12-08 17:12 ` Linus Torvalds [not found] ` <20031209194827.GA22265@secprog.org> 0 siblings, 1 reply; 8+ messages in thread From: Linus Torvalds @ 2003-12-08 17:12 UTC (permalink / raw) To: Rafal Skoczylas; +Cc: Kernel Mailing List Rafael - you had the wrong address for the kernel mailing list, and I didn't notice when I replied. Here is my comment again for the mailing list in case somebody else can think of something. The more heads, the merrier. Linus On Mon, 8 Dec 2003, Linus Torvalds wrote: > > On Mon, 8 Dec 2003, Rafal Skoczylas wrote: > > > > > Rafal - how consistent is the second form of the oops? > > > Have you seen that trace more than once? > > > > Not exactly the same, but there are some similarities (If I understand > > this log correctly). I ripped those oopses out of the logs so maybe you > > could look yourself and see something I don't see: > > http://secprog.org/who/rs/linux/2.6-test11-log.txt > > Oh wow. > > They all look to be (except for the odd last "bad page state" one, which > is likely just a result of some _other_ earlier corruption) due to the > high bit being cleared. And it's consistent across reboots too, so it's > not just some corruption that stayed around in memory. > > In all cases it looks like a pointer that should have been of the form > "0xDxxxxxxx", and was corrupted into the form "0x5xxxxxxx". > > And every time it's "mlnetd" - which may just be a coincidence (possibly > brought on by that being the most commonly run thing on your box), but it > certainly looks like it could also be an indication of the source of the > corruption. > > I find it interesting that it seems to be bit 31 that is corrupt: > admittedly that is also the only bit that is almost guaranteed to cause > oopses when it corrupts (because it changes the pointer "the most"), but > it's also to some degree one of the least used bits for bit operations. A > lot of things use bit #0 for locking etc. > > I'll have to think about this, but quite frankly I'm also hoping to see > more of a pattern about what this is all about. Can you keep your oopses > up somewhere? Maybe opening a bug on bugme.osdl.org? Even though I don't > use bugme personally, it's good to keep the record around when we don't > immediately see the reason for something.. > > Linus > ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20031209194827.GA22265@secprog.org>]
[parent not found: <Pine.LNX.4.58.0312091221440.21456@home.osdl.org>]
* Re: 2.6.test11 bug [not found] ` <Pine.LNX.4.58.0312091221440.21456@home.osdl.org> @ 2003-12-09 22:31 ` Rafal Skoczylas 2003-12-09 23:26 ` Linus Torvalds 0 siblings, 1 reply; 8+ messages in thread From: Rafal Skoczylas @ 2003-12-09 22:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel [ I am leaving most Linus' quotations so everyone could see what we were talking about as I haven't CC'ed previous email to lkml nor have Linus ] On Tue, Dec 09, 2003 at 12:31:15PM -0800, Linus Torvalds wrote: > > First I had an impression like my ram has to be broken as the address is > > the same in all but one, but how would it pass all those memtests > > I performed 2 nights ago? And why is it allways the same call trace? > > It's interesting - it looks like it's the > > dentry = file->f_dentry; > > thing, with "file" being NULL. > > Can you do a "gdb vmlinux" and then do "x/20i __fput" for me, to show what > your compiler is doing? > > Oh, and do a "disassemble fput" too. > > Because "file" really shouldn't be NULL there - and even if it was, we > should have gotten an oops _earlier_, when we did the > "atomic_dec_and_test()" on "file->f_count". 2.6.0-test11 -- no frame pointers gcc (GCC) 3.3.2 (Shameless Linux) GNU gdb 6.0 [...] (gdb) x/20i __fput 0xc0155850 <__fput>: push %ebp 0xc0155851 <__fput+1>: push %edi 0xc0155852 <__fput+2>: push %esi 0xc0155853 <__fput+3>: push %ebx 0xc0155854 <__fput+4>: mov %eax,%ebx 0xc0155856 <__fput+6>: sub $0xc,%esp 0xc0155859 <__fput+9>: mov 0xc(%ebx),%edi 0xc015585c <__fput+12>: mov 0x8(%eax),%eax 0xc015585f <__fput+15>: mov %eax,0x8(%esp,1) 0xc0155863 <__fput+19>: mov 0x8(%eax),%ebp 0xc0155866 <__fput+22>: lea 0x78(%ebx),%eax 0xc0155869 <__fput+25>: cmp %eax,0x78(%ebx) 0xc015586c <__fput+28>: jne 0xc0155963 <__fput+275> 0xc0155872 <__fput+34>: mov %ebx,(%esp,1) 0xc0155875 <__fput+37>: call 0xc016ae50 <locks_remove_flock> 0xc015587a <__fput+42>: mov 0x10(%ebx),%eax 0xc015587d <__fput+45>: test %eax,%eax 0xc015587f <__fput+47>: je 0xc015588c <__fput+60> 0xc0155881 <__fput+49>: mov 0x30(%eax),%esi 0xc0155884 <__fput+52>: test %esi,%esi (gdb) disassemble fput Dump of assembler code for function fput: 0xc0155830 <fput+0>: mov %eax,%edx 0xc0155832 <fput+2>: decl 0x14(%eax) 0xc0155835 <fput+5>: sete %al 0xc0155838 <fput+8>: test %al,%al 0xc015583a <fput+10>: jne 0xc0155840 <fput+16> 0xc015583c <fput+12>: nop 0xc015583d <fput+13>: ret 0xc015583e <fput+14>: mov %esi,%esi 0xc0155840 <fput+16>: mov %edx,%eax 0xc0155842 <fput+18>: jmp 0xc0155850 <__fput> 0xc0155847 <fput+23>: mov %esi,%esi 0xc0155849 <fput+25>: lea 0x0(%edi,1),%edi End of assembler dump. (gdb) > Hmm.. Maybe that is actually the "dentry->d_inode" access instead - > although for me the inode field is at offset 12. But if you have an UP > kernel, offset 8 might be the right one.. Anyway, a NULL inode is _also_ a > bug at that point, so it's not as if that solves anything. > > I wonder if there is a problem with gcc-3.3.2. Have you tried other > compilers? Not yet. Now I'm downloading 2.9x and will try with that one. > Also, it might be interesting to hear if there is a difference if you > compile with frame pointers off (or on, if they are now off). Now they are off. I'll try turning on and if it won't get any better I'll try with another compiler. Anyway, while compiling with frame pointers I got another oops ;/ This one is different those we have seen so far: http://secprog.org/who/rs/linux/2003-12-09b-linux-2.6.0-test11.txt For those who haven't seen previous ones from today: http://secprog.org/who/rs/linux/2003-12-09-linux-2.6.0-test11.txt nils. -- "Blessed is the man, who having nothing to say, abstains from giving wordy evidence of the fact." -- http://secprog.org/who/rs/quote.php?id=1 ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.6.test11 bug 2003-12-09 22:31 ` Rafal Skoczylas @ 2003-12-09 23:26 ` Linus Torvalds 0 siblings, 0 replies; 8+ messages in thread From: Linus Torvalds @ 2003-12-09 23:26 UTC (permalink / raw) To: Rafal Skoczylas; +Cc: Kernel Mailing List On Tue, 9 Dec 2003, Rafal Skoczylas wrote: > > Anyway, while compiling with frame pointers I got another oops ;/ > This one is different those we have seen so far: Actually, it's very similar to the ones you had earlier - the backtrace is different, but it's the same thing again: ecx: 5944f160 edx: d944f160 that's a doubly-linked list, and the two values _should_ be the same, I bet. Except one of them has the high bit clear, and when following the bogus pointer, you get a page fault. I think the second one (the "kernel bug in mmap.c" one) is just a result of the first oops one leaving the mm lists in a bad state and map_count being off as a result, or something like that. Linus ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2003-12-09 23:27 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2003-12-08 3:46 2.6.test11 bug Rafal Skoczylas 2003-12-08 4:17 ` William Lee Irwin III 2003-12-08 5:17 ` Linus Torvalds 2003-12-08 9:02 ` Xavier Bestel 2003-12-08 16:27 ` Rafal Skoczylas [not found] ` <20031208161742.GB9087@secprog.org> [not found] ` <Pine.LNX.4.58.0312080848560.13236@home.osdl.org> 2003-12-08 17:12 ` 2.6.test11 bug Linus Torvalds [not found] ` <20031209194827.GA22265@secprog.org> [not found] ` <Pine.LNX.4.58.0312091221440.21456@home.osdl.org> 2003-12-09 22:31 ` Rafal Skoczylas 2003-12-09 23:26 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).