* Re: kernel BUG at mm/prio_tree.c:377
@ 2004-11-04  1:27 Rajesh Venkatasubramanian
  2004-11-09  0:43 ` Ray Van Dolson
  0 siblings, 1 reply; 6+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-11-04  1:27 UTC (permalink / raw)
  To: Ray Van Dolson; +Cc: LKML


Hi Ray,

Can you please apply the patch I recently posted and report
back.

http://marc.theaimsgroup.com/?l=linux-kernel&m=109926628920398

The patch fixes a bug reported earlier. However, earlier
oops were triggered at mm/prio_tree.c:538.

I haven't looked at the trace carefully. I will do so.
Please report back if the previous patch fixes your problem.

Thanks,
Rajesh

-----------------------------------------------------

Ray Van Dolson <rayvd@digitalpath.net> wrote:

Description of problem:
Running on an HP DL140, w/ Dual 2.4GHz Xeon's.  1GB of ECC DDR.  Fedora
Core 2.

This server operates as a PPTP Concentrator running the PoPToP server
(1.2.1) along with pppd 2.4.3.  We have tried this system using both
the onboard Broadcom gigabit NIC's as well as a dual Intel EEPro 100.

Usually within 24 hours of bootup, the following oops occurs:

kernel BUG at mm/prio_tree.c:377!
invalid operand: 0000 [#1]
SMP
Modules linked in: ipt_LOG(U) sch_tbf(U) ppp_mppe(U) ppp_async(U)
crc_ccitt(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U)
ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U)
ip_tables(U) md5(U) ipv6(U) sunrpc(U) e100(U) mii(U) sg(U)
scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U)
battery(U) asus_acpi(U) ac(U) ext3(U) jbd(U)
CPU:    1
EIP:    0060:[<021425de>]    Tainted: P
EFLAGS: 00010202   (2.6.8-1.521custom)
EIP is at prio_tree_right+0x85/0xc5
eax: 00000009   ebx: 0cf1acf8   ecx: 00000000   edx: 12da3d00
esi: 00000000   edi: 00000004   ebp: 404a6d78   esp: 0cf1ac90
ds: 007b   es: 007b   ss: 0068
Process yum (pid: 24194, threadinfo=0cf1a000 task=12e4ecb0)
Stack: 0cf1acf8 00000004 00000004 404a6d78 021427ae 00000004 0cf1acb0
0cf1acb4 00000000 00000043 0cf1acf8 404a6d78 00000004 08ec1ac4 02142968
00000004 0000007b 404a6d54 034fac80 02150cf7 00000004 00000004 00000004
00000001
Call Trace:
 [<021427ae>] prio_tree_next+0x89/0x9b
 [<02142968>] vma_prio_tree_next+0x4b/0x63
 [<02150cf7>] page_referenced+0x14d/0x18d
 [<021478cd>] refill_inactive_zone+0x245/0x6a0
 [<0211b29e>] activate_task+0x86/0x93
 [<02147db5>] shrink_zone+0x8d/0xb4
 [<02147e1f>] shrink_caches+0x43/0x4e
 [<02147edd>] try_to_free_pages+0xb3/0x16c
 [<02140369>] __alloc_pages+0x1c8/0x2be
 [<0214bd83>] do_anonymous_page+0xb6/0x241
 [<0214bf77>] do_no_page+0x69/0x3a0
 [<0214c460>] handle_mm_fault+0xdf/0x1d4
 [<0211955b>] do_page_fault+0x17c/0x58b
 [<0214e81d>] unmap_vma_list+0xe/0x17
 [<0214ebd5>] do_munmap+0x17a/0x186
 [<0214fcef>] move_page_tables+0x3f/0x4c
 [<0214fded>] move_vma+0xf1/0x175
 [<0215017a>] do_mremap+0x309/0x32c
 [<021193df>] do_page_fault+0x0/0x58b
Code: 0f 0b 79 01 cf fa 2e 02 39 52 04 74 08 0f 0b 7a 01 cf fa 2e

The system continues to function for approximately another minute.  I
see messages such as the following on the console repeatedly:

dst cache overflow

Eventually the system becomes completely unresponsive.  When I hit the
power button, ACPI tries to power down the system, but hangs after
killing a few processes and I must hard reset it.

I do not think this is bad hardware as we have approximately 11
DL140's and this will happen on all of them although more quickly on
the ones with higher user load (network traffic, CPU usage, etc).

Hoping someone can give me some suggestions if this is more likely to be a
hardware issue... just can't imagine getting that many bad servers. :)

Thanks in advance.


* Re: kernel BUG at mm/prio_tree.c:377
  2004-11-04  1:27 kernel BUG at mm/prio_tree.c:377 Rajesh Venkatasubramanian
@ 2004-11-09  0:43 ` Ray Van Dolson
  0 siblings, 0 replies; 6+ messages in thread
From: Ray Van Dolson @ 2004-11-09  0:43 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: LKML

Rajesh, I applied your patch and it definitely seems to have helped.  The
server lasted nearly three days. :-)  In fact, it didn't really seem to
hard lock but I had to reset it to get things working after the latest
crash.

Details:


 kernel BUG at kernel/exit.c:842!
 invalid operand: 0000 [#1]
 SMP 
 Modules linked in: sch_tbf(U) ppp_async(U) crc_ccitt(U) ppp_mppe(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) ac(U) ext3(U) jbd(U)
 CPU:    2
 EIP:    0060:[<02121d10>]    Tainted: P   VLI
 EFLAGS: 00010246   (2.6.9-1.1_FC2custom) 
 EIP is at do_exit+0x3b3/0x3bd
 eax: 00000000   ebx: 26506560   ecx: 26506000   edx: 0381dd60
 esi: 41fec340   edi: 26506030   ebp: 00001000   esp: 23b82f98
 ds: 007b   es: 007b   ss: 0068
 Process pppd (pid: 27091, threadinfo=23b82000 task=26506030)
 Stack: 0d611e00 00001000 23b82000 23b82000 02121e05 00001000 23b82fc4 00000010 
        f6f32684 23b82000 fffec200 00000010 00000000 00000000 00000010 f6f32684 
        fef87938 000000fc 0000007b 0000007b 000000fc f6fa37a2 00000073 00000246 
 Call Trace:
  [<02121e05>] sys_exit_group+0x0/0xd
 Code: c1 e0 07 8d 04 10 ff 88 00 01 00 00 83 3a 02 75 0b 8b 82 08 11 00 00 e8 d8 95 ff ff 89 6f 7c 89 f8 e8 88 f5 ff ff e8 bc 74 19 00 <0f> 0b 4a 03 96 09 2d 02 eb fe 53 85 c0 89 d3 74 05 e8 35 ab ff 
  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000024
  printing eip:
 0211ddb0
 *pde = 00004001
 Oops: 0000 [#2]
 SMP 
 Modules linked in: sch_tbf(U) ppp_async(U) crc_ccitt(U) ppp_mppe(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) ac(U) ext3(U) jbd(U)
 CPU:    2
 EIP:    0060:[<0211ddb0>]    Tainted: P   VLI
 EFLAGS: 00010286   (2.6.9-1.1_FC2custom) 
 EIP is at mm_release+0x33/0x70
 eax: 00000000   ebx: 26506030   ecx: 00000000   edx: 00000000
 esi: f6ff6828   edi: 00000000   ebp: 0000000b   esp: 23b82e50
 ds: 007b   es: 007b   ss: 0068
 Process pppd (pid: 27091, threadinfo=23b82000 task=26506030)
 Stack: 00000000 00000000 23b82f64 26506030 02121a20 23b82000 23b82f64 00000000 
        022c9112 021064a2 0000000b 23b82f64 022c9112 00000000 000000ff 0000000b 
        00000000 02106784 00001000 23b82f64 00000000 02106784 00001000 02106850 
 Call Trace:
  [<02121a20>] do_exit+0xc3/0x3bd
  [<021064a2>] do_divide_error+0x0/0xea
  [<02106784>] do_invalid_op+0x0/0xd5
  [<02106784>] do_invalid_op+0x0/0xd5
  [<02106850>] do_invalid_op+0xcc/0xd5
  [<0211bff5>] load_balance+0x27/0x135
  [<02121d10>] do_exit+0x3b3/0x3bd
  [<022b9a4a>] schedule+0x87e/0x8aa
  [<0217e45d>] proc_delete_inode+0x0/0x61
  [<022b9a4a>] schedule+0x87e/0x8aa
  [<02121d10>] do_exit+0x3b3/0x3bd
  [<02121e05>] sys_exit_group+0x0/0xd
 Code: 8b 90 14 01 00 00 31 c0 8e e0 8e e8 85 d2 74 11 c7 83 14 01 00 00 00 00 00 00 89 d0 e8 b5 ea ff ff 8b b3 1c 01 00 00 85 f6 74 38 <8b> 47 24 48 7e 32 c7 83 1c 01 00 00 00 00 00 00 89 e2 89 f1 c7 
  <1>Unable to handle kernel NULL pointer dereference at virtual address 00000024
  printing eip:
 0211ddb0
 *pde = 00004001
 Oops: 0000 [#3]
 SMP 
 Modules linked in: sch_tbf(U) ppp_async(U) crc_ccitt(U) ppp_mppe(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) ac(U) ext3(U) jbd(U)
 CPU:    2
 EIP:    0060:[<0211ddb0>]    Tainted: P   VLI
 EFLAGS: 00010286   (2.6.9-1.1_FC2custom) 
 EIP is at mm_release+0x33/0x70
 eax: 00000000   ebx: 26506030   ecx: 00000000   edx: 00000000
 esi: f6ff6828   edi: 00000000   ebp: 0000000b   esp: 23b82cc8
 ds: 007b   es: 007b   ss: 0068

I also started noticing "Neighbour table overflow" error messages.
This server makes heavy use of proxy arp, so I wonder if I need to tweak
the gc_thresh* and the other gc* variables in proc...
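
For reference, a minimal sketch of how the current neighbour-table thresholds could be inspected before tweaking them. It assumes the tunables live under /proc/sys/net/ipv4/neigh/default, which is the usual location for IPv4; the function name and the `base` parameter are made up for illustration.

```python
from pathlib import Path

# Assumed location of the IPv4 neighbour (ARP) table tunables; adjust if
# your kernel exposes them elsewhere.
NEIGH_DEFAULT = Path("/proc/sys/net/ipv4/neigh/default")


def read_gc_thresholds(base=NEIGH_DEFAULT):
    """Return {name: int} for each gc_thresh1..3 file found under `base`."""
    values = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        path = Path(base) / name
        # Skip silently if a file is missing (e.g. not running on Linux).
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in sorted(read_gc_thresholds().items()):
        print(f"{name} = {value}")
```

gc_thresh3 is the hard upper limit on neighbour entries, so on a box doing a lot of proxy arp it is probably the first value to compare against the number of entries actually in use.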

The weird thing is that even after these "oopses" happened, the box was
still functioning.  I could access the web server running on it, it was
still passing traffic for existing tunnels, but I could not establish new
ones.  Couldn't ssh in, etc (thus I had to hard reset it).

As you can see, this is running on kernel 2.6.9 (from the Fedora Core 2
testing update tree) with the patch you mentioned below.

Any ideas?

On Wed, Nov 03, 2004 at 08:27:09PM -0500, Rajesh Venkatasubramanian wrote:
> Hi Ray,
>
> Can you please apply the patch I recently posted and report
> back.
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109926628920398
>
> The patch fixes a bug reported earlier. However, earlier
> oops were triggered at mm/prio_tree.c:538.
>
> I haven't looked at the trace carefully. I will do so.
> Please report back if the previous patch fixes your problem.
>
> Thanks,
> Rajesh
>


* Re: kernel BUG at mm/prio_tree.c:377
  2004-11-04 16:14   ` Ray Van Dolson
@ 2004-11-04 16:32     ` Ray Van Dolson
  0 siblings, 0 replies; 6+ messages in thread
From: Ray Van Dolson @ 2004-11-04 16:32 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

I should note that the reason it taints the kernel is that it uses the BSD
license.
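
A rough sketch of why a BSD-licensed module shows up as "Tainted: P": the kernel only treats a fixed set of license strings as GPL-compatible, and anything else sets the proprietary taint flag. The list below is an assumption based on the strings accepted by license_is_gpl_compatible() in recent kernels; the 2.6.9 list may differ slightly. It also assumes ppp_mppe declares a plain "BSD" license string, as this message implies. The helper name is hypothetical.

```python
# Assumption: mirrors the license strings accepted as GPL-compatible by
# recent kernels (include/linux/license.h); older kernels may differ.
GPL_COMPATIBLE = {
    "GPL",
    "GPL v2",
    "GPL and additional rights",
    "Dual BSD/GPL",
    "Dual MIT/GPL",
    "Dual MPL/GPL",
}


def taints_kernel(module_license):
    """Return True if loading a module with this license string would
    mark the kernel as tainted (the 'P' flag in an oops)."""
    return module_license not in GPL_COMPATIBLE


if __name__ == "__main__":
    for lic in ("BSD", "Dual BSD/GPL", "GPL"):
        print(f"{lic!r}: taints={taints_kernel(lic)}")
```

So the taint here most likely says something about the license string rather than about a binary-only driver being loaded.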

On Thu, Nov 04, 2004 at 08:14:30AM -0800, Ray Van Dolson wrote:
> ppp_mppe patch from the pppd package. Lots of people use it without
> problems. If it is the source of troubles, that won't be good as we need
> it for our clients to connect. :)
>
> On Thu, Nov 04, 2004 at 09:04:16AM +0100, Arjan van de Ven wrote:
> > On Wed, 2004-11-03 at 16:36 -0800, Ray Van Dolson wrote:
> > > Description of problem:
> > > Running on an HP DL140, w/ Dual 2.4GHz Xeon's. 1GB of ECC DDR. Fedora
> > > Core 2.
> > > EIP: 0060:[<021425de>] Tainted: P
> > Which binary only driver are you using ?


* Re: kernel BUG at mm/prio_tree.c:377
  2004-11-04  8:04 ` Arjan van de Ven
@ 2004-11-04 16:14   ` Ray Van Dolson
  2004-11-04 16:32     ` Ray Van Dolson
  0 siblings, 1 reply; 6+ messages in thread
From: Ray Van Dolson @ 2004-11-04 16:14 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

ppp_mppe patch from the pppd package.  Lots of people use it without
problems.  If it is the source of troubles, that won't be good as we need
it for our clients to connect. :)

On Thu, Nov 04, 2004 at 09:04:16AM +0100, Arjan van de Ven wrote:
> On Wed, 2004-11-03 at 16:36 -0800, Ray Van Dolson wrote:
> > Description of problem:
> > Running on an HP DL140, w/ Dual 2.4GHz Xeon's. 1GB of ECC DDR. Fedora
> > Core 2.
> > EIP: 0060:[<021425de>] Tainted: P
> Which binary only driver are you using ?


* Re: kernel BUG at mm/prio_tree.c:377
  2004-11-04  0:36 Ray Van Dolson
@ 2004-11-04  8:04 ` Arjan van de Ven
  2004-11-04 16:14   ` Ray Van Dolson
  0 siblings, 1 reply; 6+ messages in thread
From: Arjan van de Ven @ 2004-11-04  8:04 UTC (permalink / raw)
  To: Ray Van Dolson; +Cc: linux-kernel

On Wed, 2004-11-03 at 16:36 -0800, Ray Van Dolson wrote:
> Description of problem:
> Running on an HP DL140, w/ Dual 2.4GHz Xeon's.  1GB of ECC DDR.  Fedora
> Core 2.

> EIP:    0060:[<021425de>]    Tainted: P  

Which binary only driver are you using ?



* kernel BUG at mm/prio_tree.c:377
@ 2004-11-04  0:36 Ray Van Dolson
  2004-11-04  8:04 ` Arjan van de Ven
  0 siblings, 1 reply; 6+ messages in thread
From: Ray Van Dolson @ 2004-11-04  0:36 UTC (permalink / raw)
  To: linux-kernel

Description of problem:
Running on an HP DL140, w/ Dual 2.4GHz Xeon's.  1GB of ECC DDR.  Fedora
Core 2.

This server operates as a PPTP Concentrator running the PoPToP server
(1.2.1) along with pppd 2.4.3.  We have tried this system using both
the onboard Broadcom gigabit NIC's as well as a dual Intel EEPro 100.

Usually within 24 hours of bootup, the following oops occurs:

kernel BUG at mm/prio_tree.c:377!
invalid operand: 0000 [#1]
SMP
Modules linked in: ipt_LOG(U) sch_tbf(U) ppp_mppe(U) ppp_async(U)
crc_ccitt(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U)
ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U)
ip_tables(U) md5(U) ipv6(U) sunrpc(U) e100(U) mii(U) sg(U)
scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U)
battery(U) asus_acpi(U) ac(U) ext3(U) jbd(U)
CPU:    1
EIP:    0060:[<021425de>]    Tainted: P  
EFLAGS: 00010202   (2.6.8-1.521custom) 
EIP is at prio_tree_right+0x85/0xc5
eax: 00000009   ebx: 0cf1acf8   ecx: 00000000   edx: 12da3d00
esi: 00000000   edi: 00000004   ebp: 404a6d78   esp: 0cf1ac90
ds: 007b   es: 007b   ss: 0068
Process yum (pid: 24194, threadinfo=0cf1a000 task=12e4ecb0)
Stack: 0cf1acf8 00000004 00000004 404a6d78 021427ae 00000004 0cf1acb0
0cf1acb4 00000000 00000043 0cf1acf8 404a6d78 00000004 08ec1ac4 02142968
00000004 0000007b 404a6d54 034fac80 02150cf7 00000004 00000004 00000004
00000001 
Call Trace:
 [<021427ae>] prio_tree_next+0x89/0x9b
 [<02142968>] vma_prio_tree_next+0x4b/0x63
 [<02150cf7>] page_referenced+0x14d/0x18d
 [<021478cd>] refill_inactive_zone+0x245/0x6a0
 [<0211b29e>] activate_task+0x86/0x93
 [<02147db5>] shrink_zone+0x8d/0xb4
 [<02147e1f>] shrink_caches+0x43/0x4e
 [<02147edd>] try_to_free_pages+0xb3/0x16c
 [<02140369>] __alloc_pages+0x1c8/0x2be
 [<0214bd83>] do_anonymous_page+0xb6/0x241
 [<0214bf77>] do_no_page+0x69/0x3a0
 [<0214c460>] handle_mm_fault+0xdf/0x1d4
 [<0211955b>] do_page_fault+0x17c/0x58b
 [<0214e81d>] unmap_vma_list+0xe/0x17
 [<0214ebd5>] do_munmap+0x17a/0x186
 [<0214fcef>] move_page_tables+0x3f/0x4c
 [<0214fded>] move_vma+0xf1/0x175
 [<0215017a>] do_mremap+0x309/0x32c
 [<021193df>] do_page_fault+0x0/0x58b
Code: 0f 0b 79 01 cf fa 2e 02 39 52 04 74 08 0f 0b 7a 01 cf fa 2e 

The system continues to function for approximately another minute.  I
see messages such as the following on the console repeatedly:

dst cache overflow 

Eventually the system becomes completely unresponsive.  When I hit the
power button, ACPI tries to power down the system, but hangs after
killing a few processes and I must hard reset it.

I do not think this is bad hardware as we have approximately 11
DL140's and this will happen on all of them although more quickly on
the ones with higher user load (network traffic, CPU usage, etc).

Hoping someone can give me some suggestions if this is more likely to be a
hardware issue... just can't imagine getting that many bad servers. :)

Thanks in advance.

