On Fri, 2011-09-02 at 23:34 -0400, Pavel Ivanov wrote:
> Hi,
>
> I can reliably reproduce a complete machine lockup when compiling
> kernel sources with "make -j". After making some progress machine
> stops responding to anything (including CapsLock/NumLock switching or
> mouse moving) and after hard reboot nothing is left in kern.log or
> syslog. Only attaching a serial console gives me the following clues
> to what happens:
>
> [ 376.460584] INFO: task cc1:6839 blocked for more than 60 seconds.
> [ 376.533411] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 376.627129] INFO: task cc1:6840 blocked for more than 60 seconds.
> [ 376.699991] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 376.793636] INFO: task cc1:6850 blocked for more than 60 seconds.
> [ 376.866397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 376.960026] INFO: task cc1:7017 blocked for more than 60 seconds.
> [ 377.032776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 377.128156] INFO: task cc1:7079 blocked for more than 60 seconds.
> [ 377.200907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 377.294522] INFO: task cc1:7188 blocked for more than 60 seconds.
> [ 377.367274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 377.460984] INFO: task cc1:8342 blocked for more than 60 seconds.
> [ 377.533746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 377.627372] INFO: task cc1:8425 blocked for more than 60 seconds.
> [ 377.700119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 377.793737] INFO: task cc1:8502 blocked for more than 60 seconds.
> [ 377.866488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 377.960103] INFO: task cc1:8535 blocked for more than 60 seconds.
> [ 378.034788] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.

I've investigated this problem a little more. It turned out that the
problem happens on an ext4 filesystem too (initially I reported it on
ecryptfs and thought that it was somehow related to that). And it looks
like the machine is not completely stalled; it still manages to do some
work (repeated reports from the hung task detector are a little
different). But I still can't understand what happens there.

I was able to catch some stack traces via the serial console (kernel is
3.1-rc8 with minor changes). I've attached everything I caught. In
short: there are a lot of tasks that are repeatedly reported hung with
the following stack:

[ 264.428209] [] schedule+0x3f/0x60
[ 264.488832] [] schedule_timeout+0x335/0x3c0
[ 264.559760] [] ? wait_for_common+0x4a/0x180
[ 264.630695] [] ? wait_for_common+0x4a/0x180
[ 264.701621] [] ? wait_for_common+0x10f/0x180
[ 264.773592] [] wait_for_common+0x117/0x180
[ 264.843481] [] ? try_to_wake_up+0x2f0/0x2f0
[ 264.914414] [] wait_for_completion+0x1d/0x20
[ 264.986383] [] do_fork+0x1b1/0x380
[ 265.047957] [] ? set_current_blocked+0x52/0x60
[ 265.122002] [] sys_vfork+0x25/0x30
[ 265.183573] [] stub_vfork+0x13/0x20
[ 265.246188] [] ? system_call_fastpath+0x16/0x1b

Among others there are khugepaged, compiz and unity processes
apparently waiting on disk read, and even this:

[ 514.443268] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 0, t=6002 jiffies)
[ 514.443460] Stack:
[ 514.443462] ffff88013bc03d38
[ 514.443464] ffffffff8135a222
[ 514.443465] 000000205d343732
[ 514.443467] 00000000000003e9
[ 514.443469] 0000000000001000
[ 514.443470] ffffffff81cd6e00
[ 514.443472] 0000000000000400
[ 514.443473] 0000000000000096
[ 514.443475] ffff88013bc03d48
[ 514.443477] ffffffff8135a13e
[ 514.443478] ffff88013bc03d68
[ 514.443480] ffffffff810722d2
[ 514.443482] Call Trace:
[ 514.443484] [ 514.443485] [] delay_tsc+0x82/0xf0
[ 514.443491] [] __const_udelay+0x2e/0x30
[ 514.443495] [] native_safe_apic_wait_icr_idle+0x22/0x50
[ 514.443500] [] default_send_IPI_mask_sequence_phys+0x103/0x110
[ 514.443506] [] physflat_send_IPI_all+0x17/0x20
[ 514.443510] [] arch_trigger_all_cpu_backtrace+0x5a/0x90
[ 514.443514] [] __rcu_pending+0x37f/0x3e0
[ 514.443519] [] rcu_check_callbacks+0x132/0x1b0
[ 514.443523] [] update_process_times+0x48/0x90
[ 514.443528] [] tick_sched_timer+0x60/0xc0
[ 514.443534] [] __run_hrtimer+0x74/0x250
[ 514.443537] [] ? tick_nohz_handler+0x100/0x100
[ 514.443541] [] hrtimer_interrupt+0x103/0x230
[ 514.443544] [] smp_apic_timer_interrupt+0x66/0x98
[ 514.443549] [] apic_timer_interrupt+0x73/0x80
[ 514.443554] [ 514.443555] [] ? intel_idle+0xdf/0x140
[ 514.443559] [] ? intel_idle+0xdb/0x140
[ 514.443563] [] cpuidle_idle_call+0xc0/0x240
[ 514.443568] [] cpu_idle+0xd5/0x140
[ 514.443572] [] rest_init+0xd5/0xe4
[ 514.443575] [] ? csum_partial_copy_generic+0x16c/0x16c
[ 514.443578] [] start_kernel+0x3e1/0x3ec
[ 514.443583] [] x86_64_start_reservations+0x132/0x136
[ 514.443587] [] ? early_idt_handlers+0x140/0x140
[ 514.443590] [] x86_64_start_kernel+0x102/0x111

(and then all CPUs' stacks are inside cpuidle_idle_call)

So can somebody suggest how I can debug this problem further and
pinpoint the reason for such a freeze? Or maybe someone has ideas on
what the culprit is?

Thank you,
Pavel
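
One way to compare the repeated hung task reports is to group them by
task name and PID and look at when each one was reported. Below is a
minimal Python sketch, assuming the serial console output has been saved
as text. The function name summarize_hung_tasks and the sample string
are hypothetical, made up for illustration, and the regular expression
assumes task names without spaces.

```python
import re
from collections import defaultdict

# Matches hung-task detector lines such as:
#   [ 376.460584] INFO: task cc1:6839 blocked for more than 60 seconds.
# Assumption: the task name contains no whitespace.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(log_text):
    """Return {(comm, pid): [timestamps]} for every hung-task report found."""
    reports = defaultdict(list)
    for match in HUNG_RE.finditer(log_text):
        key = (match.group("comm"), int(match.group("pid")))
        reports[key].append(float(match.group("ts")))
    return dict(reports)


if __name__ == "__main__":
    # Hypothetical sample; in practice read the captured console log instead.
    sample = (
        "[ 376.460584] INFO: task cc1:6839 blocked for more than 60 seconds.\n"
        "[ 376.627129] INFO: task cc1:6840 blocked for more than 60 seconds.\n"
    )
    for (comm, pid), stamps in sorted(summarize_hung_tasks(sample).items()):
        print(f"{comm}:{pid} reported {len(stamps)} time(s), first at {stamps[0]:.6f}")
```

A task that shows up with several timestamps is being reported hung
repeatedly, while tasks that appear only once may have made progress in
between, which would fit the observation that the machine is not
completely stalled.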