* 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
@ 2009-10-17 22:34 ` Justin Piszcz
  0 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-17 22:34 UTC (permalink / raw)
To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz

Hello,

I have a system I recently upgraded from 2.6.30.x, and after approximately 24-48 hours (sometimes longer) the system cannot write any more files to disk. Luckily I can still write to /dev/shm, to which I have saved the sysrq-t and sysrq-w output:

http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt

Configuration:

$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
      136448 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      129596288 blocks [2/2] [UU]

md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]

md0 : active raid1 sdb1[1] sda1[0]
      16787776 blocks [2/2] [UU]

$ mount
/dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw,noatime)
/dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

Distribution: Debian Testing
Arch: x86_64

The problem occurs with 2.6.31; I upgraded to 2.6.31.4 and the problem persists. Here is a snippet of two processes in D-state; the first was not doing anything, the second was mrtg.
[121444.684000] pickup D 0000000000000003 0 18407 4521 0x00000000
[121444.684000]  ffff880231dd2290 0000000000000086 0000000000000000 0000000000000000
[121444.684000]  000000000000ff40 000000000000c8c8 ffff880176794d10 ffff880176794f90
[121444.684000]  000000032266dd08 ffff8801407a87f0 ffff8800280878d8 ffff880176794f90
[121444.684000] Call Trace:
[121444.684000]  [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
[121444.684000]  [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
[121444.684000]  [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
[121444.684000]  [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
[121444.684000]  [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
[121444.684000]  [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
[121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

[121444.684000] rateup D 0000000000000000 0 18538 18465 0x00000000
[121444.684000]  ffff88023f8a8c10 0000000000000082 0000000000000000 ffff88023ea09ec8
[121444.684000]  000000000000ff40 000000000000c8c8 ffff88023faace50 ffff88023faad0d0
[121444.684000]  0000000300003e00 000000010720cc78 0000000000003e00 ffff88023faad0d0
[121444.684000] Call Trace:
[121444.684000]  [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
[121444.684000]  [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
[121444.684000]  [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
[121444.684000]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[121444.684000]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[121444.684000]  [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
[121444.684000]  [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
[121444.684000]  [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
[121444.684000]  [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
[121444.684000]  [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
[121444.684000]  [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
[121444.684000]  [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
[121444.684000]  [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
[121444.684000]  [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
[121444.684000]  [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
[121444.684000]  [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
[121444.684000]  [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
[121444.684000]  [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
[121444.684000]  [<ffffffff8108b21a>] ? __writepage+0xa/0x40
[121444.684000]  [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
[121444.684000]  [<ffffffff8108b210>] ? __writepage+0x0/0x40
[121444.684000]  [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
[121444.684000]  [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
[121444.684000]  [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
[121444.684000]  [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
[121444.684000]  [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
[121444.684000]  [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
[121444.684000]  [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
[121444.684000]  [<ffffffff810af556>] ? filp_close+0x56/0x90
[121444.684000]  [<ffffffff810af636>] ? sys_close+0xa6/0x100
[121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

Anyone know what is going on here?

Justin.

^ permalink raw reply [flat|nested] 49+ messages in thread
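For anyone trying to capture the same kind of evidence, the sysrq dumps above can also be triggered without a console keyboard by writing to /proc/sysrq-trigger. A minimal sketch, assuming CONFIG_MAGIC_SYSRQ is enabled and the commands run as root (the output file name is illustrative):

```shell
# Allow all sysrq functions (a bitmask of specific functions also works).
echo 1 > /proc/sys/kernel/sysrq

# 'w' dumps blocked (D-state) tasks, 't' dumps all tasks,
# both into the kernel ring buffer.
echo w > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger

# Filesystem writes hang in this bug, but tmpfs still works,
# so save the ring buffer under /dev/shm as done above.
dmesg > /dev/shm/sysrq-w.txt
```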
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-17 22:34 ` Justin Piszcz
@ 2009-10-18 20:17 ` Justin Piszcz
  0 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-18 20:17 UTC (permalink / raw)
To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz

On Sat, 17 Oct 2009, Justin Piszcz wrote:
> Hello,

It has happened again; all sysrq-X output was saved this time.

wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt

Kernel configuration:

wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.30.9.txt
wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.31.4.txt

Diff of the two configs:

$ diff config-2.6.30.9.txt config-2.6.31.4.txt | grep -v "#" | grep "_"
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_CONSTRUCTORS=y
> CONFIG_HAVE_PERF_COUNTERS=y
> CONFIG_HAVE_DMA_ATTRS=y
> CONFIG_BLK_DEV_BSG=y
> CONFIG_X86_NEW_MCE=y
> CONFIG_X86_THERMAL_VECTOR=y
< CONFIG_UNEVICTABLE_LRU=y
< CONFIG_PHYSICAL_START=0x200000
> CONFIG_PHYSICAL_START=0x1000000
< CONFIG_PHYSICAL_ALIGN=0x200000
> CONFIG_PHYSICAL_ALIGN=0x1000000
< CONFIG_COMPAT_NET_DEV_OPS=y
< CONFIG_SND_JACK=y
> CONFIG_HID_DRAGONRISE=y
> CONFIG_HID_GREENASIA=y
> CONFIG_HID_SMARTJOYPLUS=y
> CONFIG_HID_THRUSTMASTER=y
> CONFIG_HID_ZEROPLUS=y
> CONFIG_FSNOTIFY=y
> CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
> CONFIG_HAVE_ARCH_KMEMCHECK=y

I have reverted to 2.6.30.9 to see if the problem recurs with that kernel version.
I do not recall seeing this on the older 2.6.30.x kernels:

[    9.276427] md3: detected capacity change from 0 to 5251073572864
[    9.277411] md2: detected capacity change from 0 to 132706598912
[    9.278305] md1: detected capacity change from 0 to 139722752
[    9.278921] md0: detected capacity change from 0 to 17190682624

Again, some more D-state processes:

[76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
[76325.608087] Call Trace:
[76325.608095]  [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608099]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608103]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[76325.608106]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40

[76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
[76325.608214] Call Trace:
[76325.608216]  [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
[76325.608220]  [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
[76325.608222]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
[76325.608225]  [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
[76325.608228]  [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
[76325.608231]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
[76325.608233]  [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
[76325.608236]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0

All of the D-state processes:

$ cat sysrq-w.txt | grep ' D'
[76307.285125] alpine D 0000000000000000 0 7659 29120 0x00000000
[76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
[76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
[76325.608257] syslogd D 0000000000000002 0 2438 1 0x00000000
[76325.608318] freshclam D 0000000000000000 0 2877 1 0x00000000
[76325.608428] asterisk D 0000000000000001 0 3278 1 0x00000000
[76325.608492] console-kit-d D 0000000000000000 0 3299 1 0x00000000
[76325.608562] dhcpd3 D 0000000000000000 0 3554 1 0x00000000
[76325.608621] plasma-deskto D 0000000000000002 0 32482 1 0x00000000
[76325.608713] kaccess D 0000000000000001 0 32488 1 0x00000000
[76325.608752] mail D 0000000000000000 0 7397 7386 0x00000000
[76325.608830] hal-acl-tool D 0000000000000000 0 7430 3399 0x00000004
[76325.608888] mrtg D 0000000000000000 0 7444 7433 0x00000000
[76325.608981] cron D 0000000000000000 0 7500 3630 0x00000000
[76325.609000] alpine D 0000000000000000 0 7659 29120 0x00000000

List of functions underneath the D-state processes (sorted/uniqued):

121 [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
 77 [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
 62 [<ffffffff814543a5>] ? schedule_timeout+0x165/0x1a0
 60 [<ffffffff813bc1f6>] ? __alloc_skb+0x66/0x170
 60 [<ffffffff813b3e59>] ? sys_sendto+0x119/0x180
 59 [<ffffffff81428397>] ? unix_dgram_sendmsg+0x467/0x5c0
 59 [<ffffffff81427ce6>] ? unix_wait_for_peer+0x86/0xd0
 59 [<ffffffff813bd497>] ? memcpy_fromiovec+0x57/0x80
 59 [<ffffffff813b6c29>] ? sock_alloc_send_pskb+0x1d9/0x2f0
 59 [<ffffffff813b3a4b>] ? sock_sendmsg+0xcb/0x100
 59 [<ffffffff813b3062>] ? sockfd_lookup_light+0x22/0x80
 58 [<ffffffff814287ed>] ? unix_dgram_connect+0xad/0x270
 58 [<ffffffff813b3336>] ? sys_connect+0x86/0xe0
 57 [<ffffffff81427ed5>] ? unix_find_other+0x1a5/0x200
 57 [<ffffffff810c9d13>] ? mntput_no_expire+0x23/0xf0
 57 [<ffffffff810a3e74>] ? page_add_new_anon_rmap+0x54/0x90
 57 [<ffffffff8105947e>] ? current_fs_time+0x1e/0x30
 55 [<ffffffff81085445>] ? filemap_fault+0x95/0x3e0
  8 [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
  7 [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
  7 [<ffffffff810af727>] ? do_sys_open+0x97/0x150
  6 [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
  5 [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
  4 [<ffffffff811f5284>] ? xfs_file_fsync+0x54/0x70
  4 [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
  4 [<ffffffff811f0242>] ? kmem_zone_zalloc+0x32/0x50
  4 [<ffffffff811f01d3>] ? kmem_zone_alloc+0x83/0xc0
  4 [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
  4 [<ffffffff810d3a4b>] ? sys_fsync+0xb/0x20
  4 [<ffffffff810d39f6>] ? do_fsync+0x36/0x60
  4 [<ffffffff810d394e>] ? vfs_fsync+0x9e/0x110
  4 [<ffffffff810bbcde>] ? __link_path_walk+0x7e/0x1000
  3 [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
  3 [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
  3 [<ffffffff811f7b82>] ? xfs_vn_mknod+0x82/0x130
  3 [<ffffffff811eeab1>] ? xfs_fsync+0x141/0x190
  3 [<ffffffff811e8f1b>] ? _xfs_trans_commit+0x38b/0x3a0
  3 [<ffffffff811ddfac>] ? xlog_grant_log_space+0x28c/0x3c0
  3 [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
  3 [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
  3 [<ffffffff811dc1b0>] ? xfs_log_release_iclog+0x10/0x40
  3 [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
  3 [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
  3 [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
  3 [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
  3 [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
  3 [<ffffffff810c80ca>] ? alloc_fd+0x4a/0x140
  3 [<ffffffff810c2110>] ? pollwake+0x0/0x60
  3 [<ffffffff810c0b88>] ? poll_freewait+0x48/0xb0
  3 [<ffffffff810be8ee>] ? do_filp_open+0x9ee/0xac0
  3 [<ffffffff810be134>] ? do_filp_open+0x234/0xac0
  3 [<ffffffff810baeb6>] ? vfs_create+0xa6/0xf0
  3 [<ffffffff810b51d7>] ? vfs_fstatat+0x37/0x80
  3 [<ffffffff810ad46d>] ? kmem_cache_alloc+0x6d/0xa0
  3 [<ffffffff8104aca3>] ? __wake_up+0x43/0x70
  2 [<ffffffff81455797>] ? __down_write_nested+0x17/0xb0
  2 [<ffffffff81455151>] ? __down+0x61/0xa0
  2 [<ffffffff81454e85>] ? do_nanosleep+0x95/0xd0
  2 [<ffffffff81454dbd>] ? schedule_hrtimeout_range+0x11d/0x140
  2 [<ffffffff81454359>] ? schedule_timeout+0x119/0x1a0
  2 [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
  2 [<ffffffff811f4b82>] ? xfs_buf_read_flags+0x12/0xa0
  2 [<ffffffff811f4a4e>] ? xfs_buf_get_flags+0x6e/0x190
  2 [<ffffffff811f48f4>] ? _xfs_buf_find+0x134/0x220
  2 [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
  2 [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
  2 [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
  2 [<ffffffff811ed872>] ? xfs_create+0x312/0x530
  2 [<ffffffff811eb6e8>] ? xfs_dir_ialloc+0xa8/0x340
  2 [<ffffffff811ea4a6>] ? xfs_trans_read_buf+0x1e6/0x360
  2 [<ffffffff811dc337>] ? xlog_state_sync+0x157/0x2a0
  2 [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
  2 [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
  2 [<ffffffff810c31dc>] ? dput+0xac/0x160
  2 [<ffffffff810c29d3>] ? d_kill+0x53/0x70
  2 [<ffffffff810b9b38>] ? generic_permission+0x78/0x130
  2 [<ffffffff8109a9a5>] ? handle_mm_fault+0x1b5/0x780
  2 [<ffffffff810987fa>] ? __do_fault+0x3ca/0x4b0
  2 [<ffffffff8108cc30>] ? pdflush+0x0/0x220
  2 [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
  2 [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
  2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
  2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
  2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
  2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
  2 [<ffffffff8106e016>] ? down+0x46/0x50
  2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
  2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
  2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
  2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
  2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
  2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
  2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
  2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
  2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
  2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
  2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
  2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
  1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
  1 [<ffffffff81455857>] ? __down_read+0x17/0xae
  1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
  1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
  1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
  1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
  1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
  1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
  1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
  1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
  1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
  1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
  1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
  1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
  1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
  1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
  1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
  1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
  1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
  1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
  1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
  1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
  1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
  1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
  1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
  1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
  1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
  1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
  1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
  1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
  1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
  1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
  1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
  1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
  1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
  1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
  1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
  1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
  1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
  1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
  1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
  1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
  1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
  1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
  1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
  1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
  1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
  1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
  1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
  1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
  1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
  1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
  1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
  1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
  1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
  1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
  1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
  1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
  1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
  1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
  1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
  1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
  1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
  1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
  1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
  1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
  1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
  1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
  1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
  1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
  1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
  1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
  1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
  1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
  1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
  1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
  1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
  1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
  1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
  1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
  1 [<ffffffff810af556>] ? filp_close+0x56/0x90
  1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
  1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
  1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
  1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
  1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
  1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
  1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
  1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
  1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
  1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
  1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
  1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
  1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
  1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
  1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
  1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
  1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
  1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
  1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
  1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
  1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
  1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20
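For reference, a sorted/uniqued summary like the list above can be produced with a short pipeline over the captured sysrq-w output. A sketch against a stand-in sample file (the real input would be the full sysrq-w.txt; file names here are illustrative):

```shell
# Stand-in for a few lines of a real sysrq-w capture:
cat > /tmp/sysrq-w.sample <<'EOF'
[76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[76325.608110] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
EOF

# Take the last field (function+offset) of each stack-frame line,
# then count and rank: heavily repeated xlog_*/xfs_log_* frames
# show where the blocked tasks are piling up.
awk '/\[</ {print $NF}' /tmp/sysrq-w.sample | sort | uniq -c | sort -rn
```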
xfs_trans_log_inode+0x27/0x60 1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40 1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130 1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0 1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0 1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0 1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0 1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0 1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320 1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0 1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480 1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480 1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70 1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320 1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80 1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110 1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0 1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0 1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0 1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0 1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0 1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0 1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450 1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0 1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190 1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0 1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0 1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0 1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0 1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460 1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0 1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140 1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120 1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110 1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310 1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670 1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50 1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0 1 [<ffffffff810bb851>] ? 
may_open+0x1c1/0x1f0 1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60 1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0 1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0 1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130 1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80 1 [<ffffffff810af636>] ? sys_close+0xa6/0x100 1 [<ffffffff810af556>] ? filp_close+0x56/0x90 1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590 1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30 1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220 1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140 1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140 1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60 1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140 1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80 1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80 1 [<ffffffff81084d95>] ? sync_page+0x35/0x60 1 [<ffffffff81084d60>] ? sync_page+0x0/0x60 1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250 1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30 1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30 1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240 1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360 1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0 1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90 1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90 1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90 1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30 1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 49+ messages in thread
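The "sorted/uniqued" function histogram above can be reproduced from a saved sysrq-w capture with a short shell pipeline. This is a sketch only: the `/tmp/sysrq-w.sample` file below is a fabricated three-line stand-in so the pipeline is self-contained; in practice you would point the `grep` at the real sysrq-w.txt.

```shell
# Fabricated stand-in for the real sysrq-w capture (placeholder frames).
cat > /tmp/sysrq-w.sample <<'EOF'
[76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108] [<ffffffff811dc50b>] ? _xfs_log_force+0x51/0x80
EOF

# Keep only the "symbol+offset/size" token of each stack frame, then
# count duplicates and rank them, most frequent first.
grep -oE '\? [A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' /tmp/sysrq-w.sample |
  sed 's/^? //' | sort | uniq -c | sort -rn
```

Frames that occur in many blocked tasks (here `_xfs_log_force+0x51/0x80` counts twice) float to the top, which is exactly what makes the common wait point stand out in a capture with dozens of stuck tasks.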
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-18 20:17 ` Justin Piszcz
@ 2009-10-19  3:04 ` Dave Chinner
  0 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-19 3:04 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again, all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
> [76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
> [76325.608220] [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222] [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
> [76325.608225] [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
> [76325.608228] [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233] [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:

All pointing to log IO not completing.

That is, all of the D-state processes are backed up on locks or
waiting for IO completion processing. A lot of the processes are
waiting for _xfs_log_force to complete, others are waiting for
inodes to be unpinned or are backed up behind locked inodes that are
waiting on log IO to complete before they can complete the
transaction and unlock the inode, and so on.

Unfortunately, the xfslogd and xfsdatad kernel threads are not
present in any of the output given, so I can't tell if these have
deadlocked themselves and caused the problem. However, my experience
with such pile-ups is that an I/O completion has not been run for
some reason and that is the cause of the problem. I don't know if
you can provide enough information to tell us if this happened or
not. Instead, do you have a test case that you can share?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
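A quick way to act on this diagnosis before posting a capture is to check whether the sysrq-t output contains the xfslogd/xfsdatad (and xfsconvertd) completion threads at all. A minimal sketch, with a fabricated placeholder file standing in for the real sysrq-t.txt:

```shell
# Fabricated stand-in for a saved sysrq-t capture (placeholder lines).
cat > /tmp/sysrq-t.sample <<'EOF'
[76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
[76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
[76325.700000] xfslogd/0     R 0000000000000000     0   371      2 0x00000000
EOF

# Print the task header lines for the XFS I/O-completion threads; no
# output means the capture never dumped their stacks.
grep -E 'xfslogd|xfsdatad|xfsconvertd' /tmp/sysrq-t.sample
```

Against the real capture you would grep sysrq-t.txt directly and, if the threads do appear, include their full stack traces in the report, since those are exactly the stacks the analysis above says are missing.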
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-19  3:04 ` Dave Chinner
@ 2009-10-19 10:18 ` Justin Piszcz
  0 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-19 10:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Mon, 19 Oct 2009, Dave Chinner wrote:

> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again, all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Hello,

So far I do not have a reproducible test case. The only other thing not posted was the output of ps auxww at the time of the lockup; I am not sure whether it will help, but here it is:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2] root 2 0.0 0.0 0 0 ? S< Oct16 0:00 [kthreadd] root 3 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/0] root 4 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/0] root 5 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/1] root 6 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/1] root 7 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/2] root 8 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/2] root 9 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/3] root 10 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/3] root 11 0.0 0.0 0 0 ? R< Oct16 0:00 [events/0] root 12 0.0 0.0 0 0 ? S< Oct16 0:00 [events/1] root 13 0.0 0.0 0 0 ? S< Oct16 0:00 [events/2] root 14 0.0 0.0 0 0 ? S< Oct16 0:00 [events/3] root 15 0.0 0.0 0 0 ? S< Oct16 0:00 [khelper] root 20 0.0 0.0 0 0 ? S< Oct16 0:00 [async/mgr] root 180 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/0] root 181 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/1] root 182 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/2] root 183 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/3] root 185 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpid] root 186 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_notify] root 187 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_hotplug] root 271 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/0] root 272 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/1] root 273 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/2] root 274 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/3] root 275 0.0 0.0 0 0 ? S< Oct16 0:00 [ata_aux] root 276 0.0 0.0 0 0 ? S< Oct16 0:00 [ksuspend_usbd] root 280 0.0 0.0 0 0 ? S< Oct16 0:00 [khubd] root 283 0.0 0.0 0 0 ? S< Oct16 0:00 [kseriod] root 318 0.0 0.0 0 0 ? S< Oct16 0:00 [khpsbpkt] root 361 0.0 0.0 0 0 ? S Oct16 0:00 [pdflush] root 362 0.0 0.0 0 0 ? D Oct16 0:43 [pdflush] root 363 0.0 0.0 0 0 ? S< Oct16 0:21 [kswapd0] root 364 0.0 0.0 0 0 ?
S< Oct16 0:00 [aio/0] root 365 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/1] root 366 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/2] root 367 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/3] root 368 0.0 0.0 0 0 ? S< Oct16 0:00 [nfsiod] root 369 0.0 0.0 0 0 ? S< Oct16 0:00 [cifsoplockd] root 370 0.0 0.0 0 0 ? S< Oct16 0:00 [xfs_mru_cache] root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0] root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1] root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2] root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3] root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0] root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1] root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2] root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3] root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0] root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1] root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2] root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3] root 518 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_0] root 521 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_1] root 524 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_2] root 527 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_3] root 530 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_4] root 533 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_5] root 542 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_6] root 545 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_7] root 551 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_8] root 554 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_9] root 558 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_10] root 562 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_11] root 568 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_12] root 571 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_13] root 584 0.0 0.0 0 0 ? S< Oct16 0:00 [knodemgrd_0] root 616 0.0 0.0 0 0 ? S< Oct16 0:00 [kpsmoused] root 666 0.0 0.0 0 0 ? S< Oct16 0:00 [usbhid_resumer] root 683 0.0 0.0 0 0 ? S< Oct16 0:00 [hd-audio0] root 703 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/0] root 704 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/1] root 705 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/2] root 706 0.0 0.0 0 0 ? 
S< Oct16 0:00 [rpciod/3] root 811 4.0 0.0 0 0 ? S< Oct16 81:54 [md3_raid5] root 817 0.3 0.0 0 0 ? S< Oct16 6:30 [md2_raid1] root 823 0.0 0.0 0 0 ? S< Oct16 0:00 [md1_raid1] root 827 0.0 0.0 0 0 ? S< Oct16 0:50 [md0_raid1] root 829 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd] root 830 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild] root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd] root 884 0.0 0.0 16736 740 ? S<s Oct16 0:00 udevd --daemon postfix 1649 0.0 0.0 39124 2468 ? S 05:00 0:00 qmgr -l -t fifo -u -c www-data 1877 0.0 0.0 146612 5248 ? S 05:01 0:00 /usr/sbin/apache2 -k start root 3182 0.0 0.0 0 0 ? S< Oct16 0:00 [kjournald] root 3183 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd] root 3184 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild] root 3185 0.0 0.0 0 0 ? S< Oct16 0:00 [xfssyncd] root 3337 0.0 0.0 8072 132 ? Ss Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d root 3339 0.0 0.0 8072 424 ? Ds Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d daemon 3526 0.0 0.0 8016 564 ? Ss Oct16 0:00 /sbin/portmap statd 3538 0.0 0.0 10148 776 ? Ss Oct16 0:00 /sbin/rpc.statd root 3547 0.0 0.0 26956 572 ? Ss Oct16 0:00 /usr/sbin/rpc.idmapd root 3732 0.0 0.0 5900 676 ? Ds Oct16 0:00 /sbin/syslogd root 3741 0.0 0.0 3796 452 ? Ss Oct16 0:00 /sbin/klogd -x root 3750 0.0 0.0 3804 640 ? Ss Oct16 0:00 /usr/sbin/acpid 110 3760 0.0 0.0 23560 1476 ? Ss Oct16 0:02 /usr/bin/dbus-daemon --system bind 3773 0.0 0.8 264364 70004 ? Ssl Oct16 0:05 /usr/sbin/named -u bind -S 1024 root 3796 0.0 0.0 49028 1260 ? Ss Oct16 0:00 /usr/sbin/sshd root 3827 0.0 0.0 104804 7452 ? Ssl Oct16 0:08 /usr/sbin/console-kit-daemon amavis 3909 0.0 1.1 217948 91004 ? Ss Oct16 0:00 amavisd (master) nobody 3913 0.0 0.2 118612 21632 ? Sl Oct16 0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid polw 3932 0.0 0.1 51868 12084 ? Ss Oct16 0:00 policyd-weight (master) polw 3933 0.0 0.1 51868 11744 ? 
Ss Oct16 0:00 policyd-weight (cache) postfw 3937 0.0 0.1 55248 14064 ? Ss Oct16 0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid postgrey 3940 0.0 0.1 57900 13232 ? Ss Oct16 0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421 clamav 4166 0.0 1.3 215576 112772 ? Ssl Oct16 0:04 /usr/sbin/clamd clamav 4262 0.0 0.0 25776 1328 ? Ss Oct16 0:00 /usr/bin/freshclam -d --quiet amavis 4307 0.0 1.1 224072 96732 ? S Oct16 0:02 amavisd (ch14-avail) amavis 4308 0.0 1.2 228168 100820 ? S Oct16 0:05 amavisd (ch14-avail) root 4353 0.0 0.0 7224 3944 ? S Oct16 0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj root 4362 0.0 0.0 8864 520 ? Ss Oct16 0:00 /usr/sbin/irqbalance daemon 4377 0.0 0.0 38488 840 ? Ss Oct16 0:00 lpd Waiting root 4412 0.0 0.0 0 0 ? S< Oct16 0:00 [lockd] root 4413 0.0 0.0 0 0 ? S< Oct16 0:10 [nfsd] root 4414 0.0 0.0 0 0 ? D< Oct16 0:03 [nfsd] root 4415 0.0 0.0 0 0 ? D< Oct16 0:13 [nfsd] root 4416 0.0 0.0 0 0 ? S< Oct16 0:07 [nfsd] root 4417 0.0 0.0 0 0 ? D< Oct16 0:11 [nfsd] root 4418 0.0 0.0 0 0 ? D< Oct16 0:01 [nfsd] root 4419 0.0 0.0 0 0 ? D< Oct16 0:04 [nfsd] root 4420 0.0 0.0 0 0 ? D< Oct16 0:02 [nfsd] root 4424 0.0 0.0 18812 1208 ? Ss Oct16 0:00 /usr/sbin/rpc.mountd --manage-gids oident 4432 0.0 0.0 12232 576 ? Ss Oct16 0:00 /usr/sbin/oidentd -m -u oident -g oident root 4444 0.0 0.0 10124 660 ? Ss Oct16 0:00 /usr/sbin/inetd root 4521 0.0 0.0 37020 2388 ? Ss Oct16 0:00 /usr/lib/postfix/master root 4545 0.0 0.0 58084 1796 ? Ds Oct16 0:00 /usr/sbin/nmbd -D root 4547 0.0 0.0 93724 3044 ? Ss Oct16 0:00 /usr/sbin/smbd -D root 4568 0.0 0.0 93752 1776 ? S Oct16 0:00 /usr/sbin/smbd -D root 4700 0.0 0.0 18768 1348 ? 
S Oct16 0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid asterisk 4714 1.3 0.2 666972 24460 ? Ssl Oct16 26:36 /usr/sbin/asterisk -p -U asterisk asterisk 4715 0.0 0.0 13656 880 ? S Oct16 0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet ntp 4734 0.0 0.0 23428 1400 ? Ss Oct16 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g root 4789 0.0 0.0 16644 780 ? Ss Oct16 0:00 /usr/sbin/dovecot root 4790 0.0 0.0 74664 3196 ? S Oct16 0:00 dovecot-auth root 4792 0.0 0.0 74664 3240 ? S Oct16 0:00 dovecot-auth -w 111 4797 0.0 0.0 38084 4616 ? Ss Oct16 0:07 /usr/sbin/hald root 4798 0.0 0.0 20040 1432 ? S Oct16 0:01 hald-runner root 4895 0.0 0.0 22028 1240 ? S Oct16 0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3 root 4899 0.0 0.0 22032 1232 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled root 4900 0.0 0.0 22028 1228 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled 111 4902 0.0 0.0 25920 1204 ? S Oct16 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket root 4911 0.0 0.0 12484 656 ? Ss Oct16 0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog root 4929 0.0 0.0 24020 584 ? Ss Oct16 0:00 /usr/sbin/squid -D -YC proxy 4931 0.0 0.0 26900 5136 ? S Oct16 0:00 (squid) -D -YC root 4942 0.0 0.0 28924 3548 ? Ss Oct16 0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345 root 4943 0.0 0.0 17392 1380 ? S Oct16 0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1 daemon 4945 0.3 0.0 16596 3172 ? S Oct16 7:01 p0f -u daemon -i eth1 -l tcp dst port 25 root 4947 0.0 0.0 93604 5832 ? SNs Oct16 0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph root 4951 0.0 0.0 9296 3716 ? Ss Oct16 0:00 /usr/sbin/dhcpd3 -q eth0 nut 4964 0.0 0.0 14384 744 ? 
Ss Oct16 0:15 /lib/nut/usbhid-ups -a apc nut 4966 0.0 0.0 14316 600 ? Ss Oct16 0:01 /sbin/upsd root 4968 0.0 0.0 14284 728 ? Ss Oct16 0:00 /sbin/upsmon nut 4970 0.0 0.0 14284 712 ? S Oct16 0:01 /sbin/upsmon daemon 5007 0.0 0.0 16356 416 ? Ss Oct16 0:00 /usr/sbin/atd root 5027 0.0 0.0 20988 1060 ? Ss Oct16 0:00 /usr/sbin/cron dovecot 5887 0.0 0.0 18504 2108 ? S 15:31 0:00 imap-login root 6184 0.0 0.0 73496 6380 ? Sl Oct16 0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock root 6205 0.0 0.0 26292 720 ? Ss Oct16 0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc root 6213 0.1 1.3 192020 112852 tty7 Ss+ Oct16 2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6 root 6284 0.0 0.0 5852 600 tty1 Ss+ Oct16 0:00 /sbin/getty 38400 tty1 root 6285 0.0 0.0 5852 596 tty2 Ss+ Oct16 0:00 /sbin/getty 38400 tty2 root 6286 0.0 0.0 5852 596 tty3 Ss+ Oct16 0:00 /sbin/getty 38400 tty3 root 6287 0.0 0.0 5852 600 tty4 Ss+ Oct16 0:00 /sbin/getty 38400 tty4 root 6288 0.0 0.0 5852 596 tty5 Ss+ Oct16 0:00 /sbin/getty 38400 tty5 root 6289 0.0 0.0 5852 600 tty6 Ss+ Oct16 0:00 /sbin/getty 38400 tty6 root 6297 0.0 0.0 55716 1864 ? S Oct16 0:00 -:0 postfix 6362 0.0 0.0 39076 2308 ? S 15:36 0:00 anvil -l -t unix -u -o max_idle=3600s root 7003 0.0 0.0 80904 3412 ? Ss Oct16 0:00 sshd: ap [priv] polw 7443 0.0 0.1 52000 12516 ? S Oct16 0:00 policyd-weight (child) dovecot 8922 0.0 0.0 18504 2108 ? S 16:01 0:00 imap-login dovecot 8923 0.0 0.0 18504 2112 ? S 16:01 0:00 imap-login postfix 18407 0.0 0.0 39076 2352 ? D 17:50 0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks asterisk 18424 0.0 0.0 36996 2260 ? D 17:54 0:00 /usr/sbin/postdrop -r root 18425 0.0 0.0 46372 1440 ? S 17:55 0:00 /USR/SBIN/CRON root 18459 0.0 0.0 11404 1328 ? 
Ss 17:55 0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi root 18460 0.0 0.0 21888 1432 ? D 17:55 0:00 /usr/lib/hal/hal-acl-tool --reconfigure root 18465 0.0 0.2 43416 16892 ? S 17:55 0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg root 18479 0.0 0.0 4140 612 ? S 17:55 0:00 tee -a /var/log/mrtg/mrtg.log root 18538 0.0 0.0 38476 1976 ? D 17:55 0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0 postfix 18539 0.0 0.1 50740 9836 ? S 17:55 0:00 cleanup -z -t unix -u -c root 18555 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18556 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18557 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18558 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18559 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18560 0.0 0.0 46368 1172 ? S 18:02 0:00 /USR/SBIN/CRON root 18561 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18562 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18563 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18564 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18565 0.0 0.0 46368 1172 ? S 18:09 0:00 /USR/SBIN/CRON root 18566 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON root 18567 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON root 18568 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON root 18578 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON root 18579 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON root 18580 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON root 18581 0.0 0.0 46368 1172 ? 
S 18:15 0:00 /USR/SBIN/CRON root 18582 0.0 0.0 46368 1172 ? S 18:17 0:00 /USR/SBIN/CRON root 18583 0.0 0.0 46368 1172 ? S 18:18 0:00 /USR/SBIN/CRON postfix 18584 0.0 0.0 71452 8104 ? S 18:18 0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings postfix 18585 0.0 0.0 39076 2300 ? S 18:18 0:00 proxymap -t unix -u sshd 18597 0.0 0.0 0 0 ? Z 18:19 0:00 [sshd] <defunct> root 18601 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON root 18602 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON root 18603 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON root 30866 0.0 0.0 54392 944 ? Ss 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30867 0.0 0.0 54392 652 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30868 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30869 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30870 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 31018 0.0 0.0 146612 7856 ? Ss 04:31 0:00 /usr/sbin/apache2 -k start www-data 31025 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start www-data 31044 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start root 31076 0.0 0.0 23128 756 ? Ss 04:31 0:00 pure-ftpd (SERVER)
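The D-state rows can be pulled out of a saved `ps auxww` listing mechanically: in `ps aux` output the STAT field is the eighth column, and uninterruptible sleep is reported as a state string beginning with `D`. A sketch against a fabricated three-row sample (the real listing would be fed in instead):

```shell
# Fabricated excerpt of a "ps auxww" listing (placeholder rows).
cat > /tmp/ps.sample <<'EOF'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 362 0.0 0.0 0 0 ? D Oct16 0:43 [pdflush]
root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd]
root 884 0.0 0.0 16736 740 ? S<s Oct16 0:00 udevd --daemon
EOF

# Keep the header row plus every process whose STAT (column 8) starts
# with D, i.e. uninterruptible sleep.
awk 'NR == 1 || $8 ~ /^D/' /tmp/ps.sample
```

On the listing above this surfaces the blocked pdflush, xfssyncd, nfsd, ulogd, syslogd, nmbd, and mail-related processes in one screen, which is a more compact starting point than the full dump.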
S< Oct16 0:00 [rpciod/3] root 811 4.0 0.0 0 0 ? S< Oct16 81:54 [md3_raid5] root 817 0.3 0.0 0 0 ? S< Oct16 6:30 [md2_raid1] root 823 0.0 0.0 0 0 ? S< Oct16 0:00 [md1_raid1] root 827 0.0 0.0 0 0 ? S< Oct16 0:50 [md0_raid1] root 829 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd] root 830 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild] root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd] root 884 0.0 0.0 16736 740 ? S<s Oct16 0:00 udevd --daemon postfix 1649 0.0 0.0 39124 2468 ? S 05:00 0:00 qmgr -l -t fifo -u -c www-data 1877 0.0 0.0 146612 5248 ? S 05:01 0:00 /usr/sbin/apache2 -k start root 3182 0.0 0.0 0 0 ? S< Oct16 0:00 [kjournald] root 3183 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd] root 3184 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild] root 3185 0.0 0.0 0 0 ? S< Oct16 0:00 [xfssyncd] root 3337 0.0 0.0 8072 132 ? Ss Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d root 3339 0.0 0.0 8072 424 ? Ds Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d daemon 3526 0.0 0.0 8016 564 ? Ss Oct16 0:00 /sbin/portmap statd 3538 0.0 0.0 10148 776 ? Ss Oct16 0:00 /sbin/rpc.statd root 3547 0.0 0.0 26956 572 ? Ss Oct16 0:00 /usr/sbin/rpc.idmapd root 3732 0.0 0.0 5900 676 ? Ds Oct16 0:00 /sbin/syslogd root 3741 0.0 0.0 3796 452 ? Ss Oct16 0:00 /sbin/klogd -x root 3750 0.0 0.0 3804 640 ? Ss Oct16 0:00 /usr/sbin/acpid 110 3760 0.0 0.0 23560 1476 ? Ss Oct16 0:02 /usr/bin/dbus-daemon --system bind 3773 0.0 0.8 264364 70004 ? Ssl Oct16 0:05 /usr/sbin/named -u bind -S 1024 root 3796 0.0 0.0 49028 1260 ? Ss Oct16 0:00 /usr/sbin/sshd root 3827 0.0 0.0 104804 7452 ? Ssl Oct16 0:08 /usr/sbin/console-kit-daemon amavis 3909 0.0 1.1 217948 91004 ? Ss Oct16 0:00 amavisd (master) nobody 3913 0.0 0.2 118612 21632 ? Sl Oct16 0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid polw 3932 0.0 0.1 51868 12084 ? Ss Oct16 0:00 policyd-weight (master) polw 3933 0.0 0.1 51868 11744 ? 
Ss Oct16 0:00 policyd-weight (cache) postfw 3937 0.0 0.1 55248 14064 ? Ss Oct16 0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid postgrey 3940 0.0 0.1 57900 13232 ? Ss Oct16 0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421 clamav 4166 0.0 1.3 215576 112772 ? Ssl Oct16 0:04 /usr/sbin/clamd clamav 4262 0.0 0.0 25776 1328 ? Ss Oct16 0:00 /usr/bin/freshclam -d --quiet amavis 4307 0.0 1.1 224072 96732 ? S Oct16 0:02 amavisd (ch14-avail) amavis 4308 0.0 1.2 228168 100820 ? S Oct16 0:05 amavisd (ch14-avail) root 4353 0.0 0.0 7224 3944 ? S Oct16 0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj root 4362 0.0 0.0 8864 520 ? Ss Oct16 0:00 /usr/sbin/irqbalance daemon 4377 0.0 0.0 38488 840 ? Ss Oct16 0:00 lpd Waiting root 4412 0.0 0.0 0 0 ? S< Oct16 0:00 [lockd] root 4413 0.0 0.0 0 0 ? S< Oct16 0:10 [nfsd] root 4414 0.0 0.0 0 0 ? D< Oct16 0:03 [nfsd] root 4415 0.0 0.0 0 0 ? D< Oct16 0:13 [nfsd] root 4416 0.0 0.0 0 0 ? S< Oct16 0:07 [nfsd] root 4417 0.0 0.0 0 0 ? D< Oct16 0:11 [nfsd] root 4418 0.0 0.0 0 0 ? D< Oct16 0:01 [nfsd] root 4419 0.0 0.0 0 0 ? D< Oct16 0:04 [nfsd] root 4420 0.0 0.0 0 0 ? D< Oct16 0:02 [nfsd] root 4424 0.0 0.0 18812 1208 ? Ss Oct16 0:00 /usr/sbin/rpc.mountd --manage-gids oident 4432 0.0 0.0 12232 576 ? Ss Oct16 0:00 /usr/sbin/oidentd -m -u oident -g oident root 4444 0.0 0.0 10124 660 ? Ss Oct16 0:00 /usr/sbin/inetd root 4521 0.0 0.0 37020 2388 ? Ss Oct16 0:00 /usr/lib/postfix/master root 4545 0.0 0.0 58084 1796 ? Ds Oct16 0:00 /usr/sbin/nmbd -D root 4547 0.0 0.0 93724 3044 ? Ss Oct16 0:00 /usr/sbin/smbd -D root 4568 0.0 0.0 93752 1776 ? S Oct16 0:00 /usr/sbin/smbd -D root 4700 0.0 0.0 18768 1348 ? 
S Oct16 0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid asterisk 4714 1.3 0.2 666972 24460 ? Ssl Oct16 26:36 /usr/sbin/asterisk -p -U asterisk asterisk 4715 0.0 0.0 13656 880 ? S Oct16 0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet ntp 4734 0.0 0.0 23428 1400 ? Ss Oct16 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g root 4789 0.0 0.0 16644 780 ? Ss Oct16 0:00 /usr/sbin/dovecot root 4790 0.0 0.0 74664 3196 ? S Oct16 0:00 dovecot-auth root 4792 0.0 0.0 74664 3240 ? S Oct16 0:00 dovecot-auth -w 111 4797 0.0 0.0 38084 4616 ? Ss Oct16 0:07 /usr/sbin/hald root 4798 0.0 0.0 20040 1432 ? S Oct16 0:01 hald-runner root 4895 0.0 0.0 22028 1240 ? S Oct16 0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3 root 4899 0.0 0.0 22032 1232 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled root 4900 0.0 0.0 22028 1228 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled 111 4902 0.0 0.0 25920 1204 ? S Oct16 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket root 4911 0.0 0.0 12484 656 ? Ss Oct16 0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog root 4929 0.0 0.0 24020 584 ? Ss Oct16 0:00 /usr/sbin/squid -D -YC proxy 4931 0.0 0.0 26900 5136 ? S Oct16 0:00 (squid) -D -YC root 4942 0.0 0.0 28924 3548 ? Ss Oct16 0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345 root 4943 0.0 0.0 17392 1380 ? S Oct16 0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1 daemon 4945 0.3 0.0 16596 3172 ? S Oct16 7:01 p0f -u daemon -i eth1 -l tcp dst port 25 root 4947 0.0 0.0 93604 5832 ? SNs Oct16 0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph root 4951 0.0 0.0 9296 3716 ? Ss Oct16 0:00 /usr/sbin/dhcpd3 -q eth0 nut 4964 0.0 0.0 14384 744 ? 
Ss Oct16 0:15 /lib/nut/usbhid-ups -a apc nut 4966 0.0 0.0 14316 600 ? Ss Oct16 0:01 /sbin/upsd root 4968 0.0 0.0 14284 728 ? Ss Oct16 0:00 /sbin/upsmon nut 4970 0.0 0.0 14284 712 ? S Oct16 0:01 /sbin/upsmon daemon 5007 0.0 0.0 16356 416 ? Ss Oct16 0:00 /usr/sbin/atd root 5027 0.0 0.0 20988 1060 ? Ss Oct16 0:00 /usr/sbin/cron dovecot 5887 0.0 0.0 18504 2108 ? S 15:31 0:00 imap-login root 6184 0.0 0.0 73496 6380 ? Sl Oct16 0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock root 6205 0.0 0.0 26292 720 ? Ss Oct16 0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc root 6213 0.1 1.3 192020 112852 tty7 Ss+ Oct16 2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6 root 6284 0.0 0.0 5852 600 tty1 Ss+ Oct16 0:00 /sbin/getty 38400 tty1 root 6285 0.0 0.0 5852 596 tty2 Ss+ Oct16 0:00 /sbin/getty 38400 tty2 root 6286 0.0 0.0 5852 596 tty3 Ss+ Oct16 0:00 /sbin/getty 38400 tty3 root 6287 0.0 0.0 5852 600 tty4 Ss+ Oct16 0:00 /sbin/getty 38400 tty4 root 6288 0.0 0.0 5852 596 tty5 Ss+ Oct16 0:00 /sbin/getty 38400 tty5 root 6289 0.0 0.0 5852 600 tty6 Ss+ Oct16 0:00 /sbin/getty 38400 tty6 root 6297 0.0 0.0 55716 1864 ? S Oct16 0:00 -:0 postfix 6362 0.0 0.0 39076 2308 ? S 15:36 0:00 anvil -l -t unix -u -o max_idle=3600s root 7003 0.0 0.0 80904 3412 ? Ss Oct16 0:00 sshd: ap [priv] polw 7443 0.0 0.1 52000 12516 ? S Oct16 0:00 policyd-weight (child) dovecot 8922 0.0 0.0 18504 2108 ? S 16:01 0:00 imap-login dovecot 8923 0.0 0.0 18504 2112 ? S 16:01 0:00 imap-login postfix 18407 0.0 0.0 39076 2352 ? D 17:50 0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks asterisk 18424 0.0 0.0 36996 2260 ? D 17:54 0:00 /usr/sbin/postdrop -r root 18425 0.0 0.0 46372 1440 ? S 17:55 0:00 /USR/SBIN/CRON root 18459 0.0 0.0 11404 1328 ? 
Ss 17:55 0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi root 18460 0.0 0.0 21888 1432 ? D 17:55 0:00 /usr/lib/hal/hal-acl-tool --reconfigure root 18465 0.0 0.2 43416 16892 ? S 17:55 0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg root 18479 0.0 0.0 4140 612 ? S 17:55 0:00 tee -a /var/log/mrtg/mrtg.log root 18538 0.0 0.0 38476 1976 ? D 17:55 0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0 postfix 18539 0.0 0.1 50740 9836 ? S 17:55 0:00 cleanup -z -t unix -u -c root 18555 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18556 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18557 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18558 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18559 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON root 18560 0.0 0.0 46368 1172 ? S 18:02 0:00 /USR/SBIN/CRON root 18561 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18562 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18563 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18564 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON root 18565 0.0 0.0 46368 1172 ? S 18:09 0:00 /USR/SBIN/CRON root 18566 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON root 18567 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON root 18568 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON root 18578 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON root 18579 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON root 18580 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON root 18581 0.0 0.0 46368 1172 ? 
S 18:15 0:00 /USR/SBIN/CRON root 18582 0.0 0.0 46368 1172 ? S 18:17 0:00 /USR/SBIN/CRON root 18583 0.0 0.0 46368 1172 ? S 18:18 0:00 /USR/SBIN/CRON postfix 18584 0.0 0.0 71452 8104 ? S 18:18 0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings postfix 18585 0.0 0.0 39076 2300 ? S 18:18 0:00 proxymap -t unix -u sshd 18597 0.0 0.0 0 0 ? Z 18:19 0:00 [sshd] <defunct> root 18601 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON root 18602 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON root 18603 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON root 30866 0.0 0.0 54392 944 ? Ss 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30867 0.0 0.0 54392 652 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30868 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30869 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 30870 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5 root 31018 0.0 0.0 146612 7856 ? Ss 04:31 0:00 /usr/sbin/apache2 -k start www-data 31025 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start www-data 31044 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start root 31076 0.0 0.0 23128 756 ? Ss 04:31 0:00 pure-ftpd (SERVER) _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 49+ messages in thread
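Dave notes above that the xfslogd/xfsdatad kernel threads are absent from the sysrq output. Their kernel stacks can be captured directly the next time the hang occurs; a minimal sketch, assuming root and a kernel built with CONFIG_STACKTRACE (thread names taken from the ps listing above):

```shell
# Dump the kernel stack of each XFS worker thread via /proc/<pid>/stack.
# Illustrative helper, not from the thread; requires root and CONFIG_STACKTRACE.
for pid in $(pgrep 'xfslogd|xfsdatad|xfsaild|xfsbufd' 2>/dev/null); do
    echo "== pid $pid =="
    cat "/proc/$pid/stack" 2>/dev/null || echo "  (stack unavailable; run as root)"
done
```

This is exactly the information missing from the sysrq dumps: whether the completion threads are deadlocked themselves or merely never being scheduled.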
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-19 10:18 ` Justin Piszcz @ 2009-10-20 0:33 ` Dave Chinner -1 siblings, 0 replies; 49+ messages in thread From: Dave Chinner @ 2009-10-20 0:33 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote: > On Mon, 19 Oct 2009, Dave Chinner wrote: >> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote: >>> It has happened again, all sysrq-X output was saved this time. >> ..... >> >> All pointing to log IO not completing. >> .... > So far I do not have a reproducible test case, Ok. What sort of load is being placed on the machine? > the only other thing not posted was the output of ps auxww during > the time of the lockup, not sure if it will help, but here it is: > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2] .... > root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0] > root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1] > root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2] > root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3] > root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0] > root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1] > root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2] > root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3] > root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0] > root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1] > root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2] > root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3] ..... It appears that both the xfslogd and the xfsdatad on CPU 0 are in the running state but don't appear to be consuming any significant CPU time. If they remain like this then I think that means they are stuck waiting on the run queue. Do these XFS threads always appear like this when the hang occurs? 
If so, is there something else that is hogging CPU 0 preventing these threads from getting the CPU? Cheers, Dave. -- Dave Chinner david@fromorbit.com
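One way to answer that question the next time the hang occurs is to snapshot which tasks are resident on CPU 0. A rough sketch using standard ps(1) format specifiers (psr is the processor the task last ran on); the column choice is illustrative:

```shell
# List tasks whose last CPU was 0, highest CPU usage first.
# If xfslogd/0 is starved, whatever tops this list is the prime suspect.
ps -eo pid,psr,stat,pcpu,comm --sort=-pcpu | awk 'NR == 1 || $2 == 0'
```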
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-20 0:33 ` Dave Chinner @ 2009-10-20 8:33 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-10-20 8:33 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Tue, 20 Oct 2009, Dave Chinner wrote: > On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote: >> On Mon, 19 Oct 2009, Dave Chinner wrote: >>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote: >>>> It has happened again, all sysrq-X output was saved this time. >>> ..... >>> >>> All pointing to log IO not completing. >>> > .... >> So far I do not have a reproducible test case, > > Ok. What sort of load is being placed on the machine? Hello, generally the load is low, it mainly serves out some samba shares. > >> the only other thing not posted was the output of ps auxww during >> the time of the lockup, not sure if it will help, but here it is: >> >> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND >> root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2] > .... >> root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0] >> root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1] >> root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2] >> root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3] >> root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0] >> root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1] >> root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2] >> root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3] >> root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0] >> root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1] >> root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2] >> root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3] > ..... > > It appears that both the xfslogd and the xfsdatad on CPU 0 are in > the running state but don't appear to be consuming any significant > CPU time. 
If they remain like this then I think that means they are > stuck waiting on the run queue. Do these XFS threads always appear > like this when the hang occurs? If so, is there something else that > is hogging CPU 0 preventing these threads from getting the CPU? Yes, the XFS threads show up like this each time the kernel crashed. So far, with 2.6.30.9, it has not crashed after ~48hrs+, so it appears to be some issue introduced between 2.6.30.9 and 2.6.31.x. Any recommendations on how to catch this bug with certain options enabled, etc.? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
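Since the regression window is already narrowed to 2.6.30.9..2.6.31, the standard way to catch it is a git bisect of the kernel tree. A hedged sketch: KSRC is a placeholder path, and each bisect step needs a build, a boot, and a ~24-48 hour soak under the usual workload before it can be marked:

```shell
# Bisect between the last known-good and first known-bad kernel releases.
# KSRC is hypothetical; point it at a clone of the mainline kernel tree.
KSRC=${KSRC:-$HOME/src/linux-2.6}
if [ -d "$KSRC/.git" ]; then
    cd "$KSRC" || exit 1
    git bisect start v2.6.31 v2.6.30     # bad revision first, then good
    # ...build/boot the suggested revision, run the workload, then mark it:
    #   git bisect good     # survived the soak
    #   git bisect bad      # hung again
else
    echo "no kernel tree at $KSRC; set KSRC first"
fi
```

With roughly 10k commits between the two releases this converges in about 13-14 steps, though the long soak per step makes it slow going for a hang like this one.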
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-20 8:33 ` Justin Piszcz @ 2009-10-21 10:19 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-10-21 10:19 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Tue, 20 Oct 2009, Justin Piszcz wrote: > > > On Tue, 20 Oct 2009, Dave Chinner wrote: > >> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote: >>> On Mon, 19 Oct 2009, Dave Chinner wrote: >>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote: >>>>> It has happened again, all sysrq-X output was saved this time. >>>> ..... >>>> >>>> All pointing to log IO not completing. >>>> >> .... >>> So far I do not have a reproducible test case, >> >> Ok. What sort of load is being placed on the machine? > Hello, generally the load is low, it mainly serves out some samba shares. > >> >> It appears that both the xfslogd and the xfsdatad on CPU 0 are in >> the running state but don't appear to be consuming any significant >> CPU time. If they remain like this then I think that means they are >> stuck waiting on the run queue. Do these XFS threads always appear >> like this when the hang occurs? If so, is there something else that >> is hogging CPU 0 preventing these threads from getting the CPU? > Yes, the XFS threads show up like this on each time the kernel crashed. So > far > with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some > issue > between 2.6.30.9 and 2.6.31.x when this began happening. Any recommendations > on how to catch this bug w/certain options enabled/etc? > > >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com >> > Uptime with 2.6.30.9: 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07 No issues yet, so it first started happening in 2.6.(31).(x). Any further recommendations on how to debug this issue? 
BTW: Do you view this as an XFS bug or MD/VFS layer issue based on the logs/output thus far? Justin.
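One low-cost option while waiting for a reproduction is the kernel's hung-task detector (CONFIG_DETECT_HUNG_TASK, available in kernels of this era), which reports tasks stuck in D state without manual sysrq intervention. A sketch; the 120-second threshold is an arbitrary example:

```shell
# Enable automatic reporting of tasks blocked in D state for too long.
# The knob only exists when CONFIG_DETECT_HUNG_TASK is compiled in.
knob=/proc/sys/kernel/hung_task_timeout_secs
if [ -w "$knob" ]; then
    echo 120 > "$knob"          # warn after 120 seconds in D state
else
    echo "hung_task detector not available or not running as root"
fi
```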
* mdadm --detail showing annoying device 2009-10-21 10:19 ` Justin Piszcz (?) @ 2009-10-21 14:17 ` Stephane Bunel 2009-10-21 21:46 ` Neil Brown -1 siblings, 1 reply; 49+ messages in thread From: Stephane Bunel @ 2009-10-21 14:17 UTC (permalink / raw) To: linux-raid

Hi,

I'm a newbie in the mdadm world. I defined some udev rules to make disks statically named according to the bus/host/target, i.e. /dev/sda becomes /dev/raid_disk0. Nothing very special with that; it's just a convenient way to tie a disk name to its physical location.

#ls -la /dev/raid*
brw-rw---- 1 root disk 8,  0 2009-10-16 18:12 /dev/raid_disk0
brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1

A RAID1 (/dev/md0) is assembled over these two disks. When looking for detailed information, mdadm shows an annoying device name in place of /dev/raid_disk*:

---8x---
#mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Tue Oct 13 12:53:54 2009
     Raid Level : raid1
     Array Size : 488386496 (465.76 GiB 500.11 GB)
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct 21 15:16:09 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
         Events : 0.1526

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/char/21:0
       1       8       16        1      active sync   /dev/char/21:1
---8x---

Looking in the source code of mdadm, I found that the device name selection rule is (too) simple: select the shortest name when there are multiple possibilities. So, '/dev/char/21:0' being shorter than '/dev/raid_disk0', mdadm displays '/dev/char...'. It's annoying for me to see a CHAR (/dev/char/...) name used to represent a hard disk (which is a block device, of course).

The purpose of the following patch is to take the directory part into account and always prefer the name that is closer to /dev. In case of equality, the shorter name wins.
---8x---
#./mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Tue Oct 13 12:53:54 2009
     Raid Level : raid1
     Array Size : 488386496 (465.76 GiB 500.11 GB)
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct 21 15:30:34 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
         Events : 0.1526

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/raid_disk0
       1       8       16        1      active sync   /dev/raid_disk1
---8x---

Feel free to refactor the code. The last time I wrote C code I still had hair on my head ;-)

---8x---
--- /var/tmp/mdadm_snapshot/util.c	2009-10-20 04:50:23.000000000 +0200
+++ /var/tmp/mdadm_dirlevel/util.c	2009-10-20 13:13:32.000000000 +0200
@@ -507,6 +507,38 @@
 #endif /* HAVE_FTW */
 #endif /* HAVE_NFTW */
 
+char *select_by_directory_level(char *current, char *registered)
+{
+	unsigned int current_level = 0;
+	unsigned int registered_level = 0;
+
+	unsigned int level(char *pathname)
+	{
+		unsigned int count = 0;
+		char *p = pathname;
+
+		while ( (p = strchr( p, '/' )) ) {
+			count++;
+			p++;
+		}
+
+		return( count );
+	}
+
+	current_level = level( current );
+	registered_level = level( registered );
+
+	if ( current_level < registered_level )
+		return current;
+
+	if ( current_level == registered_level )
+		if ( strlen( strrchr( current, '/')) <
+		     strlen( strrchr( registered,'/')) )
+			return current;
+
+	return registered;
+}
+
 /*
  * Find a block device with the right major/minor number.
  * If we find multiple names, choose the shortest.
@@ -544,14 +576,18 @@
 		if (p->major == major &&
 		    p->minor == minor) {
 			if (strncmp(p->name, "/dev/md/",8) == 0) {
-				if (preferred == NULL ||
-				    strlen(p->name) < strlen(preferred))
-					preferred = p->name;
-			} else {
-				if (regular == NULL ||
-				    strlen(p->name) < strlen(regular))
-					regular = p->name;
-			}
+				if (preferred == NULL)
+					preferred = p->name;
+				else
+					preferred = select_by_directory_level(
+							p->name, preferred );
+			} else {
+				if (regular == NULL )
+					regular = p->name;
+				else
+					regular = select_by_directory_level(
+							p->name, regular );
+			}
 		}
 		if (!regular && !preferred && !did_check) {
 			devlist_ready = 0;
---8x---

Stéphane Bunel.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: mdadm --detail showing annoying device
  2009-10-21 14:17 ` mdadm --detail showing annoying device Stephane Bunel
@ 2009-10-21 21:46   ` Neil Brown
  2009-10-22 11:22     ` Stephane Bunel
  2009-10-22 11:29     ` Mario 'BitKoenig' Holbe
  0 siblings, 2 replies; 49+ messages in thread
From: Neil Brown @ 2009-10-21 21:46 UTC (permalink / raw)
To: Stephane Bunel; +Cc: linux-raid

On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
> Hi,
>
> I'm a newbie in the mdadm world. I defined some udev rules to make disk
> staticly named according to the bus/host/target. I.e. /dev/sda become
> /dev/raid_disk0. So nothing very special with that, it's just a convenient way
> to assign disk name to it's physical location.
>
> #ls -la /dev/raid*
> brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0
> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
>
> A RAID1 (/dev/md0) is assembled over this two disk.
> When looking for detailed information, mdadm show annoying device name in
> place of /dev/raid_disk*:
....
> Number Major Minor RaidDevice State
> 0 8 0 0 active sync /dev/char/21:0
> 1 8 16 1 active sync /dev/char/21:1

What is a block device doing in /dev/char??? There should only be
character devices in there.

If these are actually block devices, then I think there is something
wrong with your udev rules.

If these are char devices, then mdadm is doing the wrong thing, but I
cannot see that from the code.

Your proposal of choosing the highest rather than the shortest name has
some merit, but your current situation doesn't seem to justify it, and I
particularly like the simplicity of the current heuristic. So for now I
think I'll leave it as it is and encourage you to fix your udev rule.

Thanks,
NeilBrown
* Re: mdadm --detail showing annoying device
  2009-10-21 21:46 ` Neil Brown
@ 2009-10-22 11:22   ` Stephane Bunel
  2009-10-29  3:44     ` Neil Brown
  2009-10-22 11:29   ` Mario 'BitKoenig' Holbe
  1 sibling, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 11:22 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>> Hi,
>>
>> I'm a newbie in the mdadm world. I defined some udev rules to make disk
>> staticly named according to the bus/host/target. I.e. /dev/sda become
>> /dev/raid_disk0. So nothing very special with that, it's just a convenient way
>> to assign disk name to it's physical location.
>>
>> #ls -la /dev/raid*
>> brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0
>> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
>>
>> A RAID1 (/dev/md0) is assembled over this two disk.
>> When looking for detailed information, mdadm show annoying device name in
>> place of /dev/raid_disk*:
>>
> ....
>> Number Major Minor RaidDevice State
>> 0 8 0 0 active sync /dev/char/21:0
>> 1 8 16 1 active sync /dev/char/21:1
>
> What is a block device doing in /dev/char ??? There should only be
> character devices in there.
>
> If these are actually block device, then I think there is something
> wrong with your udev rules.

I don't think my udev rules are at fault, because they just rename
/dev/sd* to /dev/raid_disk*. For udev, /dev/char/21:0 seems to correspond
to the SCSI generic device driver (sg), which is bound to SCSI devices.

> If these are char devices, then mdadm is doing the wrong thing, but I
> cannot see that from the code.

By choosing the shorter name without differentiating the path (/dev/.../)
from the name (sda), mdadm picks /dev/char/21:0 simply because it is
shorter than /dev/raid_disk0.
> Your proposal of choosing the highest rather than the shortest name
> has some merit, but I your current situation doesn't seem to justify
> it, and I particularly like the simplicity of the current heuristic.

My proposal doesn't choose the highest name; it makes a selection based on
the shortest path to the device name. I.e. my proposal chooses:

/dev/raid_disk0 (1 directory level) over /dev/char/21:0 (2 directory levels)
/dev/raid_disk0 (1 directory level) over /dev/block/8:0 (2 directory levels)
/dev/sda1 (1 directory level) over /dev/disk/by-label/BOOT (3 directory levels)

So in fact my proposal doesn't change the current situation, but "adjusts"
the heuristic to avoid showing an annoying device name (a char device) as a
member of the raid just because its full name is shorter than a
"semantically better" name.

Adding a printf in map_dev() shows that my proposal helps the heuristic be
more robust with "advanced" device naming, without changing current
behavior.

#./mdadm --detail /dev/md0
(/dev/char/21:0)
(/dev/block/8:0)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
(/dev/raid_disk0)
(/dev/char/21:1)
(/dev/block/8:16)
(/dev/raid_disk1)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
/dev/md0:
        Version : 0.90
  Creation Time : Tue Oct 13 12:53:54 2009
     Raid Level : raid1
     Array Size : 488386496 (465.76 GiB 500.11 GB)
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Oct 22 12:42:44 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
         Events : 0.1526

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync(/dev/char/21:0)
(/dev/block/8:0)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
(/dev/raid_disk0)
   /dev/raid_disk0
       1       8       16        1      active sync(/dev/char/21:1)
(/dev/block/8:16)
(/dev/raid_disk1)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
   /dev/raid_disk1

Thanks,
Stéphane Bunel.
* Re: mdadm --detail showing annoying device 2009-10-22 11:22 ` Stephane Bunel @ 2009-10-29 3:44 ` Neil Brown 2009-11-03 9:37 ` Stephane Bunel 0 siblings, 1 reply; 49+ messages in thread From: Neil Brown @ 2009-10-29 3:44 UTC (permalink / raw) To: Stephane Bunel; +Cc: linux-raid On Thursday October 22, stephane.bunel@forumdesimages.fr wrote: > Neil Brown a écrit : > > On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote: > >> Hi, > >> > >> I'm a newbie in the mdadm world. I defined some udev rules to make disk > >> staticly named according to the bus/host/target. I.e. /dev/sda become > >> /dev/raid_disk0. So nothing very special with that, it's just a convenient way > >> to assign disk name to it's physical location. > >> > >> #ls -la /dev/raid* > >> brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0 > >> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1 > >> > >> A RAID1 (/dev/md0) is assembled over this two disk. > >> When looking for detailed information, mdadm show annoying device name in > >> place of /dev/raid_disk*: > >> > > .... > >> Number Major Minor RaidDevice State > >> 0 8 0 0 active sync /dev/char/21:0 > >> 1 8 16 1 active sync /dev/char/21:1 > > > > What is a block device doing in /dev/char ??? There should only be > > character devices in there. > > > > If these are actually block device, then I think there is something > > wrong with your udev rules. > > I think my udev rules are not in cause because they just change /dev/sd* to > /dev/raid_disk*. For udev /dev/char/21:0 seems correspond to the generic scsi > device driver (sg) wich is binded to scsi devices. That doesn't answer the question of why a block device is appearing in /dev/char/. My guess (which is quite possibly wrong, but it is the best I can do) is that whatever change to udev.rules that you made to get /dev/sdXX to be renamed to /dev/raid_diskXX, also renamed the scsi-generic devices to be /dev/raid_diskXX. 
I think that would have the effect that you are seeing. > > > If these are char devices, then mdadm is doing the wrong thing, but I > > cannot see that from the code. > > mdadm by choosing the shorter name without differentiate path (/dev/.../ ) > and name (sda) choose /dev/char/21:0 just because it is shorter than > /dev/raid_disk0. > > > Your proposal of choosing the highest rather than the shortest name > > has some merit, but I your current situation doesn't seem to justify > > it, and I particularly like the simplicity of the current heuristic. > > My proposal doesn't choose the highest name but does a selection based on > the shortest path to the device name. I.e. my proposal choose: > > /dev/raid_disk0 (1 directory level) over /dev/char/21:0 (2 directory level) > /dev/raid_disk0 (1 directory level) over /dev/block/8:0 (2 directory level) > /dev/sda1 (1 directory level) over /dev/disk/by-label/BOOT (3 directory level) This is what I meant by 'highest'. highest up in the directory tree - closest to the root (I guess some people might think that being close to the root is 'low', not 'high :-) > > So in fact, my proposal doesn't change current situation but "adjust" the > heuristic to avoid seeing an annoying device name (char device) as a member of > raid just because it's fullname is shorter than a "semantically better" name. I appreciate that. I am not against changing the heuristic if I find a good reason ... particularly if I can find something that actually measures "semantic goodness"! But I don't think you have provided a good reason. /dev/char/21:0 should be a char device, not a block device. So mdadm should ignore it. On your system, /dev/char/21:0 is a block device (or a link to a block device) so there is clearly some sort of configuration error. If you still cannot find it, maybe you could show us the change you made to udev.rules, and an 'ls -l' of '/dev/char'. That might help shed some light on your situation. 
NeilBrown > > Add a printf in map_dev() show that my proposal seems help the heuristic to > be more robust in case of "advanced" device naming, without changing current > things. > > > #./mdadm --detail /dev/md0 > (/dev/char/21:0) > (/dev/block/8:0) > (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA) > (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA) > (/dev/raid_disk0) > (/dev/char/21:1) > (/dev/block/8:16) > (/dev/raid_disk1) > (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA) > (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA) > /dev/md0: > Version : 0.90 > Creation Time : Tue Oct 13 12:53:54 2009 > Raid Level : raid1 > Array Size : 488386496 (465.76 GiB 500.11 GB) > Used Dev Size : 488386496 (465.76 GiB 500.11 GB) > Raid Devices : 2 > Total Devices : 2 > Preferred Minor : 0 > Persistence : Superblock is persistent > > Update Time : Thu Oct 22 12:42:44 2009 > State : clean > Active Devices : 2 > Working Devices : 2 > Failed Devices : 0 > Spare Devices : 0 > > UUID : 3fea95a0:e1b8a341:3b119117:e416f62b > Events : 0.1526 > > Number Major Minor RaidDevice State > 0 8 0 0 active sync(/dev/char/21:0) > (/dev/block/8:0) > (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA) > (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA) > (/dev/raid_disk0) > /dev/raid_disk0 > 1 8 16 1 active sync(/dev/char/21:1) > (/dev/block/8:16) > (/dev/raid_disk1) > (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA) > (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA) > /dev/raid_disk1 > > > > Thanks, > Stéphane Bunel. > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: mdadm --detail showing annoying device
  2009-10-29  3:44 ` Neil Brown
@ 2009-11-03  9:37   ` Stephane Bunel
  2009-11-03 10:09     ` Beolach
  0 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-11-03 9:37 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
(...)
> On your system, /dev/char/21:0 is a block device (or a link to a block
> device) so there is clearly some sort of configuration error.

All files in /dev/char are symlinks (see below). The rules come from
Gentoo.

> If you still cannot find it, maybe you could show us the change you
> made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
> shed some light on your situation.

Considering mdadm is only concerned with "real" block device files, why
not just skip symlinks?

o Udev rules used to rename /dev/sd[ab]:

#cat 65-persistent-block.rules
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0", NAME="raid_disk0"
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0", NAME="raid_disk1"

o System is Gentoo.

o Content of /dev/char:

#ls -la /dev/char/
total 0
drwxr-xr-x  2 root root 2880 2009-11-02 16:26 .
drwxr-xr-x 16 root root 3480 2009-11-02 16:26 ..
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 10:1 -> ../psaux lrwxrwxrwx 1 root root 9 2009-10-16 18:12 10:227 -> ../mcelog lrwxrwxrwx 1 root root 7 2009-10-16 18:12 10:229 -> ../fuse lrwxrwxrwx 1 root root 11 2009-10-16 18:12 10:231 -> ../snapshot lrwxrwxrwx 1 root root 13 2009-10-16 18:12 10:252 -> ../dac960_gam lrwxrwxrwx 1 root root 11 2009-10-16 18:12 10:58 -> ../megadev0 lrwxrwxrwx 1 root root 6 2009-10-16 18:12 10:59 -> ../tgt lrwxrwxrwx 1 root root 21 2009-10-16 18:12 10:60 -> ../network_throughput lrwxrwxrwx 1 root root 18 2009-10-16 18:12 10:61 -> ../network_latency lrwxrwxrwx 1 root root 18 2009-10-16 18:12 10:62 -> ../cpu_dma_latency lrwxrwxrwx 1 root root 17 2009-10-16 18:12 10:63 -> ../mapper/control lrwxrwxrwx 1 root root 6 2009-10-16 18:12 1:1 -> ../mem lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:11 -> ../kmsg lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:2 -> ../kmem lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:3 -> ../null lrwxrwxrwx 1 root root 12 2009-10-16 18:12 13:0 -> ../input/js0 lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:32 -> ../input/mouse0 lrwxrwxrwx 1 root root 13 2009-10-16 18:12 13:63 -> ../input/mice lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:64 -> ../input/event0 lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:65 -> ../input/event1 lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:67 -> ../input/event3 lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:68 -> ../input/event4 lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:69 -> ../input/event5 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:4 -> ../port lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:5 -> ../zero lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:7 -> ../full lrwxrwxrwx 1 root root 9 2009-10-16 18:12 1:8 -> ../random lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:0 -> ../bus/usb/001/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:128 -> ../bus/usb/002/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:129 -> ../bus/usb/002/002 lrwxrwxrwx 1 root root 18 2009-11-02 16:25 189:133 -> 
../bus/usb/002/006 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:256 -> ../bus/usb/003/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:384 -> ../bus/usb/004/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:385 -> ../bus/usb/004/002 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:512 -> ../bus/usb/005/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:640 -> ../bus/usb/006/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:768 -> ../bus/usb/007/001 lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:896 -> ../bus/usb/008/001 lrwxrwxrwx 1 root root 10 2009-10-16 18:12 1:9 -> ../urandom lrwxrwxrwx 1 root root 13 2009-10-16 18:12 21:0 -> ../raid_disk0 lrwxrwxrwx 1 root root 13 2009-10-16 18:12 21:1 -> ../raid_disk1 lrwxrwxrwx 1 root root 6 2009-11-02 16:26 21:2 -> ../sg2 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 248:0 -> ../rtc0 lrwxrwxrwx 1 root root 10 2009-10-16 18:12 254:0 -> ../hidraw0 lrwxrwxrwx 1 root root 10 2009-10-16 18:12 254:1 -> ../hidraw1 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:0 -> ../tty0 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:1 -> ../tty1 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:10 -> ../tty10 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:11 -> ../tty11 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:12 -> ../tty12 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:13 -> ../tty13 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:14 -> ../tty14 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:15 -> ../tty15 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:16 -> ../tty16 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:17 -> ../tty17 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:18 -> ../tty18 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:19 -> ../tty19 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:2 -> ../tty2 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:20 -> ../tty20 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:21 -> ../tty21 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:22 -> ../tty22 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:23 -> ../tty23 lrwxrwxrwx 1 root root 8 2009-10-16 
18:12 4:24 -> ../tty24 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:25 -> ../tty25 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:26 -> ../tty26 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:27 -> ../tty27 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:28 -> ../tty28 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:29 -> ../tty29 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:3 -> ../tty3 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:30 -> ../tty30 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:31 -> ../tty31 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:32 -> ../tty32 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:33 -> ../tty33 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:34 -> ../tty34 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:35 -> ../tty35 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:36 -> ../tty36 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:37 -> ../tty37 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:38 -> ../tty38 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:39 -> ../tty39 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:4 -> ../tty4 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:40 -> ../tty40 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:41 -> ../tty41 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:42 -> ../tty42 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:43 -> ../tty43 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:44 -> ../tty44 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:45 -> ../tty45 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:46 -> ../tty46 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:47 -> ../tty47 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:48 -> ../tty48 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:49 -> ../tty49 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:5 -> ../tty5 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:50 -> ../tty50 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:51 -> ../tty51 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:52 -> ../tty52 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:53 -> ../tty53 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:54 -> ../tty54 lrwxrwxrwx 1 root root 8 2009-10-16 
18:12 4:55 -> ../tty55 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:56 -> ../tty56 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:57 -> ../tty57 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:58 -> ../tty58 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:59 -> ../tty59 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:6 -> ../tty6 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:60 -> ../tty60 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:61 -> ../tty61 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:62 -> ../tty62 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:63 -> ../tty63 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:64 -> ../ttyS0 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:65 -> ../ttyS1 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:66 -> ../ttyS2 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:67 -> ../ttyS3 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:7 -> ../tty7 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:8 -> ../tty8 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:9 -> ../tty9 lrwxrwxrwx 1 root root 6 2009-10-16 18:12 5:0 -> ../tty lrwxrwxrwx 1 root root 10 2009-10-16 18:12 5:1 -> ../console lrwxrwxrwx 1 root root 7 2009-10-16 18:12 5:2 -> ../ptmx lrwxrwxrwx 1 root root 6 2009-10-16 18:12 7:0 -> ../vcs lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:1 -> ../vcs1 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:10 -> ../vcs10 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:11 -> ../vcs11 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:12 -> ../vcs12 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:128 -> ../vcsa lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:129 -> ../vcsa1 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:130 -> ../vcsa2 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:131 -> ../vcsa3 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:132 -> ../vcsa4 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:133 -> ../vcsa5 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:134 -> ../vcsa6 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:135 -> ../vcsa7 lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:136 -> ../vcsa8 lrwxrwxrwx 1 root root 8 2009-10-16 
18:12 7:137 -> ../vcsa9 lrwxrwxrwx 1 root root 9 2009-10-16 18:12 7:138 -> ../vcsa10 lrwxrwxrwx 1 root root 9 2009-10-16 18:12 7:139 -> ../vcsa11 lrwxrwxrwx 1 root root 9 2009-10-16 18:12 7:140 -> ../vcsa12 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:2 -> ../vcs2 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:3 -> ../vcs3 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:4 -> ../vcs4 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:5 -> ../vcs5 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:6 -> ../vcs6 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:7 -> ../vcs7 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:8 -> ../vcs8 lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:9 -> ../vcs9 Stéphane Bunel. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: mdadm --detail showing annoying device
  2009-11-03  9:37 ` Stephane Bunel
@ 2009-11-03 10:09   ` Beolach
  2009-11-03 12:16     ` Stephane Bunel
  0 siblings, 1 reply; 49+ messages in thread
From: Beolach @ 2009-11-03 10:09 UTC (permalink / raw)
To: Stephane Bunel; +Cc: Neil Brown, linux-raid

On Tue, Nov 3, 2009 at 02:37, Stephane Bunel
<stephane.bunel@forumdesimages.fr> wrote:
> Neil Brown a écrit :
> (...)
>
>> On your system, /dev/char/21:0 is a block device (or a link to a block
>> device) so there is clearly some sort of configuration error.
>
> All files in /dev/char are symlinks (see below). Rules are comming from
> Gentoo.
>
>> If you still cannot find it, maybe you could show us the change you
>> made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
>> shed some light on your situation.
>
> Considering mdadm is only involved by "real" block device file, why not just
> skipping symlink ?
>
> o Udev rules used to rename /dev/sd[ab]:
>
> #cat 65-persistent-block.rules
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
> NAME="raid_disk0"
>
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
> NAME="raid_disk1"
>

Try prepending SUBSYSTEM=="block" to those, so they'll only match
the (block) sd* devices, and not the (char) sg? devices:

SUBSYSTEM=="block", ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0", NAME="raid_disk0"
SUBSYSTEM=="block", ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0", NAME="raid_disk1"

Good luck,
Conway S. Smith
* Re: mdadm --detail showing annoying device
  2009-11-03 10:09 ` Beolach
@ 2009-11-03 12:16   ` Stephane Bunel
  0 siblings, 0 replies; 49+ messages in thread
From: Stephane Bunel @ 2009-11-03 12:16 UTC (permalink / raw)
To: Beolach; +Cc: Neil Brown, linux-raid

Beolach wrote:
> On Tue, Nov 3, 2009 at 02:37, Stephane Bunel
> <stephane.bunel@forumdesimages.fr> wrote:
>> Neil Brown a écrit :
>> (...)
>>
>>> On your system, /dev/char/21:0 is a block device (or a link to a block
>>> device) so there is clearly some sort of configuration error.
>> All files in /dev/char are symlinks (see below). Rules are comming from
>> Gentoo.
>>
>>> If you still cannot find it, maybe you could show us the change you
>>> made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
>>> shed some light on your situation.
>> Considering mdadm is only involved by "real" block device file, why not just
>> skipping symlink ?
>>
>> o Udev rules used to rename /dev/sd[ab]:
>>
>> #cat 65-persistent-block.rules
>> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
>> NAME="raid_disk0"
>>
>> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
>> NAME="raid_disk1"
>>
>
> Try prepending SUBSYSTEM=="block" to those, so they they'll only match
> the (block) sd* devices, and not the (char) sg? devices:
> SUBSYSTEM=="block",
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
> NAME="raid_disk0"
> SUBSYSTEM=="block",
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
> NAME="raid_disk1"

Well done!! That corrects the sg* link issue.

#mdadm --detail /dev/md0
(...)
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/block/8:0
       1       8       16        1      active sync   /dev/block/8:16
(...)

But we loop back to the original problem: mdadm shows '/dev/block/8:0'
instead of '/dev/raid_disk0' (as I wish), because the symlink
'/dev/block/8:0' is shorter than the real block file name
'/dev/raid_disk0'.
The current heuristic prevents real block devices from being renamed as
desired, under penalty of seeing symlinks from /dev/block/* instead. Maybe
mdadm could change this behavior by:

o always preferring the real block file over symlinks;
o simply skipping symlinks;
o changing the heuristic as proposed by my patch (prefer the name that is
  closest to /dev).

Stéphane Bunel.
* Re: mdadm --detail showing annoying device
  2009-10-21 21:46 ` Neil Brown
  2009-10-22 11:22   ` Stephane Bunel
@ 2009-10-22 11:29   ` Mario 'BitKoenig' Holbe
  2009-10-22 14:17     ` Stephane Bunel
  1 sibling, 1 reply; 49+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-10-22 11:29 UTC (permalink / raw)
To: linux-raid

Neil Brown <neilb@suse.de> wrote:
> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>> 0 8 0 0 active sync /dev/char/21:0
>> 1 8 16 1 active sync /dev/char/21:1
> What is a block device doing in /dev/char ??? There should only be
> character devices in there.

Major 21 devices are usually SCSI generic devices (/dev/sg), and they are char...

crw-rw---- 1 root root 21, 0 Oct 10 21:48 /dev/sg0
crw-rw---- 1 root root 21, 1 Oct 10 21:48 /dev/sg1

The question is why they appear in mdadm --detail at all.

regards
   Mario
--
There are trivial truths and the great truths. The opposite of a trivial
truth is plainly false. The opposite of a great truth is also true.
                                                          -- Niels Bohr
* Re: mdadm --detail showing annoying device
  2009-10-22 11:29 ` Mario 'BitKoenig' Holbe
@ 2009-10-22 14:17   ` Stephane Bunel
  2009-10-22 16:00     ` Stephane Bunel
  0 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 14:17 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

Mario 'BitKoenig' Holbe wrote:
> Neil Brown <neilb@suse.de> wrote:
>> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>> 0 8 0 0 active sync /dev/char/21:0
>>> 1 8 16 1 active sync /dev/char/21:1
>> What is a block device doing in /dev/char ??? There should only be
>> character devices in there.
>
> major 21 are usually SCSI generic devices (/dev/sg) and they are char...
> crw-rw---- 1 root root 21, 0 Oct 10 21:48 /dev/sg0
> crw-rw---- 1 root root 21, 1 Oct 10 21:48 /dev/sg1
>
> The question is, why do they appear at mdadm --detail

mdadm performs a physical walk so as not to follow symbolic links (cf.
nftw(FTW_PHYS) in map_dev()). But by then using stat(), mdadm follows the
symbolic link after all and so gets back the same type/major/minor as the
link's target.

lstat() is identical to stat(), except that if path is a symbolic link,
then the link itself is stat-ed, not the file that it refers to.

#ls -la /dev/char/21:0
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 /dev/char/21:0 -> ../raid_disk0

Tested from Python:
>>> import os, stat

Using stat():
>>> mode = os.stat( '/dev/char/21:0' )[ stat.ST_MODE ]
>>> stat.S_ISBLK( mode )
True

Using lstat():
>>> mode = os.lstat( '/dev/char/21:0' )[ stat.ST_MODE ]
>>> stat.S_ISBLK( mode )
False

Stéphane Bunel.
* Re: mdadm --detail showing annoying device
  2009-10-22 14:17 ` Stephane Bunel
@ 2009-10-22 16:00   ` Stephane Bunel
  0 siblings, 0 replies; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 16:00 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

Stephane Bunel wrote:
> Mario 'BitKoenig' Holbe a écrit :
>> Neil Brown <neilb@suse.de> wrote:
>>> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>>> 0 8 0 0 active sync /dev/char/21:0
>>>> 1 8 16 1 active sync /dev/char/21:1
>>> What is a block device doing in /dev/char ??? There should only be
>>> character devices in there.
>>
>> major 21 are usually SCSI generic devices (/dev/sg) and they are char...
>> crw-rw---- 1 root root 21, 0 Oct 10 21:48 /dev/sg0
>> crw-rw---- 1 root root 21, 1 Oct 10 21:48 /dev/sg1
>>
>> The question is, why do they appear at mdadm --detail
>
> mdadm performs a physical walk to not follow symbolic links (cf nftw(
> FTW_PHYS ) in map_dev() ). But using stat() mdadm finaly follow the
> symbolic link and so returns the same type/major/minor as the targeted
> link.
>
> lstat() is identical to stat(), except that if path is a symbolic
> link, then the link itself is stat-ed, not the file that it refers to.
>
> #ls -la /dev/char/21:0
> lrwxrwxrwx 1 root root 13 2009-10-16 18:12 /dev/char/21:0 -> ../raid_disk0
>
> Tested from Python:
> >>> import os, stat
>
> Using stat:
> >>> mode = os.stat( '/dev/char/21:0' )[ stat.ST_MODE ]
> >>> stat.S_ISBLK( mode )
> True
>
> using lstat():
> >>> mode = os.lstat( '/dev/char/21:0' )[ stat.ST_MODE ]
> >>> stat.S_ISBLK( mode )
> False

Just for fun ;-)

--- util.c.orig	2009-10-22 17:54:11.000000000 +0200
+++ util.c	2009-10-22 17:55:09.000000000 +0200
@@ -468,7 +468,7 @@
 	struct stat st;
 
 	if (S_ISLNK(stb->st_mode)) {
-		if (stat(name, &st) != 0)
+		if (lstat(name, &st) != 0)
 			return 0;
 		stb = &st;
 	}

Stéphane Bunel.
^ permalink raw reply [flat|nested] 49+ messages in thread
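The stat()/lstat() distinction behind the one-line util.c patch above can be checked with a small self-contained script. This is an illustrative sketch only, not mdadm's code: it uses a regular file as the symlink target (creating a real block-device node would require root), and all file names are made up for the demo.

```python
import os
import stat
import tempfile

# Scratch directory; names mimic the /dev/char layout purely for illustration.
d = tempfile.mkdtemp()
target = os.path.join(d, "raid_disk0")
link = os.path.join(d, "21-0")

open(target, "w").close()   # stand-in for the real device node
os.symlink(target, link)

# stat() follows the symlink, so it reports the *target's* file type...
print(stat.S_ISLNK(os.stat(link).st_mode))    # False: we see the target
# ...while lstat() reports on the link itself.
print(stat.S_ISLNK(os.lstat(link).st_mode))   # True: we see the link
```

This is exactly why swapping stat() for lstat() in map_dev()'s helper changes what --detail prints: with stat() the symlink's target (the char device) is reported, with lstat() the link entry itself.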
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-21 10:19 ` Justin Piszcz @ 2009-10-22 22:49 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-10-22 22:49 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Wed, 21 Oct 2009, Justin Piszcz wrote: > > > On Tue, 20 Oct 2009, Justin Piszcz wrote: > > >> >> >> On Tue, 20 Oct 2009, Dave Chinner wrote: >> >>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote: >>>> On Mon, 19 Oct 2009, Dave Chinner wrote: >>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote: >>>>>> It has happened again, all sysrq-X output was saved this time. >>>>> ..... >>>>> >>>>> All pointing to log IO not completing. >>>>> >>> .... >>>> So far I do not have a reproducible test case, >>> >>> Ok. What sort of load is being placed on the machine? >> Hello, generally the load is low, it mainly serves out some samba shares. >> >>> >>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in >>> the running state but don't appear to be consuming any significant >>> CPU time. If they remain like this then I think that means they are >>> stuck waiting on the run queue. Do these XFS threads always appear >>> like this when the hang occurs? If so, is there something else that >>> is hogging CPU 0 preventing these threads from getting the CPU? >> Yes, the XFS threads show up like this on each time the kernel crashed. So >> far >> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some >> issue >> between 2.6.30.9 and 2.6.31.x when this began happening. Any >> recommendations >> on how to catch this bug w/certain options enabled/etc? >> >> >>> >>> Cheers, >>> >>> Dave. 
>>> -- >>> Dave Chinner >>> david@fromorbit.com >>> >> > > Uptime with 2.6.30.9: > > 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07 > > No issues yet, so it first started happening in 2.6.(31).(x). > > Any further recommendations on how to debug this issue? BTW: Do you view > this > as an XFS bug or MD/VFS layer issue based on the logs/output thus far? > > Justin. > > Any other ideas? Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no load at all either.. Has anyone else reported similar problems? Justin. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-22 22:49 ` Justin Piszcz @ 2009-10-22 23:00 ` Dave Chinner -1 siblings, 0 replies; 49+ messages in thread From: Dave Chinner @ 2009-10-22 23:00 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Thu, Oct 22, 2009 at 06:49:46PM -0400, Justin Piszcz wrote: > On Wed, 21 Oct 2009, Justin Piszcz wrote: >> On Tue, 20 Oct 2009, Justin Piszcz wrote: >>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in >>>> the running state but don't appear to be consuming any significant >>>> CPU time. If they remain like this then I think that means they are >>>> stuck waiting on the run queue. Do these XFS threads always appear >>>> like this when the hang occurs? If so, is there something else that >>>> is hogging CPU 0 preventing these threads from getting the CPU? >>> Yes, the XFS threads show up like this on each time the kernel >>> crashed. So far >>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be >>> some issue >>> between 2.6.30.9 and 2.6.31.x when this began happening. Any >>> recommendations >>> on how to catch this bug w/certain options enabled/etc? >> >> Uptime with 2.6.30.9: >> >> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07 >> >> No issues yet, so it first started happening in 2.6.(31).(x). Ok. >> Any further recommendations on how to debug this issue? BTW: Do >> you view this as an XFS bug or MD/VFS layer issue based on the >> logs/output thus far? Could be either. Nothing so far points at a cause. > Any other ideas? If it is relatively quick to reproduce, you could run a git bisect to try to find the offending commit. Or when it has locked up, run oprofile with callgraph sampling and so we can get an idea of what is actually running when XFS appears to hang. > Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has > no load at all either.. 
Has anyone else reported similar problems? Not that I know of. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-22 22:49 ` Justin Piszcz @ 2009-10-26 11:24 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-10-26 11:24 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Thu, 22 Oct 2009, Justin Piszcz wrote: > > Any other ideas? > > Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no > load at all either.. Has anyone else reported similar problems? > > Justin. > -- Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2 later today and wait another 48 hours. Justin. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) 2009-10-26 11:24 ` Justin Piszcz @ 2009-11-02 21:46 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-11-02 21:46 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz On Mon, 26 Oct 2009, Justin Piszcz wrote: > On Thu, 22 Oct 2009, Justin Piszcz wrote: >> Any other ideas? >> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no >> load at all either.. Has anyone else reported similar problems? >> Justin. > > Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2 > later today and wait another 48 hours. > > Justin.

Kernel report:
2.6.31-rc1: no crash - uptime 2+ days
2.6.31-rc2: no crash - uptime 2+ days
2.6.31-rc3: no crash - uptime 2+ days
2.6.31-rc4: no crash but network kept dropping out
2.6.31-rc5: cannot test, service owner needs host available
2.6.31-rc6: cannot test, service owner needs host available
2.6.31-rc7: cannot test, service owner needs host available
2.6.31-rc8: cannot test, service owner needs host available
2.6.31-rc9: cannot test, service owner needs host available
2.6.31.x: locks up D-state

It would be somewhere between -rc4 and 2.6.31.x. Justin. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk 2009-10-21 10:19 ` Justin Piszcz @ 2009-11-20 20:39 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-11-20 20:39 UTC (permalink / raw) To: Dave Chinner Cc: linux-kernel, linux-raid, xfs, Alan Piszcz, asterisk-users, submit Package: asterisk Version: 1.6.2.0~dfsg~rc1-1 See below for issue: On Wed, 21 Oct 2009, Justin Piszcz wrote: > > > On Tue, 20 Oct 2009, Justin Piszcz wrote: > > >> >> >> On Tue, 20 Oct 2009, Dave Chinner wrote: >> >>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote: >>>> On Mon, 19 Oct 2009, Dave Chinner wrote: >>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote: >>>>>> It has happened again, all sysrq-X output was saved this time. >>>>> ..... >>>>> >>>>> All pointing to log IO not completing. >>>>> >>> .... >>>> So far I do not have a reproducible test case, >>> >>> Ok. What sort of load is being placed on the machine? >> Hello, generally the load is low, it mainly serves out some samba shares. >> >>> >>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in >>> the running state but don't appear to be consuming any significant >>> CPU time. If they remain like this then I think that means they are >>> stuck waiting on the run queue. Do these XFS threads always appear >>> like this when the hang occurs? If so, is there something else that >>> is hogging CPU 0 preventing these threads from getting the CPU? >> Yes, the XFS threads show up like this on each time the kernel crashed. So >> far >> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some >> issue >> between 2.6.30.9 and 2.6.31.x when this began happening. Any >> recommendations >> on how to catch this bug w/certain options enabled/etc? >> >> >>> >>> Cheers, >>> >>> Dave. 
>>> -- >>> Dave Chinner >>> david@fromorbit.com > > Uptime with 2.6.30.9: > > 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07 > > No issues yet, so it first started happening in 2.6.(31).(x). > > Any further recommendations on how to debug this issue? BTW: Do you view this > as an XFS bug or MD/VFS layer issue based on the logs/output thus far? > > Justin.

Found the root cause: it is the asterisk PBX software. I use an SPA3102. When someone called me, they accidentally dropped the connection, and I called them back shortly afterwards. It was during this window (both this time and the last time this happened) that the box froze, under multiple(!) kernels, always while someone was calling. I have removed asterisk; this is the version I was running:

~$ dpkg -l | grep -i asterisk
rc asterisk 1:1.6.2.0~dfsg~rc1-1 Open S

I don't know what asterisk was doing, but top did run before the crash: asterisk was using 100% CPU and, as I noted before, all other processes were in D-state. When this bug occurs, it freezes I/O to all devices, and the only way to recover is to reboot the system. Just FYI in case anyone else out there has their system crash when running asterisk. Out of curiosity, has anyone else running asterisk had such an issue? I was not running any special VoIP PCI cards/etc. Justin. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk 2009-11-20 20:39 ` Justin Piszcz (?) @ 2009-11-20 23:44 ` Faidon Liambotis -1 siblings, 0 replies; 49+ messages in thread From: Faidon Liambotis @ 2009-11-20 23:44 UTC (permalink / raw) To: Justin Piszcz, 557262 Cc: linux-raid, Dave Chinner, linux-kernel, xfs, submit, asterisk-users, Alan Piszcz Justin Piszcz wrote: > Found root cause-- root cause is asterisk PBX software. I use an SPA3102. > When someone called me, they accidentally dropped the connection, I called > them back in a short period. It is during this time (and the last time) > this happened that the box froze under multiple(!) kernels, always when > someone was calling. <snip> > I don't know what asterisk is doing but top did run before the crash > and asterisk was using 100% CPU and as I noted before all other processes > were in D-state. > > When this bug occurs, it freezes I/O to all devices and the only way to recover > is to reboot the system.

That's obviously *not* the root cause. It's not normal for an application that isn't even privileged to hang all I/O and, subsequently, everything on the system. This is almost certainly a kernel issue; asterisk just does something that triggers the bug.

Regards, Faidon _______________________________________________ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- asterisk-users mailing list To UNSUBSCRIBE or update options visit: http://lists.digium.com/mailman/listinfo/asterisk-users ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk 2009-11-20 23:44 ` Faidon Liambotis @ 2009-11-20 23:51 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-11-20 23:51 UTC (permalink / raw) To: Faidon Liambotis Cc: 557262, Dave Chinner, submit, linux-kernel, xfs, linux-raid, asterisk-users, Alan Piszcz On Sat, 21 Nov 2009, Faidon Liambotis wrote: > Justin Piszcz wrote: >> Found root cause-- root cause is asterisk PBX software. I use an SPA3102. >> When someone called me, they accidentally dropped the connection, I called >> them back in a short period. It is during this time (and the last time) >> this happened that the box froze under multiple(!) kernels, always when >> someone was calling. > <snip> >> I don't know what asterisk is doing but top did run before the crash >> and asterisk was using 100% CPU and as I noted before all other processes >> were in D-state. >> >> When this bug occurs, it freezes I/O to all devices and the only way to recover >> is to reboot the system. > That's obviously *not* the root cause. > > It's not normal for an application that isn't even privileged to hang > all I/O and, subsequently everything on a system. > > This is almost probably a kernel issue and asterisk just does something > that triggers this bug. > > Regards, > Faidon

It is possible, although I tried with several kernels (2.6.30.[0-9] & 2.6.31+) and never had a crash with the earlier versions. I installed asterisk long ago, but it was always 1.4.x until recently.. Nasty bug :\ Justin. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk 2009-11-20 23:44 ` Faidon Liambotis @ 2009-11-21 14:29 ` Roger Heflin -1 siblings, 0 replies; 49+ messages in thread From: Roger Heflin @ 2009-11-21 14:29 UTC (permalink / raw) To: Faidon Liambotis Cc: Justin Piszcz, 557262, Dave Chinner, submit, linux-kernel, xfs, linux-raid, asterisk-users, Alan Piszcz Faidon Liambotis wrote: > Justin Piszcz wrote: >> Found root cause-- root cause is asterisk PBX software. I use an SPA3102. >> When someone called me, they accidentally dropped the connection, I called >> them back in a short period. It is during this time (and the last time) >> this happened that the box froze under multiple(!) kernels, always when >> someone was calling. > <snip> >> I don't know what asterisk is doing but top did run before the crash >> and asterisk was using 100% CPU and as I noted before all other processes >> were in D-state. >> >> When this bug occurs, it freezes I/O to all devices and the only way to recover >> is to reboot the system. > That's obviously *not* the root cause. > > It's not normal for an application that isn't even privileged to hang > all I/O and, subsequently everything on a system. > > This is almost probably a kernel issue and asterisk just does something > that triggers this bug. > > Regards, > Faidon

I had an application in 2.6.5 (SLES9) that would hang XFS. The underlying application was multi-threaded and both threads were doing full disk syncs every so often, and sometimes when doing the full disk sync the XFS subsystem would deadlock; it appeared to me that one sync had a lock and was waiting for another, and the other process had the second lock and was waiting for the first... We were able to disable the full disk sync from the application and the deadlock went away. All non-xfs filesystems still worked and could still be accessed.

I did report the bug with some traces but I don't believe anyone ever determined where the underlying issue was. ^ permalink raw reply [flat|nested] 49+ messages in thread
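The lock-order inversion Roger describes (each thread holding one lock while waiting for the other's) is the classic ABBA deadlock pattern. A minimal userspace sketch of the pattern, purely illustrative and unrelated to the actual XFS code paths, is shown below; it uses non-blocking acquires so the demo itself cannot hang:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker(first, second, name, results, barrier):
    with first:
        barrier.wait()   # both threads now hold their first lock
        # In a real ABBA deadlock this acquire would block forever;
        # non-blocking mode just reports the failure instead.
        got = second.acquire(blocking=False)
        results[name] = got
        if got:
            second.release()
        barrier.wait()   # keep holding 'first' until both threads have tried

results = {}
barrier = threading.Barrier(2)
# t1 takes A then B; t2 takes B then A -- opposite lock orders.
t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "t1", results, barrier))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "t2", results, barrier))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)   # both acquisitions fail: neither side can make progress
```

The usual cures are the ones that worked here: impose a single global lock order, or remove one of the competing paths (as disabling the application's full-disk sync did).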
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk @ 2009-11-21 14:29 ` Roger Heflin 0 siblings, 0 replies; 49+ messages in thread From: Roger Heflin @ 2009-11-21 14:29 UTC (permalink / raw) To: Faidon Liambotis Cc: linux-raid, 557262, linux-kernel, xfs, submit, asterisk-users, Justin Piszcz, Alan Piszcz Faidon Liambotis wrote: > Justin Piszcz wrote: > > Found root cause-- root cause is asterisk PBX software. I use an > SPA3102. >> When someone called me, they accidentally dropped the connection, I called >> them back in a short period. It is during this time (and the last time) >> this happened that the box froze under multiple(!) kernels, always when >> someone was calling. > <snip> >> I don't know what asterisk is doing but top did run before the crash >> and asterisk was using 100% CPU and as I noted before all other processes >> were in D-state. >> >> When this bug occurs, it freezes I/O to all devices and the only way to >> recover >> is to reboot the system. > That's obviously *not* the root cause. > > It's not normal for an application that isn't even privileged to hang > all I/O and, subsequently everything on a system. > > This is almost probably a kernel issue and asterisk just does something > that triggers this bug. > > Regards, > Faidon I had an application in 2.6.5 (SLES9)...that would hang XFS. The underlying application was multi-threaded and both threads were doing full disks syncs every so often, and sometimes when doing the full disk sync the XFS subsystem would deadlock, it appeared to me tha one sync had a lock and was waiting for another, and the other process had the second lock and was waiting for the first... We were able to disable the full disk sync from the application and the deadlock went away. All non-xfs filesytems still worked and could still be accessed. 
I did report the bug with some traces, but I don't believe anyone ever determined where the underlying issue was. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 49+ messages in thread
* Which kernel options should be enabled to find the root cause of this bug? 2009-10-17 22:34 ` Justin Piszcz @ 2009-11-24 13:08 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-11-24 13:08 UTC (permalink / raw) To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz On Sat, 17 Oct 2009, Justin Piszcz wrote: > Hello, > > I have a system I recently upgraded from 2.6.30.x and after approximately > 24-48 hours--sometimes longer, the system cannot write any more files to disk > (luckily though I can still write to /dev/shm) -- to which I have > saved the sysrq-t and sysrq-w output: > > http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt > http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt > > Configuration: > > $ cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md1 : > active raid1 sdb2[1] sda2[0] > 136448 blocks [2/2] [UU] > > md2 : active raid1 sdb3[1] sda3[0] > 129596288 blocks [2/2] [UU] > > md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] > sdc1[0] > 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU] > > md0 : active raid1 sdb1[1] sda1[0] > 16787776 blocks [2/2] [UU] > > $ mount > /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144) > tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755) > proc on /proc type proc (rw,noexec,nosuid,nodev) > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev) > udev on /dev type tmpfs (rw,mode=0755) > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620) > /dev/md1 on /boot type ext3 (rw,noatime) > /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144) > rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) > nfsd on /proc/fs/nfsd type nfsd (rw) > > Distribution: Debian Testing > Arch: x86_64 > > The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem > persists. 
> > Here is a snippet of two processes in D-state, the first was not doing > anything, the second was mrtg. > > [121444.684000] pickup D 0000000000000003 0 18407 4521 > 0x00000000 > [121444.684000] ffff880231dd2290 0000000000000086 0000000000000000 > 0000000000000000 > [121444.684000] 000000000000ff40 000000000000c8c8 ffff880176794d10 > ffff880176794f90 > [121444.684000] 000000032266dd08 ffff8801407a87f0 ffff8800280878d8 > ffff880176794f90 > [121444.684000] Call Trace: > [121444.684000] [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0 > [121444.684000] [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160 > [121444.684000] [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40 > [121444.684000] [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70 > [121444.684000] [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90 > [121444.684000] [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80 > [121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b > > [121444.684000] rateup D 0000000000000000 0 18538 18465 > 0x00000000 > [121444.684000] ffff88023f8a8c10 0000000000000082 0000000000000000 > ffff88023ea09ec8 > [121444.684000] 000000000000ff40 000000000000c8c8 ffff88023faace50 > ffff88023faad0d0 > [121444.684000] 0000000300003e00 000000010720cc78 0000000000003e00 > ffff88023faad0d0 > [121444.684000] Call Trace: > [121444.684000] [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90 > [121444.684000] [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50 > [121444.684000] [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0 > [121444.684000] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0 > [121444.684000] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10 > [121444.684000] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80 > [121444.684000] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40 > [121444.684000] [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130 > [121444.684000] [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0 > [121444.684000] [<ffffffff811b41e8>] ? 
xfs_bmap_btalloc+0x598/0xa40 > [121444.684000] [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0 > [121444.684000] [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0 > [121444.684000] [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220 > [121444.684000] [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0 > [121444.684000] [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220 > [121444.684000] [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300 > [121444.684000] [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0 > [121444.684000] [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30 > [121444.684000] [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0 > [121444.684000] [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130 > [121444.684000] [<ffffffff8108b21a>] ? __writepage+0xa/0x40 > [121444.684000] [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0 > [121444.684000] [<ffffffff8108b210>] ? __writepage+0x0/0x40 > [121444.684000] [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130 > [121444.684000] [<ffffffff8108bd30>] ? do_writepages+0x20/0x40 > [121444.684000] [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60 > [121444.684000] [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0 > [121444.684000] [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0 > [121444.684000] [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20 > [121444.684000] [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0 > [121444.684000] [<ffffffff810af556>] ? filp_close+0x56/0x90 > [121444.684000] [<ffffffff810af636>] ? sys_close+0xa6/0x100 > [121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b > > Anyone know what is going on here? > > Justin. > In addition to using netconsole, which kernel options should be enabled to better diagnose this issue? Should I enable these to help track down this bug? [ ] XFS Debugging support (EXPERIMENTAL) [ ] Compile the kernel with frame pointers Are there any other options that will help determine the root cause of this bug that are recommended? Justin. 
^ permalink raw reply [flat|nested] 49+ messages in thread
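For anyone triaging dumps like the ones linked above, the blocked tasks can be pulled out mechanically. A small sketch, assuming the "[timestamp] name D ..." header format shown in the traces quoted in this message (the sample file here is constructed for illustration, not one of the actual dumps):

```shell
# List task names the kernel reported in D (uninterruptible sleep) state.
# The sample reuses the trace-header format from the message above.
cat > /tmp/sysrq-w.sample <<'EOF'
[121444.684000] pickup          D 0000000000000003     0 18407   4521 0x00000000
[121444.684000] Call Trace:
[121444.684000] rateup          D 0000000000000000     0 18538  18465 0x00000000
EOF
# Match "[<timestamp>] <task> D ..." headers and print the task column.
grep -E '^\[ *[0-9.]+\] +[^ ]+ +D ' /tmp/sysrq-w.sample | awk '{print $2}' | sort -u
# prints: pickup, then rateup, one per line
```

Running the same pipeline over a full sysrq-w capture gives a quick inventory of which processes were stuck at the moment of the dump.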
* Re: Which kernel options should be enabled to find the root cause of this bug? 2009-11-24 13:08 ` Justin Piszcz @ 2009-11-24 15:14 ` Eric Sandeen -1 siblings, 0 replies; 49+ messages in thread From: Eric Sandeen @ 2009-11-24 15:14 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs Justin Piszcz wrote: > > > On Sat, 17 Oct 2009, Justin Piszcz wrote: > >> Hello, >> >> I have a system I recently upgraded from 2.6.30.x and after >> approximately 24-48 hours--sometimes longer, the system cannot write >> any more files to disk (luckily though I can still write to /dev/shm) >> -- to which I have >> saved the sysrq-t and sysrq-w output: >> >> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt >> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt Unfortunately it looks like a lot of the sysrq-t, at least, was lost. The sysrq-w trace has the "show blocked state" start a ways down the file, for anyone playing along at home ;) Other things you might try are a sysrq-m to get memory state... 
>> Configuration: >> >> $ cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md1 >> : active raid1 sdb2[1] sda2[0] >> 136448 blocks [2/2] [UU] >> >> md2 : active raid1 sdb3[1] sda3[0] >> 129596288 blocks [2/2] [UU] >> >> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] >> sdd1[1] sdc1[0] >> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU] >> >> md0 : active raid1 sdb1[1] sda1[0] >> 16787776 blocks [2/2] [UU] >> >> $ mount >> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144) >> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755) >> proc on /proc type proc (rw,noexec,nosuid,nodev) >> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev) >> udev on /dev type tmpfs (rw,mode=0755) >> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) >> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620) >> /dev/md1 on /boot type ext3 (rw,noatime) >> /dev/md3 on /r/1 type xfs >> (rw,noatime,nobarrier,logbufs=8,logbsize=262144) >> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) >> nfsd on /proc/fs/nfsd type nfsd (rw) Do you get the same behavior if you don't add the log options at mount time? Kind of grasping at straws here for now ... >> Distribution: Debian Testing >> Arch: x86_64 >> >> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem >> persists. >> ... > In addition to using netconsole, which kernel options should be enabled > to better diagnose this issue? > > Should I enable these to help track down this bug? > > [ ] XFS Debugging support (EXPERIMENTAL) > [ ] Compile the kernel with frame pointers The former probably won't hurt; the latter might give us better backtraces. > Are there any other options that will help determine the root cause of this > bug that are recommended? Not that I can think of off hand ... -Eric > Justin. 
^ permalink raw reply [flat|nested] 49+ messages in thread
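The two checkboxes discussed above correspond to kernel config symbols. A .config fragment covering them plus netconsole (symbol names as found in 2.6.31-era Kconfig — verify against your own tree, and note that XFS debugging enables assertions that can shut the filesystem down on failure):

```shell
# .config fragment (sketch); rebuild the kernel after setting these.
CONFIG_XFS_DEBUG=y       # "XFS Debugging support (EXPERIMENTAL)"
CONFIG_FRAME_POINTER=y   # "Compile the kernel with frame pointers"
CONFIG_NETCONSOLE=y      # log kernel messages over the network
```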
* Re: Which kernel options should be enabled to find the root cause of this bug? 2009-11-24 15:14 ` Eric Sandeen @ 2009-11-24 16:20 ` Justin Piszcz -1 siblings, 0 replies; 49+ messages in thread From: Justin Piszcz @ 2009-11-24 16:20 UTC (permalink / raw) To: Eric Sandeen; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs On Tue, 24 Nov 2009, Eric Sandeen wrote: > Justin Piszcz wrote: >> >> >> On Sat, 17 Oct 2009, Justin Piszcz wrote: >> >>> Hello, >>> >>> I have a system I recently upgraded from 2.6.30.x and after >>> approximately 24-48 hours--sometimes longer, the system cannot write >>> any more files to disk (luckily though I can still write to /dev/shm) >>> -- to which I have >>> saved the sysrq-t and sysrq-w output: >>> >>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt >>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt > > Unfortunately it looks like a lot of the sysrq-t, at least, was lost. Yes, when this occurred the first few times, I could only grab what's in dmesg to the ramdisk; trying to access any file system other than the ramdisk (tmpfs, /dev/shm) will cause the process to lock up. > > The sysrq-w trace has the "show blocked state" start a ways down the file, > for anyone playing along at home ;) > > Other things you might try are a sysrq-m to get memory state... 
I actually performed most of the useful sysrq commands; please see the following: wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt > >>> Configuration: >>> >>> $ cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md1 >>> : active raid1 sdb2[1] sda2[0] >>> 136448 blocks [2/2] [UU] >>> >>> md2 : active raid1 sdb3[1] sda3[0] >>> 129596288 blocks [2/2] [UU] >>> >>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] >>> sdd1[1] sdc1[0] >>> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU] >>> >>> md0 : active raid1 sdb1[1] sda1[0] >>> 16787776 blocks [2/2] [UU] >>> >>> $ mount >>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144) >>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755) >>> proc on /proc type proc (rw,noexec,nosuid,nodev) >>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev) >>> udev on /dev type tmpfs (rw,mode=0755) >>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) >>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620) >>> /dev/md1 on /boot type ext3 (rw,noatime) >>> /dev/md3 on /r/1 type xfs >>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144) >>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) >>> nfsd on /proc/fs/nfsd type nfsd (rw) > > Do you get the same behavior if you don't add the log options at mount time? I have not tried disabling the log options, although they have been in effect for a long time (logbufs and logbsize; the nobarrier option was added recently). Could there be an issue using -o nobarrier on a raid1+xfs? 
^ permalink raw reply [flat|nested] 49+ messages in thread
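For reference, the SysRq dumps listed in the message above can be produced from a shell as well as from the keyboard, which is handy over a serial or netconsole session (a sketch; the writes require root, and output goes to dmesg/the console):

```shell
echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
echo w > /proc/sysrq-trigger      # show blocked (D-state) tasks
echo m > /proc/sysrq-trigger      # show memory usage
echo t > /proc/sysrq-trigger      # show all task states
```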
* Re: Which kernel options should be enabled to find the root cause of this bug? 2009-11-24 16:20 ` Justin Piszcz @ 2009-11-24 16:23 ` Eric Sandeen -1 siblings, 0 replies; 49+ messages in thread From: Eric Sandeen @ 2009-11-24 16:23 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs Justin Piszcz wrote: > > On Tue, 24 Nov 2009, Eric Sandeen wrote: ... >> Do you get the same behavior if you don't add the log options at mount time? > I have not tried disabling the log options, although they have been in effect > for a long time, (the logsbufs and bufsize and recently) the nobarrier > support. Could there be an issue using -o nobarrier on a raid1+xfs? nobarrier should not cause problems. -Eric ^ permalink raw reply [flat|nested] 49+ messages in thread
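One low-effort way to act on Eric's earlier suggestion of dropping the custom mount options is a live remount (a sketch; XFS accepts barrier/nobarrier on remount in kernels of this era, and /r/1 is the RAID5 mount point from the configuration quoted in this thread — run as root):

```shell
mount -o remount,barrier /r/1   # revert to default write-barrier behaviour
grep ' /r/1 ' /proc/mounts      # confirm which options are now active
```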
end of thread, other threads:[~2009-11-24 16:23 UTC | newest] Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-10-17 22:34 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz 2009-10-17 22:34 ` Justin Piszcz 2009-10-18 20:17 ` Justin Piszcz 2009-10-18 20:17 ` Justin Piszcz 2009-10-19 3:04 ` Dave Chinner 2009-10-19 3:04 ` Dave Chinner 2009-10-19 10:18 ` Justin Piszcz 2009-10-19 10:18 ` Justin Piszcz 2009-10-20 0:33 ` Dave Chinner 2009-10-20 0:33 ` Dave Chinner 2009-10-20 8:33 ` Justin Piszcz 2009-10-20 8:33 ` Justin Piszcz 2009-10-21 10:19 ` Justin Piszcz 2009-10-21 10:19 ` Justin Piszcz 2009-10-21 14:17 ` mdadm --detail showing annoying device Stephane Bunel 2009-10-21 21:46 ` Neil Brown 2009-10-22 11:22 ` Stephane Bunel 2009-10-29 3:44 ` Neil Brown 2009-11-03 9:37 ` Stephane Bunel 2009-11-03 10:09 ` Beolach 2009-11-03 12:16 ` Stephane Bunel 2009-10-22 11:29 ` Mario 'BitKoenig' Holbe 2009-10-22 14:17 ` Stephane Bunel 2009-10-22 16:00 ` Stephane Bunel 2009-10-22 22:49 ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz 2009-10-22 22:49 ` Justin Piszcz 2009-10-22 23:00 ` Dave Chinner 2009-10-22 23:00 ` Dave Chinner 2009-10-26 11:24 ` Justin Piszcz 2009-10-26 11:24 ` Justin Piszcz 2009-11-02 21:46 ` Justin Piszcz 2009-11-02 21:46 ` Justin Piszcz 2009-11-20 20:39 ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz 2009-11-20 20:39 ` Justin Piszcz 2009-11-20 23:44 ` Bug#557262: " Faidon Liambotis 2009-11-20 23:44 ` Faidon Liambotis 2009-11-20 23:44 ` Faidon Liambotis 2009-11-20 23:51 ` Justin Piszcz 2009-11-20 23:51 ` Justin Piszcz 2009-11-21 14:29 ` Roger Heflin 2009-11-21 14:29 ` Roger Heflin 2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? 
Justin Piszcz 2009-11-24 13:08 ` Justin Piszcz 2009-11-24 15:14 ` Eric Sandeen 2009-11-24 15:14 ` Eric Sandeen 2009-11-24 16:20 ` Justin Piszcz 2009-11-24 16:20 ` Justin Piszcz 2009-11-24 16:23 ` Eric Sandeen 2009-11-24 16:23 ` Eric Sandeen