* 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
From: Justin Piszcz @ 2009-10-17 22:34 UTC (permalink / raw)
To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz
Hello,
I have a system I recently upgraded from 2.6.30.x. After approximately
24-48 hours (sometimes longer), it can no longer write any files to
disk; luckily, I can still write to /dev/shm, which is where I have
saved the sysrq-t and sysrq-w output:
http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
Configuration:
$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
136448 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
129596288 blocks [2/2] [UU]
md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
md0 : active raid1 sdb1[1] sda1[0]
16787776 blocks [2/2] [UU]
$ mount
/dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw,noatime)
/dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
Distribution: Debian Testing
Arch: x86_64
The problem occurs with 2.6.31; I upgraded to 2.6.31.4 and it
persists.
Here is a snippet of two processes in D-state; the first was not doing
anything, the second was mrtg:
[121444.684000] pickup D 0000000000000003 0 18407 4521 0x00000000
[121444.684000] ffff880231dd2290 0000000000000086 0000000000000000 0000000000000000
[121444.684000] 000000000000ff40 000000000000c8c8 ffff880176794d10 ffff880176794f90
[121444.684000] 000000032266dd08 ffff8801407a87f0 ffff8800280878d8 ffff880176794f90
[121444.684000] Call Trace:
[121444.684000] [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
[121444.684000] [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
[121444.684000] [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
[121444.684000] [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
[121444.684000] [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
[121444.684000] [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
[121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
[121444.684000] rateup D 0000000000000000 0 18538 18465 0x00000000
[121444.684000] ffff88023f8a8c10 0000000000000082 0000000000000000 ffff88023ea09ec8
[121444.684000] 000000000000ff40 000000000000c8c8 ffff88023faace50 ffff88023faad0d0
[121444.684000] 0000000300003e00 000000010720cc78 0000000000003e00 ffff88023faad0d0
[121444.684000] Call Trace:
[121444.684000] [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
[121444.684000] [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
[121444.684000] [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
[121444.684000] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[121444.684000] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[121444.684000] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[121444.684000] [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
[121444.684000] [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
[121444.684000] [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
[121444.684000] [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
[121444.684000] [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
[121444.684000] [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
[121444.684000] [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
[121444.684000] [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
[121444.684000] [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
[121444.684000] [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
[121444.684000] [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
[121444.684000] [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
[121444.684000] [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
[121444.684000] [<ffffffff8108b21a>] ? __writepage+0xa/0x40
[121444.684000] [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
[121444.684000] [<ffffffff8108b210>] ? __writepage+0x0/0x40
[121444.684000] [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
[121444.684000] [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
[121444.684000] [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
[121444.684000] [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
[121444.684000] [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
[121444.684000] [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
[121444.684000] [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
[121444.684000] [<ffffffff810af556>] ? filp_close+0x56/0x90
[121444.684000] [<ffffffff810af636>] ? sys_close+0xa6/0x100
[121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
Anyone know what is going on here?
Justin.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
From: Justin Piszcz @ 2009-10-18 20:17 UTC (permalink / raw)
To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz
On Sat, 17 Oct 2009, Justin Piszcz wrote:
> Hello,
It has happened again; this time, all of the sysrq-X output was saved:
wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
Kernel configuration:
wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.30.9.txt
wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.31.4.txt
Diff of the two configs:
$ diff config-2.6.30.9.txt config-2.6.31.4.txt |grep -v "#"|grep "_"
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_CONSTRUCTORS=y
> CONFIG_HAVE_PERF_COUNTERS=y
> CONFIG_HAVE_DMA_ATTRS=y
> CONFIG_BLK_DEV_BSG=y
> CONFIG_X86_NEW_MCE=y
> CONFIG_X86_THERMAL_VECTOR=y
< CONFIG_UNEVICTABLE_LRU=y
< CONFIG_PHYSICAL_START=0x200000
> CONFIG_PHYSICAL_START=0x1000000
< CONFIG_PHYSICAL_ALIGN=0x200000
> CONFIG_PHYSICAL_ALIGN=0x1000000
< CONFIG_COMPAT_NET_DEV_OPS=y
< CONFIG_SND_JACK=y
> CONFIG_HID_DRAGONRISE=y
> CONFIG_HID_GREENASIA=y
> CONFIG_HID_SMARTJOYPLUS=y
> CONFIG_HID_THRUSTMASTER=y
> CONFIG_HID_ZEROPLUS=y
> CONFIG_FSNOTIFY=y
> CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
> CONFIG_HAVE_ARCH_KMEMCHECK=y
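For reference, the config comparison above can be tightened so that only changed CONFIG_ lines survive. This is a minimal sketch using throwaway sample files (illustrative stand-ins, not the real 2.6.30.9/2.6.31.4 configs):

```shell
# Sketch: compare two kernel .config files and keep only the
# CONFIG_ options that actually changed between them.
cat > old.config <<'EOF'
CONFIG_PHYSICAL_START=0x200000
CONFIG_SND_JACK=y
# CONFIG_FSNOTIFY is not set
EOF
cat > new.config <<'EOF'
CONFIG_PHYSICAL_START=0x1000000
CONFIG_FSNOTIFY=y
# CONFIG_SND_JACK is not set
EOF
# Anchoring on the diff markers ('<' = only in old, '>' = only in new)
# keeps the changed lines while dropping the "# ... is not set" noise.
diff old.config new.config | grep '^[<>] CONFIG_'
```

Compared with `grep -v "#" | grep "_"`, anchoring on `^[<>] CONFIG_` avoids matching diff context lines or unrelated underscores.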
I have reverted to 2.6.30.9 to see whether the problem recurs with
that kernel version.
I do not recall seeing this on the older 2.6.30.x kernels:
[ 9.276427] md3: detected capacity change from 0 to 5251073572864
[ 9.277411] md2: detected capacity change from 0 to 132706598912
[ 9.278305] md1: detected capacity change from 0 to 139722752
[ 9.278921] md0: detected capacity change from 0 to 17190682624
Again, some more D-state processes:
[76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
[76325.608087] Call Trace:
[76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608103] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
[76325.608214] Call Trace:
[76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
[76325.608220] [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
[76325.608222] [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
[76325.608225] [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
[76325.608228] [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
[76325.608231] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
[76325.608233] [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
[76325.608236] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
All of the D-state processes:
$ cat sysrq-w.txt |grep ' D'
[76307.285125] alpine D 0000000000000000 0 7659 29120 0x00000000
[76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
[76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
[76325.608257] syslogd D 0000000000000002 0 2438 1 0x00000000
[76325.608318] freshclam D 0000000000000000 0 2877 1 0x00000000
[76325.608428] asterisk D 0000000000000001 0 3278 1 0x00000000
[76325.608492] console-kit-d D 0000000000000000 0 3299 1 0x00000000
[76325.608562] dhcpd3 D 0000000000000000 0 3554 1 0x00000000
[76325.608621] plasma-deskto D 0000000000000002 0 32482 1 0x00000000
[76325.608713] kaccess D 0000000000000001 0 32488 1 0x00000000
[76325.608752] mail D 0000000000000000 0 7397 7386 0x00000000
[76325.608830] hal-acl-tool D 0000000000000000 0 7430 3399 0x00000004
[76325.608888] mrtg D 0000000000000000 0 7444 7433 0x00000000
[76325.608981] cron D 0000000000000000 0 7500 3630 0x00000000
[76325.609000] alpine D 0000000000000000 0 7659 29120 0x00000000
List of functions appearing beneath the D-state processes (sorted and uniqued):
121 [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
77 [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
62 [<ffffffff814543a5>] ? schedule_timeout+0x165/0x1a0
60 [<ffffffff813bc1f6>] ? __alloc_skb+0x66/0x170
60 [<ffffffff813b3e59>] ? sys_sendto+0x119/0x180
59 [<ffffffff81428397>] ? unix_dgram_sendmsg+0x467/0x5c0
59 [<ffffffff81427ce6>] ? unix_wait_for_peer+0x86/0xd0
59 [<ffffffff813bd497>] ? memcpy_fromiovec+0x57/0x80
59 [<ffffffff813b6c29>] ? sock_alloc_send_pskb+0x1d9/0x2f0
59 [<ffffffff813b3a4b>] ? sock_sendmsg+0xcb/0x100
59 [<ffffffff813b3062>] ? sockfd_lookup_light+0x22/0x80
58 [<ffffffff814287ed>] ? unix_dgram_connect+0xad/0x270
58 [<ffffffff813b3336>] ? sys_connect+0x86/0xe0
57 [<ffffffff81427ed5>] ? unix_find_other+0x1a5/0x200
57 [<ffffffff810c9d13>] ? mntput_no_expire+0x23/0xf0
57 [<ffffffff810a3e74>] ? page_add_new_anon_rmap+0x54/0x90
57 [<ffffffff8105947e>] ? current_fs_time+0x1e/0x30
55 [<ffffffff81085445>] ? filemap_fault+0x95/0x3e0
8 [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
7 [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
7 [<ffffffff810af727>] ? do_sys_open+0x97/0x150
6 [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
5 [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
4 [<ffffffff811f5284>] ? xfs_file_fsync+0x54/0x70
4 [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
4 [<ffffffff811f0242>] ? kmem_zone_zalloc+0x32/0x50
4 [<ffffffff811f01d3>] ? kmem_zone_alloc+0x83/0xc0
4 [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
4 [<ffffffff810d3a4b>] ? sys_fsync+0xb/0x20
4 [<ffffffff810d39f6>] ? do_fsync+0x36/0x60
4 [<ffffffff810d394e>] ? vfs_fsync+0x9e/0x110
4 [<ffffffff810bbcde>] ? __link_path_walk+0x7e/0x1000
3 [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
3 [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
3 [<ffffffff811f7b82>] ? xfs_vn_mknod+0x82/0x130
3 [<ffffffff811eeab1>] ? xfs_fsync+0x141/0x190
3 [<ffffffff811e8f1b>] ? _xfs_trans_commit+0x38b/0x3a0
3 [<ffffffff811ddfac>] ? xlog_grant_log_space+0x28c/0x3c0
3 [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
3 [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
3 [<ffffffff811dc1b0>] ? xfs_log_release_iclog+0x10/0x40
3 [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
3 [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
3 [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
3 [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
3 [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
3 [<ffffffff810c80ca>] ? alloc_fd+0x4a/0x140
3 [<ffffffff810c2110>] ? pollwake+0x0/0x60
3 [<ffffffff810c0b88>] ? poll_freewait+0x48/0xb0
3 [<ffffffff810be8ee>] ? do_filp_open+0x9ee/0xac0
3 [<ffffffff810be134>] ? do_filp_open+0x234/0xac0
3 [<ffffffff810baeb6>] ? vfs_create+0xa6/0xf0
3 [<ffffffff810b51d7>] ? vfs_fstatat+0x37/0x80
3 [<ffffffff810ad46d>] ? kmem_cache_alloc+0x6d/0xa0
3 [<ffffffff8104aca3>] ? __wake_up+0x43/0x70
2 [<ffffffff81455797>] ? __down_write_nested+0x17/0xb0
2 [<ffffffff81455151>] ? __down+0x61/0xa0
2 [<ffffffff81454e85>] ? do_nanosleep+0x95/0xd0
2 [<ffffffff81454dbd>] ? schedule_hrtimeout_range+0x11d/0x140
2 [<ffffffff81454359>] ? schedule_timeout+0x119/0x1a0
2 [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
2 [<ffffffff811f4b82>] ? xfs_buf_read_flags+0x12/0xa0
2 [<ffffffff811f4a4e>] ? xfs_buf_get_flags+0x6e/0x190
2 [<ffffffff811f48f4>] ? _xfs_buf_find+0x134/0x220
2 [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
2 [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
2 [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
2 [<ffffffff811ed872>] ? xfs_create+0x312/0x530
2 [<ffffffff811eb6e8>] ? xfs_dir_ialloc+0xa8/0x340
2 [<ffffffff811ea4a6>] ? xfs_trans_read_buf+0x1e6/0x360
2 [<ffffffff811dc337>] ? xlog_state_sync+0x157/0x2a0
2 [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
2 [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
2 [<ffffffff810c31dc>] ? dput+0xac/0x160
2 [<ffffffff810c29d3>] ? d_kill+0x53/0x70
2 [<ffffffff810b9b38>] ? generic_permission+0x78/0x130
2 [<ffffffff8109a9a5>] ? handle_mm_fault+0x1b5/0x780
2 [<ffffffff810987fa>] ? __do_fault+0x3ca/0x4b0
2 [<ffffffff8108cc30>] ? pdflush+0x0/0x220
2 [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
2 [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
2 [<ffffffff8106e016>] ? down+0x46/0x50
2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
1 [<ffffffff81455857>] ? __down_read+0x17/0xae
1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
1 [<ffffffff810af556>] ? filp_close+0x56/0x90
1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20
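The sorted/uniqued list above can be regenerated from any of the dumps with a short pipeline. A sketch follows, using a tiny inline sample in place of the real sysrq-t.txt:

```shell
# Sketch: tally call-trace entries by symbol, most frequent first.
# The heredoc stands in for a real sysrq-t/sysrq-w capture.
cat > trace.txt <<'EOF'
[76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
EOF
# Strip the timestamp/address prefix and the +offset/size suffix,
# leaving just the symbol name, then build the histogram.
# xlog_state_sync appears twice, so it sorts to the top.
sed -n 's/.*? \([A-Za-z_][A-Za-z0-9_]*\)+.*/\1/p' trace.txt \
    | sort | uniq -c | sort -rn
```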
2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
2 [<ffffffff8106e016>] ? down+0x46/0x50
2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
1 [<ffffffff81455857>] ? __down_read+0x17/0xae
1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
1 [<ffffffff810af556>] ? filp_close+0x56/0x90
1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20
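For reference, a frame-frequency summary like the one above can be produced from a raw sysrq-t dump by extracting the function name from each stack line and counting repeats. A minimal sketch; the sample file and its contents are invented for illustration, and a real run would point at the saved sysrq-t.txt instead:

```shell
#!/bin/sh
# Count how often each kernel function appears across all stack traces
# in a sysrq-t style dump, most frequent first.
cat > /tmp/sysrq-t.sample <<'EOF'
[121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000]  [<ffffffff810d39f6>] ? do_fsync+0x36/0x60
EOF
# Pull the symbol after "] ? " from each frame, then tally.
sed -n 's/.*] ? \([A-Za-z_][A-Za-z_0-9]*\).*/\1/p' /tmp/sysrq-t.sample \
    | sort | uniq -c | sort -rn
```

The counts on the left of the list above appear to have been generated the same way, with the full `[<addr>] ? frame` text retained rather than just the symbol.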
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-18 20:17 ` Justin Piszcz
@ 2009-10-19 3:04 ` Dave Chinner
-1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-19 3:04 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again, all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
> [76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
> [76325.608220] [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222] [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
> [76325.608225] [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
> [76325.608228] [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233] [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:
All pointing to log IO not completing.
That is, all of the D state processes are backed up on locks or
waiting for IO completion processing. A lot of the processes are
waiting for _xfs_log_force to complete, others are waiting for
inodes to be unpinned or are backed up behind locked inodes that are
waiting on log IO to complete before they can complete the
transaction and unlock the inode, and so on.
Unfortunately, the xfslogd and xfsdatad kernel threads are not
present in any of the output given, so I can't tell if these have
deadlocked themselves and caused the problem. However, my experience
with such pile-ups is that an I/O completion has not been run for
some reason and that is the cause of the problem. I don't know if
you can provide enough information to tell us if this happened or
not. Instead, do you have a test case that you can share?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
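Before digging further into a pile-up like the one Dave describes, a first check is whether the xfslogd/xfsdatad completion threads were captured in the saved dump at all. A hedged sketch — the sample dump content is invented, and in practice this would run against the real sysrq-t.txt:

```shell
#!/bin/sh
# Report whether the XFS I/O-completion threads (xfslogd/xfsdatad on
# 2.6.31) show up anywhere in a saved sysrq-t dump.
cat > /tmp/sysrq-threads.sample <<'EOF'
[76325.608073] pdflush         D 0000000000000001     0   362      2
[76325.608202] xfssyncd        D 0000000000000000     0   831      2
EOF
if grep -qE 'xfslogd|xfsdatad' /tmp/sysrq-threads.sample; then
    echo "completion threads present in dump"
else
    echo "completion threads missing from dump"
fi
```

On a live system, once the PIDs are known, kernels that expose it also allow reading a thread's current stack from /proc/&lt;pid&gt;/stack.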
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-19 3:04 ` Dave Chinner
@ 2009-10-19 10:18 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-19 10:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Mon, 19 Oct 2009, Dave Chinner wrote:
> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again, all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
Hello,
So far I do not have a reproducible test case. The only other thing not posted
was the output of ps auxww at the time of the lockup; I am not sure whether it
will help, but here it is:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
root 2 0.0 0.0 0 0 ? S< Oct16 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/1]
root 6 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/1]
root 7 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/2]
root 8 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/2]
root 9 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/3]
root 10 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/3]
root 11 0.0 0.0 0 0 ? R< Oct16 0:00 [events/0]
root 12 0.0 0.0 0 0 ? S< Oct16 0:00 [events/1]
root 13 0.0 0.0 0 0 ? S< Oct16 0:00 [events/2]
root 14 0.0 0.0 0 0 ? S< Oct16 0:00 [events/3]
root 15 0.0 0.0 0 0 ? S< Oct16 0:00 [khelper]
root 20 0.0 0.0 0 0 ? S< Oct16 0:00 [async/mgr]
root 180 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/0]
root 181 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/1]
root 182 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/2]
root 183 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/3]
root 185 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpid]
root 186 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_notify]
root 187 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_hotplug]
root 271 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/0]
root 272 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/1]
root 273 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/2]
root 274 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/3]
root 275 0.0 0.0 0 0 ? S< Oct16 0:00 [ata_aux]
root 276 0.0 0.0 0 0 ? S< Oct16 0:00 [ksuspend_usbd]
root 280 0.0 0.0 0 0 ? S< Oct16 0:00 [khubd]
root 283 0.0 0.0 0 0 ? S< Oct16 0:00 [kseriod]
root 318 0.0 0.0 0 0 ? S< Oct16 0:00 [khpsbpkt]
root 361 0.0 0.0 0 0 ? S Oct16 0:00 [pdflush]
root 362 0.0 0.0 0 0 ? D Oct16 0:43 [pdflush]
root 363 0.0 0.0 0 0 ? S< Oct16 0:21 [kswapd0]
root 364 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/0]
root 365 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/1]
root 366 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/2]
root 367 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/3]
root 368 0.0 0.0 0 0 ? S< Oct16 0:00 [nfsiod]
root 369 0.0 0.0 0 0 ? S< Oct16 0:00 [cifsoplockd]
root 370 0.0 0.0 0 0 ? S< Oct16 0:00 [xfs_mru_cache]
root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
root 518 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_0]
root 521 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_1]
root 524 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_2]
root 527 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_3]
root 530 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_4]
root 533 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_5]
root 542 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_6]
root 545 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_7]
root 551 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_8]
root 554 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_9]
root 558 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_10]
root 562 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_11]
root 568 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_12]
root 571 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_13]
root 584 0.0 0.0 0 0 ? S< Oct16 0:00 [knodemgrd_0]
root 616 0.0 0.0 0 0 ? S< Oct16 0:00 [kpsmoused]
root 666 0.0 0.0 0 0 ? S< Oct16 0:00 [usbhid_resumer]
root 683 0.0 0.0 0 0 ? S< Oct16 0:00 [hd-audio0]
root 703 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/0]
root 704 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/1]
root 705 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/2]
root 706 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/3]
root 811 4.0 0.0 0 0 ? S< Oct16 81:54 [md3_raid5]
root 817 0.3 0.0 0 0 ? S< Oct16 6:30 [md2_raid1]
root 823 0.0 0.0 0 0 ? S< Oct16 0:00 [md1_raid1]
root 827 0.0 0.0 0 0 ? S< Oct16 0:50 [md0_raid1]
root 829 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd]
root 830 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild]
root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd]
root 884 0.0 0.0 16736 740 ? S<s Oct16 0:00 udevd --daemon
postfix 1649 0.0 0.0 39124 2468 ? S 05:00 0:00 qmgr -l -t fifo -u -c
www-data 1877 0.0 0.0 146612 5248 ? S 05:01 0:00 /usr/sbin/apache2 -k start
root 3182 0.0 0.0 0 0 ? S< Oct16 0:00 [kjournald]
root 3183 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd]
root 3184 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild]
root 3185 0.0 0.0 0 0 ? S< Oct16 0:00 [xfssyncd]
root 3337 0.0 0.0 8072 132 ? Ss Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d
root 3339 0.0 0.0 8072 424 ? Ds Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d
daemon 3526 0.0 0.0 8016 564 ? Ss Oct16 0:00 /sbin/portmap
statd 3538 0.0 0.0 10148 776 ? Ss Oct16 0:00 /sbin/rpc.statd
root 3547 0.0 0.0 26956 572 ? Ss Oct16 0:00 /usr/sbin/rpc.idmapd
root 3732 0.0 0.0 5900 676 ? Ds Oct16 0:00 /sbin/syslogd
root 3741 0.0 0.0 3796 452 ? Ss Oct16 0:00 /sbin/klogd -x
root 3750 0.0 0.0 3804 640 ? Ss Oct16 0:00 /usr/sbin/acpid
110 3760 0.0 0.0 23560 1476 ? Ss Oct16 0:02 /usr/bin/dbus-daemon --system
bind 3773 0.0 0.8 264364 70004 ? Ssl Oct16 0:05 /usr/sbin/named -u bind -S 1024
root 3796 0.0 0.0 49028 1260 ? Ss Oct16 0:00 /usr/sbin/sshd
root 3827 0.0 0.0 104804 7452 ? Ssl Oct16 0:08 /usr/sbin/console-kit-daemon
amavis 3909 0.0 1.1 217948 91004 ? Ss Oct16 0:00 amavisd (master)
nobody 3913 0.0 0.2 118612 21632 ? Sl Oct16 0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid
polw 3932 0.0 0.1 51868 12084 ? Ss Oct16 0:00 policyd-weight (master)
polw 3933 0.0 0.1 51868 11744 ? Ss Oct16 0:00 policyd-weight (cache)
postfw 3937 0.0 0.1 55248 14064 ? Ss Oct16 0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid
postgrey 3940 0.0 0.1 57900 13232 ? Ss Oct16 0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421
clamav 4166 0.0 1.3 215576 112772 ? Ssl Oct16 0:04 /usr/sbin/clamd
clamav 4262 0.0 0.0 25776 1328 ? Ss Oct16 0:00 /usr/bin/freshclam -d --quiet
amavis 4307 0.0 1.1 224072 96732 ? S Oct16 0:02 amavisd (ch14-avail)
amavis 4308 0.0 1.2 228168 100820 ? S Oct16 0:05 amavisd (ch14-avail)
root 4353 0.0 0.0 7224 3944 ? S Oct16 0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
root 4362 0.0 0.0 8864 520 ? Ss Oct16 0:00 /usr/sbin/irqbalance
daemon 4377 0.0 0.0 38488 840 ? Ss Oct16 0:00 lpd Waiting
root 4412 0.0 0.0 0 0 ? S< Oct16 0:00 [lockd]
root 4413 0.0 0.0 0 0 ? S< Oct16 0:10 [nfsd]
root 4414 0.0 0.0 0 0 ? D< Oct16 0:03 [nfsd]
root 4415 0.0 0.0 0 0 ? D< Oct16 0:13 [nfsd]
root 4416 0.0 0.0 0 0 ? S< Oct16 0:07 [nfsd]
root 4417 0.0 0.0 0 0 ? D< Oct16 0:11 [nfsd]
root 4418 0.0 0.0 0 0 ? D< Oct16 0:01 [nfsd]
root 4419 0.0 0.0 0 0 ? D< Oct16 0:04 [nfsd]
root 4420 0.0 0.0 0 0 ? D< Oct16 0:02 [nfsd]
root 4424 0.0 0.0 18812 1208 ? Ss Oct16 0:00 /usr/sbin/rpc.mountd --manage-gids
oident 4432 0.0 0.0 12232 576 ? Ss Oct16 0:00 /usr/sbin/oidentd -m -u oident -g oident
root 4444 0.0 0.0 10124 660 ? Ss Oct16 0:00 /usr/sbin/inetd
root 4521 0.0 0.0 37020 2388 ? Ss Oct16 0:00 /usr/lib/postfix/master
root 4545 0.0 0.0 58084 1796 ? Ds Oct16 0:00 /usr/sbin/nmbd -D
root 4547 0.0 0.0 93724 3044 ? Ss Oct16 0:00 /usr/sbin/smbd -D
root 4568 0.0 0.0 93752 1776 ? S Oct16 0:00 /usr/sbin/smbd -D
root 4700 0.0 0.0 18768 1348 ? S Oct16 0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid
asterisk 4714 1.3 0.2 666972 24460 ? Ssl Oct16 26:36 /usr/sbin/asterisk -p -U asterisk
asterisk 4715 0.0 0.0 13656 880 ? S Oct16 0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
ntp 4734 0.0 0.0 23428 1400 ? Ss Oct16 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g
root 4789 0.0 0.0 16644 780 ? Ss Oct16 0:00 /usr/sbin/dovecot
root 4790 0.0 0.0 74664 3196 ? S Oct16 0:00 dovecot-auth
root 4792 0.0 0.0 74664 3240 ? S Oct16 0:00 dovecot-auth -w
111 4797 0.0 0.0 38084 4616 ? Ss Oct16 0:07 /usr/sbin/hald
root 4798 0.0 0.0 20040 1432 ? S Oct16 0:01 hald-runner
root 4895 0.0 0.0 22028 1240 ? S Oct16 0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3
root 4899 0.0 0.0 22032 1232 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled
root 4900 0.0 0.0 22028 1228 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled
111 4902 0.0 0.0 25920 1204 ? S Oct16 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root 4911 0.0 0.0 12484 656 ? Ss Oct16 0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root 4929 0.0 0.0 24020 584 ? Ss Oct16 0:00 /usr/sbin/squid -D -YC
proxy 4931 0.0 0.0 26900 5136 ? S Oct16 0:00 (squid) -D -YC
root 4942 0.0 0.0 28924 3548 ? Ss Oct16 0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345
root 4943 0.0 0.0 17392 1380 ? S Oct16 0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1
daemon 4945 0.3 0.0 16596 3172 ? S Oct16 7:01 p0f -u daemon -i eth1 -l tcp dst port 25
root 4947 0.0 0.0 93604 5832 ? SNs Oct16 0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph
root 4951 0.0 0.0 9296 3716 ? Ss Oct16 0:00 /usr/sbin/dhcpd3 -q eth0
nut 4964 0.0 0.0 14384 744 ? Ss Oct16 0:15 /lib/nut/usbhid-ups -a apc
nut 4966 0.0 0.0 14316 600 ? Ss Oct16 0:01 /sbin/upsd
root 4968 0.0 0.0 14284 728 ? Ss Oct16 0:00 /sbin/upsmon
nut 4970 0.0 0.0 14284 712 ? S Oct16 0:01 /sbin/upsmon
daemon 5007 0.0 0.0 16356 416 ? Ss Oct16 0:00 /usr/sbin/atd
root 5027 0.0 0.0 20988 1060 ? Ss Oct16 0:00 /usr/sbin/cron
dovecot 5887 0.0 0.0 18504 2108 ? S 15:31 0:00 imap-login
root 6184 0.0 0.0 73496 6380 ? Sl Oct16 0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root 6205 0.0 0.0 26292 720 ? Ss Oct16 0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc
root 6213 0.1 1.3 192020 112852 tty7 Ss+ Oct16 2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6
root 6284 0.0 0.0 5852 600 tty1 Ss+ Oct16 0:00 /sbin/getty 38400 tty1
root 6285 0.0 0.0 5852 596 tty2 Ss+ Oct16 0:00 /sbin/getty 38400 tty2
root 6286 0.0 0.0 5852 596 tty3 Ss+ Oct16 0:00 /sbin/getty 38400 tty3
root 6287 0.0 0.0 5852 600 tty4 Ss+ Oct16 0:00 /sbin/getty 38400 tty4
root 6288 0.0 0.0 5852 596 tty5 Ss+ Oct16 0:00 /sbin/getty 38400 tty5
root 6289 0.0 0.0 5852 600 tty6 Ss+ Oct16 0:00 /sbin/getty 38400 tty6
root 6297 0.0 0.0 55716 1864 ? S Oct16 0:00 -:0
postfix 6362 0.0 0.0 39076 2308 ? S 15:36 0:00 anvil -l -t unix -u -o max_idle=3600s
root 7003 0.0 0.0 80904 3412 ? Ss Oct16 0:00 sshd: ap [priv]
polw 7443 0.0 0.1 52000 12516 ? S Oct16 0:00 policyd-weight (child)
dovecot 8922 0.0 0.0 18504 2108 ? S 16:01 0:00 imap-login
dovecot 8923 0.0 0.0 18504 2112 ? S 16:01 0:00 imap-login
postfix 18407 0.0 0.0 39076 2352 ? D 17:50 0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks
asterisk 18424 0.0 0.0 36996 2260 ? D 17:54 0:00 /usr/sbin/postdrop -r
root 18425 0.0 0.0 46372 1440 ? S 17:55 0:00 /USR/SBIN/CRON
root 18459 0.0 0.0 11404 1328 ? Ss 17:55 0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi
root 18460 0.0 0.0 21888 1432 ? D 17:55 0:00 /usr/lib/hal/hal-acl-tool --reconfigure
root 18465 0.0 0.2 43416 16892 ? S 17:55 0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg
root 18479 0.0 0.0 4140 612 ? S 17:55 0:00 tee -a /var/log/mrtg/mrtg.log
root 18538 0.0 0.0 38476 1976 ? D 17:55 0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0
postfix 18539 0.0 0.1 50740 9836 ? S 17:55 0:00 cleanup -z -t unix -u -c
root 18555 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18556 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18557 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18558 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18559 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18560 0.0 0.0 46368 1172 ? S 18:02 0:00 /USR/SBIN/CRON
root 18561 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18562 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18563 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18564 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18565 0.0 0.0 46368 1172 ? S 18:09 0:00 /USR/SBIN/CRON
root 18566 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18567 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18568 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18578 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18579 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18580 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18581 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18582 0.0 0.0 46368 1172 ? S 18:17 0:00 /USR/SBIN/CRON
root 18583 0.0 0.0 46368 1172 ? S 18:18 0:00 /USR/SBIN/CRON
postfix 18584 0.0 0.0 71452 8104 ? S 18:18 0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings
postfix 18585 0.0 0.0 39076 2300 ? S 18:18 0:00 proxymap -t unix -u
sshd 18597 0.0 0.0 0 0 ? Z 18:19 0:00 [sshd] <defunct>
root 18601 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 18602 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 18603 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 30866 0.0 0.0 54392 944 ? Ss 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30867 0.0 0.0 54392 652 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30868 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30869 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30870 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 31018 0.0 0.0 146612 7856 ? Ss 04:31 0:00 /usr/sbin/apache2 -k start
www-data 31025 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start
www-data 31044 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start
root 31076 0.0 0.0 23128 756 ? Ss 04:31 0:00 pure-ftpd (SERVER)
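The D-state entries in a listing like the above can be isolated with a short filter on the STAT column. A sketch over invented sample rows; the field positions assume `ps auxww`-style output, where STAT is field 8:

```shell
#!/bin/sh
# Print PID, state, and command for processes in uninterruptible
# sleep (STAT beginning with "D") from ps auxww-style output.
cat > /tmp/ps.sample <<'EOF'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 362 0.0 0.0 0 0 ? D Oct16 0:43 [pdflush]
root 363 0.0 0.0 0 0 ? S< Oct16 0:21 [kswapd0]
root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd]
EOF
awk 'NR > 1 && $8 ~ /^D/ { print $2, $8, $11 }' /tmp/ps.sample
```

On a live system the same filter applies with `ps auxww` piped straight into the awk command.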
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
@ 2009-10-19 10:18 ` Justin Piszcz
0 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-19 10:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs
On Mon, 19 Oct 2009, Dave Chinner wrote:
> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again, all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
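[As an aside, a quick way to enumerate the D-state tasks Dave refers to, together with their kernel stacks, might be something like the sketch below. It assumes a kernel built with CONFIG_STACKTRACE and root privileges; /proc/<pid>/stack is not available otherwise.]

```shell
# List every task in uninterruptible sleep (STAT starting with D)
# and dump its in-kernel stack so the blocking point is visible.
ps -eo pid,stat,comm | awk '$2 ~ /^D/' | while read pid stat comm; do
    echo "== $pid ($comm) =="
    cat /proc/$pid/stack 2>/dev/null
done
```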
Hello,
So far I do not have a reproducible test case. The only other thing not posted
was the output of ps auxww at the time of the lockup; I'm not sure whether it
will help, but here it is:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
root 2 0.0 0.0 0 0 ? S< Oct16 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/1]
root 6 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/1]
root 7 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/2]
root 8 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/2]
root 9 0.0 0.0 0 0 ? S< Oct16 0:00 [migration/3]
root 10 0.0 0.0 0 0 ? S< Oct16 0:00 [ksoftirqd/3]
root 11 0.0 0.0 0 0 ? R< Oct16 0:00 [events/0]
root 12 0.0 0.0 0 0 ? S< Oct16 0:00 [events/1]
root 13 0.0 0.0 0 0 ? S< Oct16 0:00 [events/2]
root 14 0.0 0.0 0 0 ? S< Oct16 0:00 [events/3]
root 15 0.0 0.0 0 0 ? S< Oct16 0:00 [khelper]
root 20 0.0 0.0 0 0 ? S< Oct16 0:00 [async/mgr]
root 180 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/0]
root 181 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/1]
root 182 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/2]
root 183 0.0 0.0 0 0 ? S< Oct16 0:00 [kblockd/3]
root 185 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpid]
root 186 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_notify]
root 187 0.0 0.0 0 0 ? S< Oct16 0:00 [kacpi_hotplug]
root 271 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/0]
root 272 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/1]
root 273 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/2]
root 274 0.0 0.0 0 0 ? S< Oct16 0:00 [ata/3]
root 275 0.0 0.0 0 0 ? S< Oct16 0:00 [ata_aux]
root 276 0.0 0.0 0 0 ? S< Oct16 0:00 [ksuspend_usbd]
root 280 0.0 0.0 0 0 ? S< Oct16 0:00 [khubd]
root 283 0.0 0.0 0 0 ? S< Oct16 0:00 [kseriod]
root 318 0.0 0.0 0 0 ? S< Oct16 0:00 [khpsbpkt]
root 361 0.0 0.0 0 0 ? S Oct16 0:00 [pdflush]
root 362 0.0 0.0 0 0 ? D Oct16 0:43 [pdflush]
root 363 0.0 0.0 0 0 ? S< Oct16 0:21 [kswapd0]
root 364 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/0]
root 365 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/1]
root 366 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/2]
root 367 0.0 0.0 0 0 ? S< Oct16 0:00 [aio/3]
root 368 0.0 0.0 0 0 ? S< Oct16 0:00 [nfsiod]
root 369 0.0 0.0 0 0 ? S< Oct16 0:00 [cifsoplockd]
root 370 0.0 0.0 0 0 ? S< Oct16 0:00 [xfs_mru_cache]
root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
root 518 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_0]
root 521 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_1]
root 524 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_2]
root 527 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_3]
root 530 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_4]
root 533 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_5]
root 542 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_6]
root 545 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_7]
root 551 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_8]
root 554 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_9]
root 558 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_10]
root 562 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_11]
root 568 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_12]
root 571 0.0 0.0 0 0 ? S< Oct16 0:00 [scsi_eh_13]
root 584 0.0 0.0 0 0 ? S< Oct16 0:00 [knodemgrd_0]
root 616 0.0 0.0 0 0 ? S< Oct16 0:00 [kpsmoused]
root 666 0.0 0.0 0 0 ? S< Oct16 0:00 [usbhid_resumer]
root 683 0.0 0.0 0 0 ? S< Oct16 0:00 [hd-audio0]
root 703 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/0]
root 704 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/1]
root 705 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/2]
root 706 0.0 0.0 0 0 ? S< Oct16 0:00 [rpciod/3]
root 811 4.0 0.0 0 0 ? S< Oct16 81:54 [md3_raid5]
root 817 0.3 0.0 0 0 ? S< Oct16 6:30 [md2_raid1]
root 823 0.0 0.0 0 0 ? S< Oct16 0:00 [md1_raid1]
root 827 0.0 0.0 0 0 ? S< Oct16 0:50 [md0_raid1]
root 829 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd]
root 830 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild]
root 831 0.0 0.0 0 0 ? D< Oct16 0:00 [xfssyncd]
root 884 0.0 0.0 16736 740 ? S<s Oct16 0:00 udevd --daemon
postfix 1649 0.0 0.0 39124 2468 ? S 05:00 0:00 qmgr -l -t fifo -u -c
www-data 1877 0.0 0.0 146612 5248 ? S 05:01 0:00 /usr/sbin/apache2 -k start
root 3182 0.0 0.0 0 0 ? S< Oct16 0:00 [kjournald]
root 3183 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsbufd]
root 3184 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsaild]
root 3185 0.0 0.0 0 0 ? S< Oct16 0:00 [xfssyncd]
root 3337 0.0 0.0 8072 132 ? Ss Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d
root 3339 0.0 0.0 8072 424 ? Ds Oct16 0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d
daemon 3526 0.0 0.0 8016 564 ? Ss Oct16 0:00 /sbin/portmap
statd 3538 0.0 0.0 10148 776 ? Ss Oct16 0:00 /sbin/rpc.statd
root 3547 0.0 0.0 26956 572 ? Ss Oct16 0:00 /usr/sbin/rpc.idmapd
root 3732 0.0 0.0 5900 676 ? Ds Oct16 0:00 /sbin/syslogd
root 3741 0.0 0.0 3796 452 ? Ss Oct16 0:00 /sbin/klogd -x
root 3750 0.0 0.0 3804 640 ? Ss Oct16 0:00 /usr/sbin/acpid
110 3760 0.0 0.0 23560 1476 ? Ss Oct16 0:02 /usr/bin/dbus-daemon --system
bind 3773 0.0 0.8 264364 70004 ? Ssl Oct16 0:05 /usr/sbin/named -u bind -S 1024
root 3796 0.0 0.0 49028 1260 ? Ss Oct16 0:00 /usr/sbin/sshd
root 3827 0.0 0.0 104804 7452 ? Ssl Oct16 0:08 /usr/sbin/console-kit-daemon
amavis 3909 0.0 1.1 217948 91004 ? Ss Oct16 0:00 amavisd (master)
nobody 3913 0.0 0.2 118612 21632 ? Sl Oct16 0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid
polw 3932 0.0 0.1 51868 12084 ? Ss Oct16 0:00 policyd-weight (master)
polw 3933 0.0 0.1 51868 11744 ? Ss Oct16 0:00 policyd-weight (cache)
postfw 3937 0.0 0.1 55248 14064 ? Ss Oct16 0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid
postgrey 3940 0.0 0.1 57900 13232 ? Ss Oct16 0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421
clamav 4166 0.0 1.3 215576 112772 ? Ssl Oct16 0:04 /usr/sbin/clamd
clamav 4262 0.0 0.0 25776 1328 ? Ss Oct16 0:00 /usr/bin/freshclam -d --quiet
amavis 4307 0.0 1.1 224072 96732 ? S Oct16 0:02 amavisd (ch14-avail)
amavis 4308 0.0 1.2 228168 100820 ? S Oct16 0:05 amavisd (ch14-avail)
root 4353 0.0 0.0 7224 3944 ? S Oct16 0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
root 4362 0.0 0.0 8864 520 ? Ss Oct16 0:00 /usr/sbin/irqbalance
daemon 4377 0.0 0.0 38488 840 ? Ss Oct16 0:00 lpd Waiting
root 4412 0.0 0.0 0 0 ? S< Oct16 0:00 [lockd]
root 4413 0.0 0.0 0 0 ? S< Oct16 0:10 [nfsd]
root 4414 0.0 0.0 0 0 ? D< Oct16 0:03 [nfsd]
root 4415 0.0 0.0 0 0 ? D< Oct16 0:13 [nfsd]
root 4416 0.0 0.0 0 0 ? S< Oct16 0:07 [nfsd]
root 4417 0.0 0.0 0 0 ? D< Oct16 0:11 [nfsd]
root 4418 0.0 0.0 0 0 ? D< Oct16 0:01 [nfsd]
root 4419 0.0 0.0 0 0 ? D< Oct16 0:04 [nfsd]
root 4420 0.0 0.0 0 0 ? D< Oct16 0:02 [nfsd]
root 4424 0.0 0.0 18812 1208 ? Ss Oct16 0:00 /usr/sbin/rpc.mountd --manage-gids
oident 4432 0.0 0.0 12232 576 ? Ss Oct16 0:00 /usr/sbin/oidentd -m -u oident -g oident
root 4444 0.0 0.0 10124 660 ? Ss Oct16 0:00 /usr/sbin/inetd
root 4521 0.0 0.0 37020 2388 ? Ss Oct16 0:00 /usr/lib/postfix/master
root 4545 0.0 0.0 58084 1796 ? Ds Oct16 0:00 /usr/sbin/nmbd -D
root 4547 0.0 0.0 93724 3044 ? Ss Oct16 0:00 /usr/sbin/smbd -D
root 4568 0.0 0.0 93752 1776 ? S Oct16 0:00 /usr/sbin/smbd -D
root 4700 0.0 0.0 18768 1348 ? S Oct16 0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid
asterisk 4714 1.3 0.2 666972 24460 ? Ssl Oct16 26:36 /usr/sbin/asterisk -p -U asterisk
asterisk 4715 0.0 0.0 13656 880 ? S Oct16 0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
ntp 4734 0.0 0.0 23428 1400 ? Ss Oct16 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g
root 4789 0.0 0.0 16644 780 ? Ss Oct16 0:00 /usr/sbin/dovecot
root 4790 0.0 0.0 74664 3196 ? S Oct16 0:00 dovecot-auth
root 4792 0.0 0.0 74664 3240 ? S Oct16 0:00 dovecot-auth -w
111 4797 0.0 0.0 38084 4616 ? Ss Oct16 0:07 /usr/sbin/hald
root 4798 0.0 0.0 20040 1432 ? S Oct16 0:01 hald-runner
root 4895 0.0 0.0 22028 1240 ? S Oct16 0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3
root 4899 0.0 0.0 22032 1232 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled
root 4900 0.0 0.0 22028 1228 ? S Oct16 0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled
111 4902 0.0 0.0 25920 1204 ? S Oct16 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root 4911 0.0 0.0 12484 656 ? Ss Oct16 0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root 4929 0.0 0.0 24020 584 ? Ss Oct16 0:00 /usr/sbin/squid -D -YC
proxy 4931 0.0 0.0 26900 5136 ? S Oct16 0:00 (squid) -D -YC
root 4942 0.0 0.0 28924 3548 ? Ss Oct16 0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345
root 4943 0.0 0.0 17392 1380 ? S Oct16 0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1
daemon 4945 0.3 0.0 16596 3172 ? S Oct16 7:01 p0f -u daemon -i eth1 -l tcp dst port 25
root 4947 0.0 0.0 93604 5832 ? SNs Oct16 0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph
root 4951 0.0 0.0 9296 3716 ? Ss Oct16 0:00 /usr/sbin/dhcpd3 -q eth0
nut 4964 0.0 0.0 14384 744 ? Ss Oct16 0:15 /lib/nut/usbhid-ups -a apc
nut 4966 0.0 0.0 14316 600 ? Ss Oct16 0:01 /sbin/upsd
root 4968 0.0 0.0 14284 728 ? Ss Oct16 0:00 /sbin/upsmon
nut 4970 0.0 0.0 14284 712 ? S Oct16 0:01 /sbin/upsmon
daemon 5007 0.0 0.0 16356 416 ? Ss Oct16 0:00 /usr/sbin/atd
root 5027 0.0 0.0 20988 1060 ? Ss Oct16 0:00 /usr/sbin/cron
dovecot 5887 0.0 0.0 18504 2108 ? S 15:31 0:00 imap-login
root 6184 0.0 0.0 73496 6380 ? Sl Oct16 0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root 6205 0.0 0.0 26292 720 ? Ss Oct16 0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc
root 6213 0.1 1.3 192020 112852 tty7 Ss+ Oct16 2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6
root 6284 0.0 0.0 5852 600 tty1 Ss+ Oct16 0:00 /sbin/getty 38400 tty1
root 6285 0.0 0.0 5852 596 tty2 Ss+ Oct16 0:00 /sbin/getty 38400 tty2
root 6286 0.0 0.0 5852 596 tty3 Ss+ Oct16 0:00 /sbin/getty 38400 tty3
root 6287 0.0 0.0 5852 600 tty4 Ss+ Oct16 0:00 /sbin/getty 38400 tty4
root 6288 0.0 0.0 5852 596 tty5 Ss+ Oct16 0:00 /sbin/getty 38400 tty5
root 6289 0.0 0.0 5852 600 tty6 Ss+ Oct16 0:00 /sbin/getty 38400 tty6
root 6297 0.0 0.0 55716 1864 ? S Oct16 0:00 -:0
postfix 6362 0.0 0.0 39076 2308 ? S 15:36 0:00 anvil -l -t unix -u -o max_idle=3600s
root 7003 0.0 0.0 80904 3412 ? Ss Oct16 0:00 sshd: ap [priv]
polw 7443 0.0 0.1 52000 12516 ? S Oct16 0:00 policyd-weight (child)
dovecot 8922 0.0 0.0 18504 2108 ? S 16:01 0:00 imap-login
dovecot 8923 0.0 0.0 18504 2112 ? S 16:01 0:00 imap-login
postfix 18407 0.0 0.0 39076 2352 ? D 17:50 0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks
asterisk 18424 0.0 0.0 36996 2260 ? D 17:54 0:00 /usr/sbin/postdrop -r
root 18425 0.0 0.0 46372 1440 ? S 17:55 0:00 /USR/SBIN/CRON
root 18459 0.0 0.0 11404 1328 ? Ss 17:55 0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi
root 18460 0.0 0.0 21888 1432 ? D 17:55 0:00 /usr/lib/hal/hal-acl-tool --reconfigure
root 18465 0.0 0.2 43416 16892 ? S 17:55 0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg
root 18479 0.0 0.0 4140 612 ? S 17:55 0:00 tee -a /var/log/mrtg/mrtg.log
root 18538 0.0 0.0 38476 1976 ? D 17:55 0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0
postfix 18539 0.0 0.1 50740 9836 ? S 17:55 0:00 cleanup -z -t unix -u -c
root 18555 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18556 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18557 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18558 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18559 0.0 0.0 46368 1172 ? S 18:00 0:00 /USR/SBIN/CRON
root 18560 0.0 0.0 46368 1172 ? S 18:02 0:00 /USR/SBIN/CRON
root 18561 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18562 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18563 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18564 0.0 0.0 46368 1172 ? S 18:05 0:00 /USR/SBIN/CRON
root 18565 0.0 0.0 46368 1172 ? S 18:09 0:00 /USR/SBIN/CRON
root 18566 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18567 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18568 0.0 0.0 46368 1172 ? S 18:10 0:00 /USR/SBIN/CRON
root 18578 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18579 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18580 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18581 0.0 0.0 46368 1172 ? S 18:15 0:00 /USR/SBIN/CRON
root 18582 0.0 0.0 46368 1172 ? S 18:17 0:00 /USR/SBIN/CRON
root 18583 0.0 0.0 46368 1172 ? S 18:18 0:00 /USR/SBIN/CRON
postfix 18584 0.0 0.0 71452 8104 ? S 18:18 0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings
postfix 18585 0.0 0.0 39076 2300 ? S 18:18 0:00 proxymap -t unix -u
sshd 18597 0.0 0.0 0 0 ? Z 18:19 0:00 [sshd] <defunct>
root 18601 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 18602 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 18603 0.0 0.0 46368 1172 ? S 18:20 0:00 /USR/SBIN/CRON
root 30866 0.0 0.0 54392 944 ? Ss 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30867 0.0 0.0 54392 652 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30868 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30869 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 30870 0.0 0.0 54392 540 ? S 04:31 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root 31018 0.0 0.0 146612 7856 ? Ss 04:31 0:00 /usr/sbin/apache2 -k start
www-data 31025 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start
www-data 31044 0.0 0.0 146612 5248 ? S 04:31 0:00 /usr/sbin/apache2 -k start
root 31076 0.0 0.0 23128 756 ? Ss 04:31 0:00 pure-ftpd (SERVER)
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-19 10:18 ` Justin Piszcz
@ 2009-10-20 0:33 ` Dave Chinner
-1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-20 0:33 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
> On Mon, 19 Oct 2009, Dave Chinner wrote:
>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>> It has happened again, all sysrq-X output was saved this time.
>> .....
>>
>> All pointing to log IO not completing.
>>
....
> So far I do not have a reproducible test case,
Ok. What sort of load is being placed on the machine?
> the only other thing not posted was the output of ps auxww during
> the time of the lockup, not sure if it will help, but here it is:
>
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
....
> root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
> root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
> root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
> root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
> root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
> root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
> root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
> root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
> root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
> root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
> root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
> root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
.....
It appears that both the xfslogd and the xfsdatad on CPU 0 are in
the running state but don't appear to be consuming any significant
CPU time. If they remain like this then I think that means they are
stuck waiting on the run queue. Do these XFS threads always appear
like this when the hang occurs? If so, is there something else that
is hogging CPU 0 preventing these threads from getting the CPU?
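[One way to check the question above — whether something is hogging CPU 0 — is to list the tasks whose last-run processor is 0; a rough sketch using procps (`psr` is the processor the task last ran on):]

```shell
# Show the header plus every task whose last-run CPU (psr) is 0.
# A runnable (R) task here with a high %CPU would be the suspect hog.
ps -eo pid,psr,stat,pcpu,comm | awk 'NR == 1 || $2 == 0'
```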
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-20 0:33 ` Dave Chinner
@ 2009-10-20 8:33 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-20 8:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Tue, 20 Oct 2009, Dave Chinner wrote:
> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>> It has happened again, all sysrq-X output was saved this time.
>>> .....
>>>
>>> All pointing to log IO not completing.
>>>
> ....
>> So far I do not have a reproducible test case,
>
> Ok. What sort of load is being placed on the machine?
Hello, generally the load is low; it mainly serves out some Samba shares.
>
>> the only other thing not posted was the output of ps auxww during
>> the time of the lockup, not sure if it will help, but here it is:
>>
>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>> root 1 0.0 0.0 10320 684 ? Ss Oct16 0:00 init [2]
> ....
>> root 371 0.0 0.0 0 0 ? R< Oct16 0:01 [xfslogd/0]
>> root 372 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/1]
>> root 373 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/2]
>> root 374 0.0 0.0 0 0 ? S< Oct16 0:00 [xfslogd/3]
>> root 375 0.0 0.0 0 0 ? R< Oct16 0:00 [xfsdatad/0]
>> root 376 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsdatad/1]
>> root 377 0.0 0.0 0 0 ? S< Oct16 0:03 [xfsdatad/2]
>> root 378 0.0 0.0 0 0 ? S< Oct16 0:01 [xfsdatad/3]
>> root 379 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/0]
>> root 380 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/1]
>> root 381 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/2]
>> root 382 0.0 0.0 0 0 ? S< Oct16 0:00 [xfsconvertd/3]
> .....
>
> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
> the running state but don't appear to be consuming any significant
> CPU time. If they remain like this then I think that means they are
> stuck waiting on the run queue. Do these XFS threads always appear
> like this when the hang occurs? If so, is there something else that
> is hogging CPU 0 preventing these threads from getting the CPU?
Yes, the XFS threads show up like this each time the kernel crashed. So far,
with 2.6.30.9, it has not crashed after ~48hrs+, so it appears to be an issue
introduced between 2.6.30.9 and 2.6.31.x. Any recommendations on how to catch
this bug, e.g. with certain debug options enabled?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
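[To catch the hang described above with extra instrumentation, one commonly used option — assuming the kernel is built with CONFIG_DETECT_HUNG_TASK and CONFIG_MAGIC_SYSRQ, both available in 2.6.31 — is the hung-task detector plus a scripted sysrq dump. These commands need root:]

```shell
# Warn (with a stack trace in dmesg) about any task stuck in D-state
# for more than 120 seconds; writing 0 disables the check.
echo 120 > /proc/sys/kernel/hung_task_timeout_secs

# When the lockup hits and a shell still works (e.g. from /dev/shm),
# dump blocked tasks (sysrq-w) without a console, then save the log:
echo w > /proc/sysrq-trigger
dmesg > /tmp/sysrq-w.txt
```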
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-20 8:33 ` Justin Piszcz
@ 2009-10-21 10:19 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-21 10:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
> On Tue, 20 Oct 2009, Dave Chinner wrote:
>
>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>> It has happened again, all sysrq-X output was saved this time.
>>>> .....
>>>>
>>>> All pointing to log IO not completing.
>>>>
>> ....
>>> So far I do not have a reproducible test case,
>>
>> Ok. What sort of load is being placed on the machine?
> Hello, generally the load is low, it mainly serves out some samba shares.
>
>>
>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>> the running state but don't appear to be consuming any significant
>> CPU time. If they remain like this then I think that means they are
>> stuck waiting on the run queue. Do these XFS threads always appear
>> like this when the hang occurs? If so, is there something else that
>> is hogging CPU 0 preventing these threads from getting the CPU?
> Yes, the XFS threads show up like this on each time the kernel crashed. So
> far
> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some
> issue
> between 2.6.30.9 and 2.6.31.x when this began happening. Any recommendations
> on how to catch this bug w/certain options enabled/etc?
>
>
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
>
Uptime with 2.6.30.9:
06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
No issues yet, so the problem first appeared in 2.6.31.x.
Any further recommendations on how to debug this issue? BTW: do you view this
as an XFS bug or an MD/VFS-layer issue, based on the logs/output thus far?
Justin.
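[Since 2.6.30.9 appears good and 2.6.31.x bad, one standard way to narrow a regression like this is a git bisect of the kernel between the two releases. This is a sketch only: each step means building and booting the suggested commit and running the normal workload long enough to trust the verdict, which is painful for a hang that takes 24-48 hours to appear.]

```shell
# In a clone of the mainline kernel tree:
git bisect start
git bisect bad v2.6.31      # first release known to hang
git bisect good v2.6.30     # last release known to survive
# Build and boot the commit git checks out, run the workload, then mark
# the result; repeat until git names the first bad commit.
git bisect good             # or: git bisect bad
git bisect reset            # when finished
```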
^ permalink raw reply [flat|nested] 49+ messages in thread
* mdadm --detail showing annoying device
2009-10-21 10:19 ` Justin Piszcz
@ 2009-10-21 14:17 ` Stephane Bunel
2009-10-21 21:46 ` Neil Brown
-1 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-21 14:17 UTC (permalink / raw)
To: linux-raid
Hi,
I'm a newbie in the mdadm world. I defined some udev rules so that disks are
statically named according to their bus/host/target, i.e. /dev/sda becomes
/dev/raid_disk0. Nothing very special about that; it's just a convenient way
to tie a disk name to its physical location.
#ls -la /dev/raid*
brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0
brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
A RAID1 (/dev/md0) is assembled over these two disks.
When looking for detailed information, mdadm shows an annoying device name in
place of /dev/raid_disk*:
---8x---
#mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Tue Oct 13 12:53:54 2009
Raid Level : raid1
Array Size : 488386496 (465.76 GiB 500.11 GB)
Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Wed Oct 21 15:16:09 2009
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
Events : 0.1526
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/char/21:0
1 8 16 1 active sync /dev/char/21:1
---8x---
Looking at the source code of mdadm, I found that the device-name selection
rule is (too) simple: when there are multiple candidates, select the shortest
name. So, '/dev/char/21:0' being shorter than '/dev/raid_disk0', mdadm
displays '/dev/char...'. It's annoying to see a char device (/dev/char/...)
representing a hard disk (which is, of course, a block device). The
purpose of the following patch is to take the directory part into account and
always prefer the name closest to /dev. In case of equality, the shorter name
wins.
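The proposed rule can be modelled in a few lines of Python (a sketch of the intended behaviour for illustration, not the mdadm code itself; `select_name` is a made-up name):

```python
def select_name(current, registered):
    """Prefer the path with fewer directory levels (closer to /dev);
    on a tie, prefer the shorter basename."""
    def level(path):
        # number of '/' separators, i.e. directory depth of the name
        return path.count("/")

    if level(current) < level(registered):
        return current
    if (level(current) == level(registered) and
            len(current.rsplit("/", 1)[-1]) < len(registered.rsplit("/", 1)[-1])):
        return current
    return registered

print(select_name("/dev/raid_disk0", "/dev/char/21:0"))     # /dev/raid_disk0
print(select_name("/dev/sda1", "/dev/disk/by-label/BOOT"))  # /dev/sda1
```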
---8x---
#./mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Tue Oct 13 12:53:54 2009
Raid Level : raid1
Array Size : 488386496 (465.76 GiB 500.11 GB)
Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Wed Oct 21 15:30:34 2009
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
Events : 0.1526
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/raid_disk0
1 8 16 1 active sync /dev/raid_disk1
---8x---
Feel free to refactor the code. The last time I wrote C code I still had hair
on my head ;-)
---8x---
--- /var/tmp/mdadm_snapshot/util.c 2009-10-20 04:50:23.000000000 +0200
+++ /var/tmp/mdadm_dirlevel/util.c 2009-10-20 13:13:32.000000000 +0200
@@ -507,6 +507,38 @@
#endif /* HAVE_FTW */
#endif /* HAVE_NFTW */
+static unsigned int path_depth(const char *pathname)
+{
+	unsigned int count = 0;
+	const char *p = pathname;
+
+	while ((p = strchr(p, '/')) != NULL) {
+		count++;
+		p++;
+	}
+	return count;
+}
+
+/*
+ * Prefer the name closest to /dev (fewest directory levels);
+ * in case of a tie, the shorter basename wins.
+ */
+char *select_by_directory_level(char *current, char *registered)
+{
+	unsigned int current_level = path_depth(current);
+	unsigned int registered_level = path_depth(registered);
+
+	if (current_level < registered_level)
+		return current;
+
+	if (current_level == registered_level &&
+	    strlen(strrchr(current, '/')) <
+	    strlen(strrchr(registered, '/')))
+		return current;
+
+	return registered;
+}
+
/*
* Find a block device with the right major/minor number.
* If we find multiple names, choose the shortest.
@@ -544,14 +576,18 @@
 		if (p->major == major &&
 		    p->minor == minor) {
 			if (strncmp(p->name, "/dev/md/",8) == 0) {
-				if (preferred == NULL ||
-				    strlen(p->name) < strlen(preferred))
-					preferred = p->name;
-			} else {
-				if (regular == NULL ||
-				    strlen(p->name) < strlen(regular))
-					regular = p->name;
-			}
+				if (preferred == NULL)
+					preferred = p->name;
+				else
+					preferred = select_by_directory_level(
+						p->name, preferred);
+			} else {
+				if (regular == NULL)
+					regular = p->name;
+				else
+					regular = select_by_directory_level(
+						p->name, regular);
+			}
 		}
 		if (!regular && !preferred && !did_check) {
 			devlist_ready = 0;
---8x---
Stéphane Bunel.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: mdadm --detail showing annoying device
2009-10-21 14:17 ` mdadm --detail showing annoying device Stephane Bunel
@ 2009-10-21 21:46 ` Neil Brown
2009-10-22 11:22 ` Stephane Bunel
2009-10-22 11:29 ` Mario 'BitKoenig' Holbe
0 siblings, 2 replies; 49+ messages in thread
From: Neil Brown @ 2009-10-21 21:46 UTC (permalink / raw)
To: Stephane Bunel; +Cc: linux-raid
On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
> Hi,
>
> I'm a newbie in the mdadm world. I defined some udev rules to make disk
> staticly named according to the bus/host/target. I.e. /dev/sda become
> /dev/raid_disk0. So nothing very special with that, it's just a convenient way
> to assign disk name to it's physical location.
>
> #ls -la /dev/raid*
> brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0
> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
>
> A RAID1 (/dev/md0) is assembled over this two disk.
> When looking for detailed information, mdadm show annoying device name in
> place of /dev/raid_disk*:
>
....
> Number Major Minor RaidDevice State
> 0 8 0 0 active sync /dev/char/21:0
> 1 8 16 1 active sync /dev/char/21:1
What is a block device doing in /dev/char??? There should only be
character devices in there.
If these are actually block devices, then I think there is something
wrong with your udev rules.
If these are char devices, then mdadm is doing the wrong thing, but I
cannot see that from the code.
Your proposal of choosing the highest rather than the shortest name
has some merit, but your current situation doesn't seem to justify
it, and I particularly like the simplicity of the current heuristic.
So for now I think I'll leave it as it is and encourage you to fix
your udev rule.
Thanks,
NeilBrown
* Re: mdadm --detail showing annoying device
2009-10-21 21:46 ` Neil Brown
@ 2009-10-22 11:22 ` Stephane Bunel
2009-10-29 3:44 ` Neil Brown
2009-10-22 11:29 ` Mario 'BitKoenig' Holbe
1 sibling, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 11:22 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil Brown wrote:
> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>> Hi,
>>
>> I'm a newbie in the mdadm world. I defined some udev rules to make disk
>> staticly named according to the bus/host/target. I.e. /dev/sda become
>> /dev/raid_disk0. So nothing very special with that, it's just a convenient way
>> to assign disk name to it's physical location.
>>
>> #ls -la /dev/raid*
>> brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0
>> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
>>
>> A RAID1 (/dev/md0) is assembled over this two disk.
>> When looking for detailed information, mdadm show annoying device name in
>> place of /dev/raid_disk*:
>>
> ....
>> Number Major Minor RaidDevice State
>> 0 8 0 0 active sync /dev/char/21:0
>> 1 8 16 1 active sync /dev/char/21:1
>
> What is a block device doing in /dev/char ??? There should only be
> character devices in there.
>
> If these are actually block device, then I think there is something
> wrong with your udev rules.
I don't think my udev rules are at fault, because they just rename /dev/sd* to
/dev/raid_disk*. For udev, /dev/char/21:0 seems to correspond to the generic SCSI
device driver (sg), which is bound to SCSI devices.
> If these are char devices, then mdadm is doing the wrong thing, but I
> cannot see that from the code.
By choosing the shorter name without distinguishing the path (/dev/.../)
from the name (sda), mdadm picks /dev/char/21:0 simply because it is shorter than
/dev/raid_disk0.
> Your proposal of choosing the highest rather than the shortest name
> has some merit, but I your current situation doesn't seem to justify
> it, and I particularly like the simplicity of the current heuristic.
My proposal doesn't choose the highest name; it makes a selection based on
the shortest path to the device name. I.e. my proposal chooses:
/dev/raid_disk0 (1 directory level) over /dev/char/21:0 (2 directory levels)
/dev/raid_disk0 (1 directory level) over /dev/block/8:0 (2 directory levels)
/dev/sda1 (1 directory level) over /dev/disk/by-label/BOOT (3 directory levels)
So in fact, my proposal doesn't change the current situation; it merely "adjusts"
the heuristic so that an annoying device name (a char device) is not shown as a
member of the RAID just because its full name is shorter than a "semantically
better" name.
Adding a printf in map_dev() shows that my proposal helps the heuristic
be more robust in the presence of "advanced" device naming, without changing
current behaviour.
#./mdadm --detail /dev/md0
(/dev/char/21:0)
(/dev/block/8:0)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
(/dev/raid_disk0)
(/dev/char/21:1)
(/dev/block/8:16)
(/dev/raid_disk1)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
/dev/md0:
Version : 0.90
Creation Time : Tue Oct 13 12:53:54 2009
Raid Level : raid1
Array Size : 488386496 (465.76 GiB 500.11 GB)
Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu Oct 22 12:42:44 2009
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
Events : 0.1526
Number Major Minor RaidDevice State
0 8 0 0 active sync(/dev/char/21:0)
(/dev/block/8:0)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
(/dev/raid_disk0)
/dev/raid_disk0
1 8 16 1 active sync(/dev/char/21:1)
(/dev/block/8:16)
(/dev/raid_disk1)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
/dev/raid_disk1
Thanks,
Stéphane Bunel.
* Re: mdadm --detail showing annoying device
2009-10-21 21:46 ` Neil Brown
2009-10-22 11:22 ` Stephane Bunel
@ 2009-10-22 11:29 ` Mario 'BitKoenig' Holbe
2009-10-22 14:17 ` Stephane Bunel
1 sibling, 1 reply; 49+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-10-22 11:29 UTC (permalink / raw)
To: linux-raid
Neil Brown <neilb@suse.de> wrote:
> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>> 0 8 0 0 active sync /dev/char/21:0
>> 1 8 16 1 active sync /dev/char/21:1
> What is a block device doing in /dev/char ??? There should only be
> character devices in there.
Major 21 is usually the SCSI generic devices (/dev/sg*), and those are char devices:
crw-rw---- 1 root root 21, 0 Oct 10 21:48 /dev/sg0
crw-rw---- 1 root root 21, 1 Oct 10 21:48 /dev/sg1
The question is why they appear in the mdadm --detail output at all.
regards
Mario
--
There are trivial truths and the great truths.
The opposite of a trivial truth is plainly false.
The opposite of a great truth is also true.
-- Niels Bohr
* Re: mdadm --detail showing annoying device
2009-10-22 11:29 ` Mario 'BitKoenig' Holbe
@ 2009-10-22 14:17 ` Stephane Bunel
2009-10-22 16:00 ` Stephane Bunel
0 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 14:17 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe; +Cc: linux-raid
Mario 'BitKoenig' Holbe wrote:
> Neil Brown <neilb@suse.de> wrote:
>> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>> 0 8 0 0 active sync /dev/char/21:0
>>> 1 8 16 1 active sync /dev/char/21:1
>> What is a block device doing in /dev/char ??? There should only be
>> character devices in there.
>
> major 21 are usually SCSI generic devices (/dev/sg) and they are char...
> crw-rw---- 1 root root 21, 0 Oct 10 21:48 /dev/sg0
> crw-rw---- 1 root root 21, 1 Oct 10 21:48 /dev/sg1
>
> The question is, why do they appear at mdadm --detail
mdadm walks /dev physically precisely so that it does not follow symbolic links
(cf. nftw(FTW_PHYS) in map_dev()). But by then calling stat(), mdadm follows the
symbolic link after all, and so gets back the type/major/minor of the link's target.
lstat() is identical to stat(), except that if path is a symbolic link,
then the link itself is stat-ed, not the file that it refers to.
#ls -la /dev/char/21:0
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 /dev/char/21:0 -> ../raid_disk0
Tested from Python:
>>> import os, stat
Using stat:
>>> mode = os.stat( '/dev/char/21:0' )[ stat.ST_MODE ]
>>> stat.S_ISBLK( mode )
True
Using lstat():
>>> mode = os.lstat( '/dev/char/21:0' )[ stat.ST_MODE ]
>>> stat.S_ISBLK( mode )
False
Stéphane Bunel.
* Re: mdadm --detail showing annoying device
2009-10-22 14:17 ` Stephane Bunel
@ 2009-10-22 16:00 ` Stephane Bunel
0 siblings, 0 replies; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 16:00 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe; +Cc: linux-raid
Stephane Bunel wrote:
> Mario 'BitKoenig' Holbe wrote:
>> Neil Brown <neilb@suse.de> wrote:
>>> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>>> 0 8 0 0 active sync /dev/char/21:0
>>>> 1 8 16 1 active sync /dev/char/21:1
>>> What is a block device doing in /dev/char ??? There should only be
>>> character devices in there.
>>
>> major 21 are usually SCSI generic devices (/dev/sg) and they are char...
>> crw-rw---- 1 root root 21, 0 Oct 10 21:48 /dev/sg0
>> crw-rw---- 1 root root 21, 1 Oct 10 21:48 /dev/sg1
>>
>> The question is, why do they appear at mdadm --detail
>
> mdadm performs a physical walk to not follow symbolic links (cf nftw(
> FTW_PHYS ) in map_dev() ). But using stat() mdadm finaly follow the
> symbolic link and so returns the same type/major/minor as the targeted
> link.
>
> lstat() is identical to stat(), except that if path is a symbolic
> link, then the link itself is stat-ed, not the file that it refers to.
>
>
> #ls -la /dev/char/21:0
> lrwxrwxrwx 1 root root 13 2009-10-16 18:12 /dev/char/21:0 -> ../raid_disk0
>
> Tested from Python:
> >>> import os, stat
>
> Using stat:
> >>> mode = os.stat( '/dev/char/21:0' )[ stat.ST_MODE ]
> >>> stat.S_ISBLK( mode )
> True
>
> using lstat():
> >>> mode = os.lstat( '/dev/char/21:0' )[ stat.ST_MODE ]
> >>> stat.S_ISBLK( mode )
> False
Just for fun ;-)
--- util.c.orig	2009-10-22 17:54:11.000000000 +0200
+++ util.c	2009-10-22 17:55:09.000000000 +0200
@@ -468,7 +468,7 @@
 	struct stat st;
 
 	if (S_ISLNK(stb->st_mode)) {
-		if (stat(name, &st) != 0)
+		if (lstat(name, &st) != 0)
 			return 0;
 		stb = &st;
 	}
Stéphane Bunel.
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-21 10:19 ` Justin Piszcz
@ 2009-10-22 22:49 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-22 22:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Wed, 21 Oct 2009, Justin Piszcz wrote:
>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>>
>>
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>>
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>>
>>>>> All pointing to log IO not completing.
>>>>>
>>> ....
>>>> So far I do not have a reproducible test case,
>>>
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low, it mainly serves out some samba shares.
>>
>>>
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue. Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this on each time the kernel crashed. So
>> far
>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some
>> issue
>> between 2.6.30.9 and 2.6.31.x when this began happening. Any
>> recommendations
>> on how to catch this bug w/certain options enabled/etc?
>>
>>
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com
>>>
>>
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.(31).(x).
>
> Any further recommendations on how to debug this issue? BTW: Do you view
> this
> as an XFS bug or MD/VFS layer issue based on the logs/output thus far?
>
> Justin.
>
>
Any other ideas?
Currently stuck on 2.6.30.9 (no issues, no lockups); the box normally has
no load at all either. Has anyone else reported similar problems?
Justin.
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-22 22:49 ` Justin Piszcz
@ 2009-10-22 23:00 ` Dave Chinner
-1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-22 23:00 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Thu, Oct 22, 2009 at 06:49:46PM -0400, Justin Piszcz wrote:
> On Wed, 21 Oct 2009, Justin Piszcz wrote:
>> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>>> the running state but don't appear to be consuming any significant
>>>> CPU time. If they remain like this then I think that means they are
>>>> stuck waiting on the run queue. Do these XFS threads always appear
>>>> like this when the hang occurs? If so, is there something else that
>>>> is hogging CPU 0 preventing these threads from getting the CPU?
>>> Yes, the XFS threads show up like this on each time the kernel
>>> crashed. So far
>>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be
>>> some issue
>>> between 2.6.30.9 and 2.6.31.x when this began happening. Any
>>> recommendations
>>> on how to catch this bug w/certain options enabled/etc?
>>
>> Uptime with 2.6.30.9:
>>
>> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
>>
>> No issues yet, so it first started happening in 2.6.(31).(x).
Ok.
>> Any further recommendations on how to debug this issue? BTW: Do
>> you view this as an XFS bug or MD/VFS layer issue based on the
>> logs/output thus far?
Could be either. Nothing so far points at a cause.
> Any other ideas?
If it is relatively quick to reproduce, you could run a git bisect
to try to find the offending commit. Or when it has locked up, run
oprofile with callgraph sampling and so we can get an idea of what
is actually running when XFS appears to hang.
> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has
> no load at all either.. Has anyone else reported similar problems?
Not that I know of.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-22 22:49 ` Justin Piszcz
@ 2009-10-26 11:24 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-26 11:24 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Thu, 22 Oct 2009, Justin Piszcz wrote:
>
> Any other ideas?
>
> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no
> load at all either.. Has anyone else reported similar problems?
>
> Justin.
>
--
Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2
later today and wait another 48 hours.
Justin.
* Re: mdadm --detail showing annoying device
2009-10-22 11:22 ` Stephane Bunel
@ 2009-10-29 3:44 ` Neil Brown
2009-11-03 9:37 ` Stephane Bunel
0 siblings, 1 reply; 49+ messages in thread
From: Neil Brown @ 2009-10-29 3:44 UTC (permalink / raw)
To: Stephane Bunel; +Cc: linux-raid
On Thursday October 22, stephane.bunel@forumdesimages.fr wrote:
> Neil Brown wrote:
> > On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
> >> Hi,
> >>
> >> I'm a newbie in the mdadm world. I defined some udev rules to make disk
> >> staticly named according to the bus/host/target. I.e. /dev/sda become
> >> /dev/raid_disk0. So nothing very special with that, it's just a convenient way
> >> to assign disk name to it's physical location.
> >>
> >> #ls -la /dev/raid*
> >> brw-rw---- 1 root disk 8, 0 2009-10-16 18:12 /dev/raid_disk0
> >> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
> >>
> >> A RAID1 (/dev/md0) is assembled over this two disk.
> >> When looking for detailed information, mdadm show annoying device name in
> >> place of /dev/raid_disk*:
> >>
> > ....
> >> Number Major Minor RaidDevice State
> >> 0 8 0 0 active sync /dev/char/21:0
> >> 1 8 16 1 active sync /dev/char/21:1
> >
> > What is a block device doing in /dev/char ??? There should only be
> > character devices in there.
> >
> > If these are actually block device, then I think there is something
> > wrong with your udev rules.
>
> I think my udev rules are not in cause because they just change /dev/sd* to
> /dev/raid_disk*. For udev /dev/char/21:0 seems correspond to the generic scsi
> device driver (sg) wich is binded to scsi devices.
That doesn't answer the question of why a block device is appearing in
/dev/char/.
My guess (which is quite possibly wrong, but it is the best I can do)
is that whatever change you made to udev.rules to get /dev/sdXX
renamed to /dev/raid_diskXX also renamed the scsi-generic
devices to /dev/raid_diskXX.
I think that would have the effect that you are seeing.
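For illustration only (a guess at the shape of rule that could cause this, not the reporter's actual configuration): a udev rule that matches on the SCSI device path without constraining the subsystem applies to every node derived from that path, including the scsi-generic one.

```
# Too broad: the device path 0:0:0:0 is shared by the block node (sda)
# and the scsi-generic node (sg0), so both get renamed:
KERNELS=="0:0:0:0", NAME="raid_disk0"

# Constrained to the block subsystem, only the disk node is renamed:
SUBSYSTEM=="block", KERNELS=="0:0:0:0", NAME="raid_disk0"
```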
>
> > If these are char devices, then mdadm is doing the wrong thing, but I
> > cannot see that from the code.
>
> mdadm by choosing the shorter name without differentiate path (/dev/.../ )
> and name (sda) choose /dev/char/21:0 just because it is shorter than
> /dev/raid_disk0.
>
> > Your proposal of choosing the highest rather than the shortest name
> > has some merit, but I your current situation doesn't seem to justify
> > it, and I particularly like the simplicity of the current heuristic.
>
> My proposal doesn't choose the highest name but does a selection based on
> the shortest path to the device name. I.e. my proposal choose:
>
> /dev/raid_disk0 (1 directory level) over /dev/char/21:0 (2 directory level)
> /dev/raid_disk0 (1 directory level) over /dev/block/8:0 (2 directory level)
> /dev/sda1 (1 directory level) over /dev/disk/by-label/BOOT (3 directory level)
This is what I meant by 'highest': highest up in the directory tree,
closest to the root. (I guess some people might think that being close
to the root is 'low', not 'high' :-)
>
> So in fact, my proposal doesn't change current situation but "adjust" the
> heuristic to avoid seeing an annoying device name (char device) as a member of
> raid just because it's fullname is shorter than a "semantically better" name.
I appreciate that. I am not against changing the heuristic if I find
a good reason ... particularly if I can find something that actually
measures "semantic goodness"!
But I don't think you have provided a good reason.
/dev/char/21:0 should be a char device, not a block device. So mdadm
should ignore it.
On your system, /dev/char/21:0 is a block device (or a link to a block
device) so there is clearly some sort of configuration error.
If you still cannot find it, maybe you could show us the change you
made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
shed some light on your situation.
NeilBrown
>
> Add a printf in map_dev() show that my proposal seems help the heuristic to
> be more robust in case of "advanced" device naming, without changing current
> things.
>
>
> #./mdadm --detail /dev/md0
> (/dev/char/21:0)
> (/dev/block/8:0)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
> (/dev/raid_disk0)
> (/dev/char/21:1)
> (/dev/block/8:16)
> (/dev/raid_disk1)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
> /dev/md0:
> Version : 0.90
> Creation Time : Tue Oct 13 12:53:54 2009
> Raid Level : raid1
> Array Size : 488386496 (465.76 GiB 500.11 GB)
> Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 0
> Persistence : Superblock is persistent
>
> Update Time : Thu Oct 22 12:42:44 2009
> State : clean
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 0
>
> UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
> Events : 0.1526
>
> Number Major Minor RaidDevice State
> 0 8 0 0 active sync(/dev/char/21:0)
> (/dev/block/8:0)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
> (/dev/raid_disk0)
> /dev/raid_disk0
> 1 8 16 1 active sync(/dev/char/21:1)
> (/dev/block/8:16)
> (/dev/raid_disk1)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
> /dev/raid_disk1
>
>
>
> Thanks,
> Stéphane Bunel.
>
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
2009-10-26 11:24 ` Justin Piszcz
@ 2009-11-02 21:46 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-02 21:46 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Mon, 26 Oct 2009, Justin Piszcz wrote:
>
>
> On Thu, 22 Oct 2009, Justin Piszcz wrote:
>
>>
>> Any other ideas?
>>
>> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no
>> load at all either.. Has anyone else reported similar problems?
>>
>> Justin.
>>
>
> --
>
> Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2
> later today and wait another 48 hours.
>
> Justin.
>
>
Kernel report:
2.6.31-rc1: no crash - uptime 2+ days
2.6.31-rc2: no crash - uptime 2+ days
2.6.31-rc3: no crash - uptime 2+ days
2.6.31-rc4: no crash but network kept dropping out
2.6.31-rc5: cannot test, service owner needs host available
2.6.31-rc6: cannot test, service owner needs host available
2.6.31-rc7: cannot test, service owner needs host available
2.6.31-rc8: cannot test, service owner needs host available
2.6.31-rc9: cannot test, service owner needs host available
2.6.31.x: locks up D-state
So the regression was introduced somewhere between -rc4 and 2.6.31.x.
Justin.
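The version-by-version testing above is effectively a manual bisection. A toy sketch of the idea follows (my own illustration, not anything run in this thread; the `is_bad` predicate is hypothetical, and in practice `git bisect` performs this search over individual commits rather than release candidates):

```python
# Binary search for the first "bad" build in an ordered list of kernel
# builds, assuming the bug, once introduced, persists in all later builds.
def first_bad(builds, is_bad):
    """Return the first build for which is_bad() is True."""
    lo, hi = 0, len(builds) - 1
    # Precondition: the oldest build is good, the newest is bad.
    assert not is_bad(builds[lo]) and is_bad(builds[hi])
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # bug already present: search earlier half
        else:
            lo = mid  # still good: search later half
    return builds[hi]

builds = ["2.6.31-rc4", "2.6.31-rc5", "2.6.31-rc6", "2.6.31-rc7",
          "2.6.31-rc8", "2.6.31-rc9", "2.6.31", "2.6.31.4"]
# Pretend, for the demo, that the hang first appeared in -rc7:
print(first_bad(builds, lambda b: builds.index(b) >= 3))  # -> 2.6.31-rc7
```

Each probe costs one 24-48 hour soak test here, which is why the window was only narrowed to "between -rc4 and 2.6.31.x".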
* Re: mdadm --detail showing annoying device
2009-10-29 3:44 ` Neil Brown
@ 2009-11-03 9:37 ` Stephane Bunel
2009-11-03 10:09 ` Beolach
0 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-11-03 9:37 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil Brown wrote:
(...)
> On your system, /dev/char/21:0 is a block device (or a link to a block
> device) so there is clearly some sort of configuration error.
All files in /dev/char are symlinks (see below). The rules come from Gentoo.
> If you still cannot find it, maybe you could show us the change you
> made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
> shed some light on your situation.
Considering that mdadm only deals with "real" block device files, why not
just skip symlinks?
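The symlink-skipping idea can be sketched like this (a toy illustration with ordinary files standing in for device nodes, not mdadm code):

```python
# Keep only paths that exist and are real files, not symlinks, so that
# names such as /dev/char/21:0 or /dev/block/8:0 would never be reported.
import os
import tempfile

def real_nodes(paths):
    """Filter out symlinks, keeping only real filesystem entries."""
    return [p for p in paths if os.path.exists(p) and not os.path.islink(p)]

# Demo: a regular file stands in for /dev/raid_disk0, a symlink for 8:0.
d = tempfile.mkdtemp()
real = os.path.join(d, "raid_disk0")
open(real, "w").close()
link = os.path.join(d, "8:0")
os.symlink(real, link)
print(real_nodes([real, link]))  # only the real node survives
```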
o Udev rules used to rename /dev/sd[ab]:
#cat 65-persistent-block.rules
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
NAME="raid_disk0"
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
NAME="raid_disk1"
o System is Gentoo.
o Content of /dev/char:
#ls -la /dev/char/
total 0
drwxr-xr-x 2 root root 2880 2009-11-02 16:26 .
drwxr-xr-x 16 root root 3480 2009-11-02 16:26 ..
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 10:1 -> ../psaux
lrwxrwxrwx 1 root root 9 2009-10-16 18:12 10:227 -> ../mcelog
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 10:229 -> ../fuse
lrwxrwxrwx 1 root root 11 2009-10-16 18:12 10:231 -> ../snapshot
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 10:252 -> ../dac960_gam
lrwxrwxrwx 1 root root 11 2009-10-16 18:12 10:58 -> ../megadev0
lrwxrwxrwx 1 root root 6 2009-10-16 18:12 10:59 -> ../tgt
lrwxrwxrwx 1 root root 21 2009-10-16 18:12 10:60 -> ../network_throughput
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 10:61 -> ../network_latency
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 10:62 -> ../cpu_dma_latency
lrwxrwxrwx 1 root root 17 2009-10-16 18:12 10:63 -> ../mapper/control
lrwxrwxrwx 1 root root 6 2009-10-16 18:12 1:1 -> ../mem
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:11 -> ../kmsg
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:2 -> ../kmem
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:3 -> ../null
lrwxrwxrwx 1 root root 12 2009-10-16 18:12 13:0 -> ../input/js0
lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:32 -> ../input/mouse0
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 13:63 -> ../input/mice
lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:64 -> ../input/event0
lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:65 -> ../input/event1
lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:67 -> ../input/event3
lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:68 -> ../input/event4
lrwxrwxrwx 1 root root 15 2009-10-16 18:12 13:69 -> ../input/event5
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:4 -> ../port
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:5 -> ../zero
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 1:7 -> ../full
lrwxrwxrwx 1 root root 9 2009-10-16 18:12 1:8 -> ../random
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:0 -> ../bus/usb/001/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:128 -> ../bus/usb/002/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:129 -> ../bus/usb/002/002
lrwxrwxrwx 1 root root 18 2009-11-02 16:25 189:133 -> ../bus/usb/002/006
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:256 -> ../bus/usb/003/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:384 -> ../bus/usb/004/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:385 -> ../bus/usb/004/002
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:512 -> ../bus/usb/005/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:640 -> ../bus/usb/006/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:768 -> ../bus/usb/007/001
lrwxrwxrwx 1 root root 18 2009-10-16 18:12 189:896 -> ../bus/usb/008/001
lrwxrwxrwx 1 root root 10 2009-10-16 18:12 1:9 -> ../urandom
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 21:0 -> ../raid_disk0
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 21:1 -> ../raid_disk1
lrwxrwxrwx 1 root root 6 2009-11-02 16:26 21:2 -> ../sg2
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 248:0 -> ../rtc0
lrwxrwxrwx 1 root root 10 2009-10-16 18:12 254:0 -> ../hidraw0
lrwxrwxrwx 1 root root 10 2009-10-16 18:12 254:1 -> ../hidraw1
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:0 -> ../tty0
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:1 -> ../tty1
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:10 -> ../tty10
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:11 -> ../tty11
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:12 -> ../tty12
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:13 -> ../tty13
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:14 -> ../tty14
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:15 -> ../tty15
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:16 -> ../tty16
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:17 -> ../tty17
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:18 -> ../tty18
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:19 -> ../tty19
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:2 -> ../tty2
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:20 -> ../tty20
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:21 -> ../tty21
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:22 -> ../tty22
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:23 -> ../tty23
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:24 -> ../tty24
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:25 -> ../tty25
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:26 -> ../tty26
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:27 -> ../tty27
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:28 -> ../tty28
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:29 -> ../tty29
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:3 -> ../tty3
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:30 -> ../tty30
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:31 -> ../tty31
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:32 -> ../tty32
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:33 -> ../tty33
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:34 -> ../tty34
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:35 -> ../tty35
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:36 -> ../tty36
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:37 -> ../tty37
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:38 -> ../tty38
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:39 -> ../tty39
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:4 -> ../tty4
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:40 -> ../tty40
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:41 -> ../tty41
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:42 -> ../tty42
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:43 -> ../tty43
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:44 -> ../tty44
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:45 -> ../tty45
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:46 -> ../tty46
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:47 -> ../tty47
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:48 -> ../tty48
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:49 -> ../tty49
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:5 -> ../tty5
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:50 -> ../tty50
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:51 -> ../tty51
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:52 -> ../tty52
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:53 -> ../tty53
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:54 -> ../tty54
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:55 -> ../tty55
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:56 -> ../tty56
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:57 -> ../tty57
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:58 -> ../tty58
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:59 -> ../tty59
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:6 -> ../tty6
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:60 -> ../tty60
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:61 -> ../tty61
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:62 -> ../tty62
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:63 -> ../tty63
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:64 -> ../ttyS0
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:65 -> ../ttyS1
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:66 -> ../ttyS2
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 4:67 -> ../ttyS3
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:7 -> ../tty7
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:8 -> ../tty8
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 4:9 -> ../tty9
lrwxrwxrwx 1 root root 6 2009-10-16 18:12 5:0 -> ../tty
lrwxrwxrwx 1 root root 10 2009-10-16 18:12 5:1 -> ../console
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 5:2 -> ../ptmx
lrwxrwxrwx 1 root root 6 2009-10-16 18:12 7:0 -> ../vcs
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:1 -> ../vcs1
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:10 -> ../vcs10
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:11 -> ../vcs11
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:12 -> ../vcs12
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:128 -> ../vcsa
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:129 -> ../vcsa1
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:130 -> ../vcsa2
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:131 -> ../vcsa3
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:132 -> ../vcsa4
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:133 -> ../vcsa5
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:134 -> ../vcsa6
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:135 -> ../vcsa7
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:136 -> ../vcsa8
lrwxrwxrwx 1 root root 8 2009-10-16 18:12 7:137 -> ../vcsa9
lrwxrwxrwx 1 root root 9 2009-10-16 18:12 7:138 -> ../vcsa10
lrwxrwxrwx 1 root root 9 2009-10-16 18:12 7:139 -> ../vcsa11
lrwxrwxrwx 1 root root 9 2009-10-16 18:12 7:140 -> ../vcsa12
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:2 -> ../vcs2
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:3 -> ../vcs3
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:4 -> ../vcs4
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:5 -> ../vcs5
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:6 -> ../vcs6
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:7 -> ../vcs7
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:8 -> ../vcs8
lrwxrwxrwx 1 root root 7 2009-10-16 18:12 7:9 -> ../vcs9
Stéphane Bunel.
* Re: mdadm --detail showing annoying device
2009-11-03 9:37 ` Stephane Bunel
@ 2009-11-03 10:09 ` Beolach
2009-11-03 12:16 ` Stephane Bunel
0 siblings, 1 reply; 49+ messages in thread
From: Beolach @ 2009-11-03 10:09 UTC (permalink / raw)
To: Stephane Bunel; +Cc: Neil Brown, linux-raid
On Tue, Nov 3, 2009 at 02:37, Stephane Bunel
<stephane.bunel@forumdesimages.fr> wrote:
> Neil Brown wrote:
> (...)
>
>> On your system, /dev/char/21:0 is a block device (or a link to a block
>> device) so there is clearly some sort of configuration error.
>
> All files in /dev/char are symlinks (see below). The rules come from
> Gentoo.
>
>> If you still cannot find it, maybe you could show us the change you
>> made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
>> shed some light on your situation.
>
> Considering that mdadm only deals with "real" block device files, why not
> just skip symlinks?
>
> o Udev rules used to rename /dev/sd[ab]:
>
> #cat 65-persistent-block.rules
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
> NAME="raid_disk0"
>
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
> NAME="raid_disk1"
>
Try prepending SUBSYSTEM=="block" to those, so they'll only match
the (block) sd* devices, and not the (char) sg? devices:
SUBSYSTEM=="block",
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
NAME="raid_disk0"
SUBSYSTEM=="block",
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
NAME="raid_disk1"
Good luck,
Conway S. Smith
* Re: mdadm --detail showing annoying device
2009-11-03 10:09 ` Beolach
@ 2009-11-03 12:16 ` Stephane Bunel
0 siblings, 0 replies; 49+ messages in thread
From: Stephane Bunel @ 2009-11-03 12:16 UTC (permalink / raw)
To: Beolach; +Cc: Neil Brown, linux-raid
Beolach wrote:
> On Tue, Nov 3, 2009 at 02:37, Stephane Bunel
> <stephane.bunel@forumdesimages.fr> wrote:
>> Neil Brown wrote:
>> (...)
>>
>>> On your system, /dev/char/21:0 is a block device (or a link to a block
>>> device) so there is clearly some sort of configuration error.
>> All files in /dev/char are symlinks (see below). The rules come from
>> Gentoo.
>>
>>> If you still cannot find it, maybe you could show us the change you
>>> made to udev.rules, and an 'ls -l' of '/dev/char'. That might help
>>> shed some light on your situation.
>> Considering that mdadm only deals with "real" block device files, why not
>> just skip symlinks?
>>
>> o Udev rules used to rename /dev/sd[ab]:
>>
>> #cat 65-persistent-block.rules
>> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
>> NAME="raid_disk0"
>>
>> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
>> NAME="raid_disk1"
>>
>
> Try prepending SUBSYSTEM=="block" to those, so they'll only match
> the (block) sd* devices, and not the (char) sg? devices:
> SUBSYSTEM=="block",
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
> NAME="raid_disk0"
> SUBSYSTEM=="block",
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
> NAME="raid_disk1"
Well done!! This corrects the sg* link issue.
#mdadm --detail /dev/md0
(...)
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/block/8:0
1 8 16 1 active sync /dev/block/8:16
(...)
But we loop back to the original problem: mdadm shows '/dev/block/8:0' instead
of '/dev/raid_disk0' (as I wish), because the symlink '/dev/block/8:0' is
shorter than the real block device name '/dev/raid_disk0'.
The current heuristic prevents a renamed real block device from being reported
as desired, under penalty of seeing symlinks from /dev/block/*.
Maybe mdadm could change this behavior by:
o always preferring the real block file over symlinks;
o simply skipping symlinks;
o changing the heuristic as proposed by my patch (prefer the name that is
closest to /dev).
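A sketch of the proposed heuristic (my own illustration, not the patch itself; "closest to /dev" is taken here as fewest path components, with ties broken by length):

```python
# Among candidate names for the same device, prefer the one with the
# fewest path components, then the shortest string.
def preferred(names):
    return min(names, key=lambda n: (n.count("/"), len(n)))

candidates = ["/dev/block/8:0",
              "/dev/raid_disk0",
              "/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA"]
# /dev/raid_disk0 has only two components, so it beats the shorter
# but more deeply nested /dev/block/8:0.
print(preferred(candidates))  # -> /dev/raid_disk0
```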
Stéphane Bunel.
* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
2009-10-21 10:19 ` Justin Piszcz
@ 2009-11-20 20:39 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-20 20:39 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-kernel, linux-raid, xfs, Alan Piszcz, asterisk-users, submit
Package: asterisk
Version: 1.6.2.0~dfsg~rc1-1
See below for issue:
On Wed, 21 Oct 2009, Justin Piszcz wrote:
>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>>
>>
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>>
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>>
>>>>> All pointing to log IO not completing.
>>>>>
>>> ....
>>>> So far I do not have a reproducible test case,
>>>
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low, it mainly serves out some samba shares.
>>
>>>
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue. Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this on each time the kernel crashed. So
>> far
>> with 2.6.30.9 after ~48hrs+ it has not crashed. So it appears to be some
>> issue
>> between 2.6.30.9 and 2.6.31.x when this began happening. Any
>> recommendations
>> on how to catch this bug w/certain options enabled/etc?
>>
>>
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com
>>>
>>
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users, load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.(31).(x).
>
> Any further recommendations on how to debug this issue? BTW: Do you view
> this
> as an XFS bug or MD/VFS layer issue based on the logs/output thus far?
>
> Justin.
>
>
Found the root cause -- it is the asterisk PBX software. I use an SPA3102.
When someone called me, they accidentally dropped the connection, and I
called them back a short time later. It was during this window (and the
previous time this happened) that the box froze, under multiple(!) kernels,
always while someone was calling.
I have removed asterisk but this is the version I was running:
~$ dpkg -l | grep -i asterisk
rc asterisk 1:1.6.2.0~dfsg~rc1-1 Open S
I don't know what asterisk was doing, but top did run before the crash:
asterisk was using 100% CPU and, as I noted before, all other processes
were in D-state.
When this bug occurs, it freezes I/O to all devices, and the only way to
recover is to reboot the system.
Just an FYI in case anyone else out there sees their system crash when
running asterisk.
Just out of curiosity, has anyone else running asterisk had such an issue?
I was not running any special VoIP PCI cards/etc.
Justin.
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
2009-11-20 20:39 ` Justin Piszcz
(?)
@ 2009-11-20 23:44 ` Faidon Liambotis
-1 siblings, 0 replies; 49+ messages in thread
From: Faidon Liambotis @ 2009-11-20 23:44 UTC (permalink / raw)
To: Justin Piszcz, 557262
Cc: linux-raid, Dave Chinner, linux-kernel, xfs, submit,
asterisk-users, Alan Piszcz
Justin Piszcz wrote:
> Found root cause-- root cause is asterisk PBX software. I use an
SPA3102.
> When someone called me, they accidentally dropped the connection, I called
> them back in a short period. It is during this time (and the last time)
> this happened that the box froze under multiple(!) kernels, always when
> someone was calling.
<snip>
> I don't know what asterisk is doing but top did run before the crash
> and asterisk was using 100% CPU and as I noted before all other processes
> were in D-state.
>
> When this bug occurs, it freezes I/O to all devices and the only way to
> recover
> is to reboot the system.
That's obviously *not* the root cause.
It's not normal for an application that isn't even privileged to hang
all I/O and, subsequently, everything on the system.
This is almost certainly a kernel issue; asterisk just does something
that triggers this bug.
Regards,
Faidon
_______________________________________________
-- Bandwidth and Colocation Provided by http://www.api-digital.com --
asterisk-users mailing list
To UNSUBSCRIBE or update options visit:
http://lists.digium.com/mailman/listinfo/asterisk-users
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
2009-11-20 23:44 ` Faidon Liambotis
@ 2009-11-20 23:51 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-20 23:51 UTC (permalink / raw)
To: Faidon Liambotis
Cc: 557262, Dave Chinner, submit, linux-kernel, xfs, linux-raid,
asterisk-users, Alan Piszcz
On Sat, 21 Nov 2009, Faidon Liambotis wrote:
> Justin Piszcz wrote:
> > Found root cause-- root cause is asterisk PBX software. I use an
> SPA3102.
>> When someone called me, they accidentally dropped the connection, I called
>> them back in a short period. It is during this time (and the last time)
>> this happened that the box froze under multiple(!) kernels, always when
>> someone was calling.
> <snip>
>> I don't know what asterisk is doing but top did run before the crash
>> and asterisk was using 100% CPU and as I noted before all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices and the only way to
>> recover
>> is to reboot the system.
> That's obviously *not* the root cause.
>
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently, everything on the system.
>
> This is almost certainly a kernel issue; asterisk just does something
> that triggers this bug.
>
> Regards,
> Faidon
>
It is possible, although I tried with several kernels (2.6.30.[0-9] and
2.6.31+). I never had a crash with earlier versions; I installed asterisk
long ago, but it was always 1.4.x until recently.. Nasty bug :\
Justin.
* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
2009-11-20 23:44 ` Faidon Liambotis
@ 2009-11-21 14:29 ` Roger Heflin
-1 siblings, 0 replies; 49+ messages in thread
From: Roger Heflin @ 2009-11-21 14:29 UTC (permalink / raw)
To: Faidon Liambotis
Cc: Justin Piszcz, 557262, Dave Chinner, submit, linux-kernel, xfs,
linux-raid, asterisk-users, Alan Piszcz
Faidon Liambotis wrote:
> Justin Piszcz wrote:
> > Found root cause-- root cause is asterisk PBX software. I use an
> SPA3102.
>> When someone called me, they accidentally dropped the connection, I called
>> them back in a short period. It is during this time (and the last time)
>> this happened that the box froze under multiple(!) kernels, always when
>> someone was calling.
> <snip>
>> I don't know what asterisk is doing but top did run before the crash
>> and asterisk was using 100% CPU and as I noted before all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices and the only way to
>> recover
>> is to reboot the system.
> That's obviously *not* the root cause.
>
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently, everything on a system.
>
> This is almost certainly a kernel issue and asterisk just does something
> that triggers this bug.
>
> Regards,
> Faidon
I had an application in 2.6.5 (SLES9) that would hang XFS.
The underlying application was multi-threaded and both threads were
doing full disk syncs every so often, and sometimes when doing the
full disk sync the XFS subsystem would deadlock. It appeared to me that
one sync had a lock and was waiting for another, and the other process
had the second lock and was waiting for the first. We were able to
disable the full disk sync from the application and the deadlock went
away. All non-XFS filesystems still worked and could still be
accessed. I did report the bug with some traces, but I don't believe
anyone ever determined where the underlying issue was.
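[Editor's note: the two-sync hang described above is the classic ABBA lock-ordering
deadlock. A minimal user-space sketch of the pattern, purely illustrative -- the
locks in the reported bug were XFS/kernel internals, not Python locks:]

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
barrier = threading.Barrier(2)

def sync_one():
    with lock_a:
        barrier.wait()      # both threads now hold their first lock
        with lock_b:        # blocks forever: the other thread holds lock_b
            pass

def sync_two():
    with lock_b:
        barrier.wait()
        with lock_a:        # blocks forever: the other thread holds lock_a
            pass

t1 = threading.Thread(target=sync_one, daemon=True)
t2 = threading.Thread(target=sync_two, daemon=True)
t1.start(); t2.start()
t1.join(timeout=1)          # never completes; both threads are stuck
deadlocked = t1.is_alive() and t2.is_alive()
print(deadlocked)  # True: classic ABBA deadlock
```

The fix Roger describes (removing one of the competing syncs) breaks the cycle,
which is why the deadlock went away.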
^ permalink raw reply [flat|nested] 49+ messages in thread
* Which kernel options should be enabled to find the root cause of this bug?
2009-10-17 22:34 ` Justin Piszcz
@ 2009-11-24 13:08 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-24 13:08 UTC (permalink / raw)
To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz
On Sat, 17 Oct 2009, Justin Piszcz wrote:
> Hello,
>
> I have a system I recently upgraded from 2.6.30.x and after approximately
> 24-48 hours--sometimes longer, the system cannot write any more files to disk
> (luckily though I can still write to /dev/shm) -- to which I have
> saved the sysrq-t and sysrq-w output:
>
> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Configuration:
>
> $ cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md1 : active raid1 sdb2[1] sda2[0]
> 136448 blocks [2/2] [UU]
>
> md2 : active raid1 sdb3[1] sda3[0]
> 129596288 blocks [2/2] [UU]
>
> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1]
> sdc1[0]
> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>
> md0 : active raid1 sdb1[1] sda1[0]
> 16787776 blocks [2/2] [UU]
>
> $ mount
> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> udev on /dev type tmpfs (rw,mode=0755)
> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> /dev/md1 on /boot type ext3 (rw,noatime)
> /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Distribution: Debian Testing
> Arch: x86_64
>
> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
> persists.
>
> Here is a snippet of two processes in D-state, the first was not doing
> anything, the second was mrtg.
>
> [121444.684000] pickup D 0000000000000003 0 18407 4521
> 0x00000000
> [121444.684000] ffff880231dd2290 0000000000000086 0000000000000000
> 0000000000000000
> [121444.684000] 000000000000ff40 000000000000c8c8 ffff880176794d10
> ffff880176794f90
> [121444.684000] 000000032266dd08 ffff8801407a87f0 ffff8800280878d8
> ffff880176794f90
> [121444.684000] Call Trace:
> [121444.684000] [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
> [121444.684000] [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
> [121444.684000] [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
> [121444.684000] [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
> [121444.684000] [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
> [121444.684000] [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
> [121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> [121444.684000] rateup D 0000000000000000 0 18538 18465
> 0x00000000
> [121444.684000] ffff88023f8a8c10 0000000000000082 0000000000000000
> ffff88023ea09ec8
> [121444.684000] 000000000000ff40 000000000000c8c8 ffff88023faace50
> ffff88023faad0d0
> [121444.684000] 0000000300003e00 000000010720cc78 0000000000003e00
> ffff88023faad0d0
> [121444.684000] Call Trace:
> [121444.684000] [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
> [121444.684000] [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
> [121444.684000] [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
> [121444.684000] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [121444.684000] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [121444.684000] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [121444.684000] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
> [121444.684000] [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
> [121444.684000] [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
> [121444.684000] [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
> [121444.684000] [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
> [121444.684000] [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
> [121444.684000] [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
> [121444.684000] [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
> [121444.684000] [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
> [121444.684000] [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
> [121444.684000] [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
> [121444.684000] [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
> [121444.684000] [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
> [121444.684000] [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
> [121444.684000] [<ffffffff8108b21a>] ? __writepage+0xa/0x40
> [121444.684000] [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
> [121444.684000] [<ffffffff8108b210>] ? __writepage+0x0/0x40
> [121444.684000] [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
> [121444.684000] [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
> [121444.684000] [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
> [121444.684000] [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
> [121444.684000] [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
> [121444.684000] [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
> [121444.684000] [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
> [121444.684000] [<ffffffff810af556>] ? filp_close+0x56/0x90
> [121444.684000] [<ffffffff810af636>] ? sys_close+0xa6/0x100
> [121444.684000] [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> Anyone know what is going on here?
>
> Justin.
>
In addition to using netconsole, which kernel options should be enabled
to better diagnose this issue?
Should I enable these to help track down this bug?
[ ] XFS Debugging support (EXPERIMENTAL)
[ ] Compile the kernel with frame pointers
Are there any other options that will help determine the root cause of this
bug that are recommended?
Justin.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Which kernel options should be enabled to find the root cause of this bug?
2009-11-24 13:08 ` Justin Piszcz
@ 2009-11-24 15:14 ` Eric Sandeen
-1 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-11-24 15:14 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs
Justin Piszcz wrote:
>
>
> On Sat, 17 Oct 2009, Justin Piszcz wrote:
>
>> Hello,
>>
>> I have a system I recently upgraded from 2.6.30.x and after
>> approximately 24-48 hours--sometimes longer, the system cannot write
>> any more files to disk (luckily though I can still write to /dev/shm)
>> -- to which I have
>> saved the sysrq-t and sysrq-w output:
>>
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
Unfortunately it looks like a lot of the sysrq-t, at least, was lost.
The sysrq-w trace has the "show blocked state" start a ways down the file,
for anyone playing along at home ;)
Another thing you might try is a sysrq-m to get memory state...
>> Configuration:
>>
>> $ cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md1 : active raid1 sdb2[1] sda2[0]
>> 136448 blocks [2/2] [UU]
>>
>> md2 : active raid1 sdb3[1] sda3[0]
>> 129596288 blocks [2/2] [UU]
>>
>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2]
>> sdd1[1] sdc1[0]
>> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>
>> md0 : active raid1 sdb1[1] sda1[0]
>> 16787776 blocks [2/2] [UU]
>>
>> $ mount
>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>> udev on /dev type tmpfs (rw,mode=0755)
>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>> /dev/md1 on /boot type ext3 (rw,noatime)
>> /dev/md3 on /r/1 type xfs
>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>> nfsd on /proc/fs/nfsd type nfsd (rw)
Do you get the same behavior if you don't add the log options at mount time?
Kind of grasping at straws here for now ...
>> Distribution: Debian Testing
>> Arch: x86_64
>>
>> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
>> persists.
>>
...
> In addition to using netconsole, which kernel options should be enabled
> to better diagnose this issue?
>
> Should I enable these to help track down this bug?
>
> [ ] XFS Debugging support (EXPERIMENTAL)
> [ ] Compile the kernel with frame pointers
The former probably won't hurt; the latter might give us better backtraces.
> Are there any other options that will help determine the root cause of this
> bug that are recommended?
Not that I can think of off hand ...
-Eric
> Justin.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Which kernel options should be enabled to find the root cause of this bug?
2009-11-24 15:14 ` Eric Sandeen
@ 2009-11-24 16:20 ` Justin Piszcz
-1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-24 16:20 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs
On Tue, 24 Nov 2009, Eric Sandeen wrote:
> Justin Piszcz wrote:
>>
>>
>> On Sat, 17 Oct 2009, Justin Piszcz wrote:
>>
>>> Hello,
>>>
>>> I have a system I recently upgraded from 2.6.30.x and after
>>> approximately 24-48 hours--sometimes longer, the system cannot write
>>> any more files to disk (luckily though I can still write to /dev/shm)
>>> -- to which I have
>>> saved the sysrq-t and sysrq-w output:
>>>
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Unfortunately it looks like a lot of the sysrq-t, at least, was lost.
Yes, when this occurred the first few times I could only grab what's in dmesg
to the ramdisk; trying to access any filesystem other than the ramdisk
(tmpfs, /dev/shm) would cause the process to lock up.
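[Editor's note: processes stuck this way show up with state "D" in /proc, so they
can be enumerated without sysrq while the box is still half-alive. A hedged sketch
(Linux-only; `d_state_pids` is an illustrative helper, not an existing tool):]

```python
import glob
import re

def d_state_pids():
    """Return (pid, comm) pairs for processes in uninterruptible sleep (D)."""
    stuck = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                data = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces
        # or parentheses, so match greedily up to the last ')' before the state.
        m = re.search(r"\((.*)\) (\S)", data)
        if m and m.group(2) == "D":
            stuck.append((int(path.split("/")[2]), m.group(1)))
    return stuck

print(d_state_pids())  # usually [] on a healthy system
```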
>
> The sysrq-w trace has the "show blocked state" start a ways down the file,
> for anyone playing along at home ;)
>
> Other things you might try are a sysrq-m to get memory state...
I actually performed most of the useful sysrq commands; please see
the following:
wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
>
>>> Configuration:
>>>
>>> $ cat /proc/mdstat
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md1 : active raid1 sdb2[1] sda2[0]
>>> 136448 blocks [2/2] [UU]
>>>
>>> md2 : active raid1 sdb3[1] sda3[0]
>>> 129596288 blocks [2/2] [UU]
>>>
>>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2]
>>> sdd1[1] sdc1[0]
>>> 5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>>
>>> md0 : active raid1 sdb1[1] sda1[0]
>>> 16787776 blocks [2/2] [UU]
>>>
>>> $ mount
>>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>> udev on /dev type tmpfs (rw,mode=0755)
>>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>>> /dev/md1 on /boot type ext3 (rw,noatime)
>>> /dev/md3 on /r/1 type xfs
>>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>>> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Do you get the same behavior if you don't add the log options at mount time?
I have not tried disabling the log options, although they have been in effect
for a long time (the logbufs and logbsize, and more recently the nobarrier
option). Could there be an issue using -o nobarrier on raid1+XFS?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Which kernel options should be enabled to find the root cause of this bug?
2009-11-24 16:20 ` Justin Piszcz
@ 2009-11-24 16:23 ` Eric Sandeen
-1 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-11-24 16:23 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs
Justin Piszcz wrote:
>
> On Tue, 24 Nov 2009, Eric Sandeen wrote:
...
>> Do you get the same behavior if you don't add the log options at mount time?
> I have not tried disabling the log options, although they have been in effect
> for a long time, (the logsbufs and bufsize and recently) the nobarrier
> support. Could there be an issue using -o nobarrier on a raid1+xfs?
nobarrier should not cause problems.
-Eric
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2009-11-24 16:23 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-17 22:34 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
2009-10-17 22:34 ` Justin Piszcz
2009-10-18 20:17 ` Justin Piszcz
2009-10-18 20:17 ` Justin Piszcz
2009-10-19 3:04 ` Dave Chinner
2009-10-19 3:04 ` Dave Chinner
2009-10-19 10:18 ` Justin Piszcz
2009-10-19 10:18 ` Justin Piszcz
2009-10-20 0:33 ` Dave Chinner
2009-10-20 0:33 ` Dave Chinner
2009-10-20 8:33 ` Justin Piszcz
2009-10-20 8:33 ` Justin Piszcz
2009-10-21 10:19 ` Justin Piszcz
2009-10-21 10:19 ` Justin Piszcz
2009-10-21 14:17 ` mdadm --detail showing annoying device Stephane Bunel
2009-10-21 21:46 ` Neil Brown
2009-10-22 11:22 ` Stephane Bunel
2009-10-29 3:44 ` Neil Brown
2009-11-03 9:37 ` Stephane Bunel
2009-11-03 10:09 ` Beolach
2009-11-03 12:16 ` Stephane Bunel
2009-10-22 11:29 ` Mario 'BitKoenig' Holbe
2009-10-22 14:17 ` Stephane Bunel
2009-10-22 16:00 ` Stephane Bunel
2009-10-22 22:49 ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
2009-10-22 22:49 ` Justin Piszcz
2009-10-22 23:00 ` Dave Chinner
2009-10-22 23:00 ` Dave Chinner
2009-10-26 11:24 ` Justin Piszcz
2009-10-26 11:24 ` Justin Piszcz
2009-11-02 21:46 ` Justin Piszcz
2009-11-02 21:46 ` Justin Piszcz
2009-11-20 20:39 ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz
2009-11-20 20:39 ` Justin Piszcz
2009-11-20 23:44 ` Bug#557262: " Faidon Liambotis
2009-11-20 23:44 ` Faidon Liambotis
2009-11-20 23:44 ` Faidon Liambotis
2009-11-20 23:51 ` Justin Piszcz
2009-11-20 23:51 ` Justin Piszcz
2009-11-21 14:29 ` Roger Heflin
2009-11-21 14:29 ` Roger Heflin
2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? Justin Piszcz
2009-11-24 13:08 ` Justin Piszcz
2009-11-24 15:14 ` Eric Sandeen
2009-11-24 15:14 ` Eric Sandeen
2009-11-24 16:20 ` Justin Piszcz
2009-11-24 16:20 ` Justin Piszcz
2009-11-24 16:23 ` Eric Sandeen
2009-11-24 16:23 ` Eric Sandeen