* 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
@ 2009-10-17 22:34 ` Justin Piszcz
  0 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-17 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz

Hello,

I have a system that I recently upgraded from 2.6.30.x. After approximately 
24-48 hours (sometimes longer), the system cannot write any more files to 
disk (luckily, I can still write to /dev/shm), which allowed me to save
the sysrq-t and sysrq-w output:

http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
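
For reference, dumps like these are typically captured through the magic
SysRq interface; the sketch below shows the usual method (an assumption
about technique, not the exact commands run on this machine):

```shell
# Requires root and a kernel built with CONFIG_MAGIC_SYSRQ=y. Guarded so
# it is a no-op where SysRq is unavailable (e.g. inside containers).
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq      # enable all SysRq functions
    echo w > /proc/sysrq-trigger         # log traces of blocked (D-state) tasks
    echo t > /proc/sysrq-trigger         # log traces of all tasks
    dmesg > /dev/shm/sysrq-capture.txt   # /dev/shm stays writable when disk I/O hangs
fi
```

Saving the result under /dev/shm matters here precisely because the hung
filesystems cannot accept writes.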

Configuration:

$ cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md1 : active raid1 sdb2[1] sda2[0]
       136448 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
       129596288 blocks [2/2] [UU]

md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
       5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]

md0 : active raid1 sdb1[1] sda1[0]
       16787776 blocks [2/2] [UU]

$ mount
/dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw,noatime)
/dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

Distribution: Debian Testing
Arch: x86_64

The problem occurs with 2.6.31; I upgraded to 2.6.31.4, but the problem
persists.

Here is a snippet of two processes in D-state: the first was not doing 
anything; the second was mrtg.

[121444.684000] pickup        D 0000000000000003     0 18407   4521 0x00000000
[121444.684000]  ffff880231dd2290 0000000000000086 0000000000000000 0000000000000000
[121444.684000]  000000000000ff40 000000000000c8c8 ffff880176794d10 ffff880176794f90
[121444.684000]  000000032266dd08 ffff8801407a87f0 ffff8800280878d8 ffff880176794f90
[121444.684000] Call Trace:
[121444.684000]  [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
[121444.684000]  [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
[121444.684000]  [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
[121444.684000]  [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
[121444.684000]  [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
[121444.684000]  [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
[121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

[121444.684000] rateup        D 0000000000000000     0 18538  18465 0x00000000
[121444.684000]  ffff88023f8a8c10 0000000000000082 0000000000000000 ffff88023ea09ec8
[121444.684000]  000000000000ff40 000000000000c8c8 ffff88023faace50 ffff88023faad0d0
[121444.684000]  0000000300003e00 000000010720cc78 0000000000003e00 ffff88023faad0d0
[121444.684000] Call Trace:
[121444.684000]  [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
[121444.684000]  [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
[121444.684000]  [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
[121444.684000]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[121444.684000]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[121444.684000]  [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
[121444.684000]  [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
[121444.684000]  [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
[121444.684000]  [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
[121444.684000]  [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
[121444.684000]  [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
[121444.684000]  [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
[121444.684000]  [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
[121444.684000]  [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
[121444.684000]  [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
[121444.684000]  [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
[121444.684000]  [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
[121444.684000]  [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
[121444.684000]  [<ffffffff8108b21a>] ? __writepage+0xa/0x40
[121444.684000]  [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
[121444.684000]  [<ffffffff8108b210>] ? __writepage+0x0/0x40
[121444.684000]  [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
[121444.684000]  [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
[121444.684000]  [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
[121444.684000]  [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
[121444.684000]  [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
[121444.684000]  [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
[121444.684000]  [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
[121444.684000]  [<ffffffff810af556>] ? filp_close+0x56/0x90
[121444.684000]  [<ffffffff810af636>] ? sys_close+0xa6/0x100
[121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

Anyone know what is going on here?

Justin.

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-17 22:34 ` Justin Piszcz
@ 2009-10-18 20:17   ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-18 20:17 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz



On Sat, 17 Oct 2009, Justin Piszcz wrote:

> Hello,

It has happened again; this time, all sysrq-X output was saved.

wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt

Kernel configuration:

wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.30.9.txt
wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.31.4.txt

Diff of the two configs:

$ diff config-2.6.30.9.txt config-2.6.31.4.txt |grep -v "#"|grep "_"
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_CONSTRUCTORS=y
> CONFIG_HAVE_PERF_COUNTERS=y
> CONFIG_HAVE_DMA_ATTRS=y
> CONFIG_BLK_DEV_BSG=y
> CONFIG_X86_NEW_MCE=y
> CONFIG_X86_THERMAL_VECTOR=y
< CONFIG_UNEVICTABLE_LRU=y
< CONFIG_PHYSICAL_START=0x200000
> CONFIG_PHYSICAL_START=0x1000000
< CONFIG_PHYSICAL_ALIGN=0x200000
> CONFIG_PHYSICAL_ALIGN=0x1000000
< CONFIG_COMPAT_NET_DEV_OPS=y
< CONFIG_SND_JACK=y
> CONFIG_HID_DRAGONRISE=y
> CONFIG_HID_GREENASIA=y
> CONFIG_HID_SMARTJOYPLUS=y
> CONFIG_HID_THRUSTMASTER=y
> CONFIG_HID_ZEROPLUS=y
> CONFIG_FSNOTIFY=y
> CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
> CONFIG_HAVE_ARCH_KMEMCHECK=y
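
The same comparison can be reproduced with a self-contained sketch (sample
lines stand in for the real config files; recent kernel source trees also
ship scripts/diffconfig, which prints old and new values of each changed
option):

```shell
# '<' marks options present only in the old config, '>' only in the new
# one; the greps drop comment and context lines, as in the command above.
cat > /tmp/config-old.txt <<'EOF'
CONFIG_PHYSICAL_START=0x200000
CONFIG_SND_JACK=y
EOF
cat > /tmp/config-new.txt <<'EOF'
CONFIG_PHYSICAL_START=0x1000000
CONFIG_FSNOTIFY=y
EOF
diff /tmp/config-old.txt /tmp/config-new.txt | grep -v "#" | grep "_"
```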

I have reverted to 2.6.30.9 to see whether the problem recurs with that
kernel version.

I do not recall seeing this on the older 2.6.30.x kernels:

[    9.276427] md3: detected capacity change from 0 to 5251073572864
[    9.277411] md2: detected capacity change from 0 to 132706598912
[    9.278305] md1: detected capacity change from 0 to 139722752
[    9.278921] md0: detected capacity change from 0 to 17190682624

Again, some more D-state processes:

[76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
[76325.608087] Call Trace:
[76325.608095]  [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608099]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608103]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[76325.608106]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40

[76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
[76325.608214] Call Trace:
[76325.608216]  [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
[76325.608220]  [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
[76325.608222]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
[76325.608225]  [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
[76325.608228]  [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
[76325.608231]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
[76325.608233]  [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
[76325.608236]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0

All of the D-state processes:

$ cat sysrq-w.txt  |grep ' D'
[76307.285125] alpine        D 0000000000000000     0  7659  29120 0x00000000
[76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
[76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
[76325.608257] syslogd       D 0000000000000002     0  2438      1 0x00000000
[76325.608318] freshclam     D 0000000000000000     0  2877      1 0x00000000
[76325.608428] asterisk      D 0000000000000001     0  3278      1 0x00000000
[76325.608492] console-kit-d D 0000000000000000     0  3299      1 0x00000000
[76325.608562] dhcpd3        D 0000000000000000     0  3554      1 0x00000000
[76325.608621] plasma-deskto D 0000000000000002     0 32482      1 0x00000000
[76325.608713] kaccess       D 0000000000000001     0 32488      1 0x00000000
[76325.608752] mail          D 0000000000000000     0  7397   7386 0x00000000
[76325.608830] hal-acl-tool  D 0000000000000000     0  7430   3399 0x00000004
[76325.608888] mrtg          D 0000000000000000     0  7444   7433 0x00000000
[76325.608981] cron          D 0000000000000000     0  7500   3630 0x00000000
[76325.609000] alpine        D 0000000000000000     0  7659  29120 0x00000000

List of functions underneath the D-state processes (sorted/uniqued)--

     121 [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
      77 [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
      62 [<ffffffff814543a5>] ? schedule_timeout+0x165/0x1a0
      60 [<ffffffff813bc1f6>] ? __alloc_skb+0x66/0x170
      60 [<ffffffff813b3e59>] ? sys_sendto+0x119/0x180
      59 [<ffffffff81428397>] ? unix_dgram_sendmsg+0x467/0x5c0
      59 [<ffffffff81427ce6>] ? unix_wait_for_peer+0x86/0xd0
      59 [<ffffffff813bd497>] ? memcpy_fromiovec+0x57/0x80
      59 [<ffffffff813b6c29>] ? sock_alloc_send_pskb+0x1d9/0x2f0
      59 [<ffffffff813b3a4b>] ? sock_sendmsg+0xcb/0x100
      59 [<ffffffff813b3062>] ? sockfd_lookup_light+0x22/0x80
      58 [<ffffffff814287ed>] ? unix_dgram_connect+0xad/0x270
      58 [<ffffffff813b3336>] ? sys_connect+0x86/0xe0
      57 [<ffffffff81427ed5>] ? unix_find_other+0x1a5/0x200
      57 [<ffffffff810c9d13>] ? mntput_no_expire+0x23/0xf0
      57 [<ffffffff810a3e74>] ? page_add_new_anon_rmap+0x54/0x90
      57 [<ffffffff8105947e>] ? current_fs_time+0x1e/0x30
      55 [<ffffffff81085445>] ? filemap_fault+0x95/0x3e0
       8 [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
       7 [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
       7 [<ffffffff810af727>] ? do_sys_open+0x97/0x150
       6 [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
       5 [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
       4 [<ffffffff811f5284>] ? xfs_file_fsync+0x54/0x70
       4 [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
       4 [<ffffffff811f0242>] ? kmem_zone_zalloc+0x32/0x50
       4 [<ffffffff811f01d3>] ? kmem_zone_alloc+0x83/0xc0
       4 [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
       4 [<ffffffff810d3a4b>] ? sys_fsync+0xb/0x20
       4 [<ffffffff810d39f6>] ? do_fsync+0x36/0x60
       4 [<ffffffff810d394e>] ? vfs_fsync+0x9e/0x110
       4 [<ffffffff810bbcde>] ? __link_path_walk+0x7e/0x1000
       3 [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
       3 [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
       3 [<ffffffff811f7b82>] ? xfs_vn_mknod+0x82/0x130
       3 [<ffffffff811eeab1>] ? xfs_fsync+0x141/0x190
       3 [<ffffffff811e8f1b>] ? _xfs_trans_commit+0x38b/0x3a0
       3 [<ffffffff811ddfac>] ? xlog_grant_log_space+0x28c/0x3c0
       3 [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
       3 [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
       3 [<ffffffff811dc1b0>] ? xfs_log_release_iclog+0x10/0x40
       3 [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
       3 [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
       3 [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
       3 [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
       3 [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
       3 [<ffffffff810c80ca>] ? alloc_fd+0x4a/0x140
       3 [<ffffffff810c2110>] ? pollwake+0x0/0x60
       3 [<ffffffff810c0b88>] ? poll_freewait+0x48/0xb0
       3 [<ffffffff810be8ee>] ? do_filp_open+0x9ee/0xac0
       3 [<ffffffff810be134>] ? do_filp_open+0x234/0xac0
       3 [<ffffffff810baeb6>] ? vfs_create+0xa6/0xf0
       3 [<ffffffff810b51d7>] ? vfs_fstatat+0x37/0x80
       3 [<ffffffff810ad46d>] ? kmem_cache_alloc+0x6d/0xa0
       3 [<ffffffff8104aca3>] ? __wake_up+0x43/0x70
       2 [<ffffffff81455797>] ? __down_write_nested+0x17/0xb0
       2 [<ffffffff81455151>] ? __down+0x61/0xa0
       2 [<ffffffff81454e85>] ? do_nanosleep+0x95/0xd0
       2 [<ffffffff81454dbd>] ? schedule_hrtimeout_range+0x11d/0x140
       2 [<ffffffff81454359>] ? schedule_timeout+0x119/0x1a0
       2 [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
       2 [<ffffffff811f4b82>] ? xfs_buf_read_flags+0x12/0xa0
       2 [<ffffffff811f4a4e>] ? xfs_buf_get_flags+0x6e/0x190
       2 [<ffffffff811f48f4>] ? _xfs_buf_find+0x134/0x220
       2 [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
       2 [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
       2 [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
       2 [<ffffffff811ed872>] ? xfs_create+0x312/0x530
       2 [<ffffffff811eb6e8>] ? xfs_dir_ialloc+0xa8/0x340
       2 [<ffffffff811ea4a6>] ? xfs_trans_read_buf+0x1e6/0x360
       2 [<ffffffff811dc337>] ? xlog_state_sync+0x157/0x2a0
       2 [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
       2 [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
       2 [<ffffffff810c31dc>] ? dput+0xac/0x160
       2 [<ffffffff810c29d3>] ? d_kill+0x53/0x70
       2 [<ffffffff810b9b38>] ? generic_permission+0x78/0x130
       2 [<ffffffff8109a9a5>] ? handle_mm_fault+0x1b5/0x780
       2 [<ffffffff810987fa>] ? __do_fault+0x3ca/0x4b0
       2 [<ffffffff8108cc30>] ? pdflush+0x0/0x220
       2 [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
       2 [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
       2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
       2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
       2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
       2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
       2 [<ffffffff8106e016>] ? down+0x46/0x50
       2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
       2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
       2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
       2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
       2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
       2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
       2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
       2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
       2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
       2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
       2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
       2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
       1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
       1 [<ffffffff81455857>] ? __down_read+0x17/0xae
       1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
       1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
       1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
       1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
       1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
       1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
       1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
       1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
       1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
       1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
       1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
       1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
       1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
       1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
       1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
       1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
       1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
       1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
       1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
       1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
       1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
       1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
       1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
       1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
       1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
       1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
       1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
       1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
       1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
       1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
       1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
       1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
       1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
       1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
       1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
       1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
       1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
       1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
       1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
       1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
       1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
       1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
       1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
       1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
       1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
       1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
       1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
       1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
       1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
       1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
       1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
       1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
       1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
       1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
       1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
       1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
       1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
       1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
       1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
       1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
       1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
       1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
       1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
       1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
       1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
       1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
       1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
       1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
       1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
       1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
       1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
       1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
       1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
       1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
       1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
       1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
       1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
       1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
       1 [<ffffffff810af556>] ? filp_close+0x56/0x90
       1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
       1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
       1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
       1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
       1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
       1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
       1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
       1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
       1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
       1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
       1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
       1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
       1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
       1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
       1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
       1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
       1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
       1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
       1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
       1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
       1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
       1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20
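
A tally like the one above can be rebuilt from a sysrq dump with a short
pipeline (a sketch: the grep pattern is an assumption, not necessarily the
command originally used; a few sample trace lines stand in for sysrq-w.txt):

```shell
# Extract each stack frame and count how often every function appears.
cat > /tmp/sysrq-sample.txt <<'EOF'
[76325.608106]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608200]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
EOF
# Keep only the "[<addr>] ? function+off/len" text, then tally frames:
grep -o '\[<[0-9a-f]*>\] ? [^ ]*' /tmp/sysrq-sample.txt | sort | uniq -c | sort -rn
```

With the sample input, the most frequent frame (_xfs_log_force, count 2) is
printed first, matching the sorted/uniqued layout above.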

       3 [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
       3 [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
       3 [<ffffffff811f7b82>] ? xfs_vn_mknod+0x82/0x130
       3 [<ffffffff811eeab1>] ? xfs_fsync+0x141/0x190
       3 [<ffffffff811e8f1b>] ? _xfs_trans_commit+0x38b/0x3a0
       3 [<ffffffff811ddfac>] ? xlog_grant_log_space+0x28c/0x3c0
       3 [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
       3 [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
       3 [<ffffffff811dc1b0>] ? xfs_log_release_iclog+0x10/0x40
       3 [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
       3 [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
       3 [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
       3 [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
       3 [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
       3 [<ffffffff810c80ca>] ? alloc_fd+0x4a/0x140
       3 [<ffffffff810c2110>] ? pollwake+0x0/0x60
       3 [<ffffffff810c0b88>] ? poll_freewait+0x48/0xb0
       3 [<ffffffff810be8ee>] ? do_filp_open+0x9ee/0xac0
       3 [<ffffffff810be134>] ? do_filp_open+0x234/0xac0
       3 [<ffffffff810baeb6>] ? vfs_create+0xa6/0xf0
       3 [<ffffffff810b51d7>] ? vfs_fstatat+0x37/0x80
       3 [<ffffffff810ad46d>] ? kmem_cache_alloc+0x6d/0xa0
       3 [<ffffffff8104aca3>] ? __wake_up+0x43/0x70
       2 [<ffffffff81455797>] ? __down_write_nested+0x17/0xb0
       2 [<ffffffff81455151>] ? __down+0x61/0xa0
       2 [<ffffffff81454e85>] ? do_nanosleep+0x95/0xd0
       2 [<ffffffff81454dbd>] ? schedule_hrtimeout_range+0x11d/0x140
       2 [<ffffffff81454359>] ? schedule_timeout+0x119/0x1a0
       2 [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
       2 [<ffffffff811f4b82>] ? xfs_buf_read_flags+0x12/0xa0
       2 [<ffffffff811f4a4e>] ? xfs_buf_get_flags+0x6e/0x190
       2 [<ffffffff811f48f4>] ? _xfs_buf_find+0x134/0x220
       2 [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
       2 [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
       2 [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
       2 [<ffffffff811ed872>] ? xfs_create+0x312/0x530
       2 [<ffffffff811eb6e8>] ? xfs_dir_ialloc+0xa8/0x340
       2 [<ffffffff811ea4a6>] ? xfs_trans_read_buf+0x1e6/0x360
       2 [<ffffffff811dc337>] ? xlog_state_sync+0x157/0x2a0
       2 [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
       2 [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
       2 [<ffffffff810c31dc>] ? dput+0xac/0x160
       2 [<ffffffff810c29d3>] ? d_kill+0x53/0x70
       2 [<ffffffff810b9b38>] ? generic_permission+0x78/0x130
       2 [<ffffffff8109a9a5>] ? handle_mm_fault+0x1b5/0x780
       2 [<ffffffff810987fa>] ? __do_fault+0x3ca/0x4b0
       2 [<ffffffff8108cc30>] ? pdflush+0x0/0x220
       2 [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
       2 [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
       2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
       2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
       2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
       2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
       2 [<ffffffff8106e016>] ? down+0x46/0x50
       2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
       2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
       2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
       2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
       2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
       2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
       2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
       2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
       2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
       2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
       2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
       2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
       1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
       1 [<ffffffff81455857>] ? __down_read+0x17/0xae
       1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
       1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
       1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
       1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
       1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
       1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
       1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
       1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
       1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
       1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
       1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
       1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
       1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
       1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
       1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
       1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
       1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
       1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
       1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
       1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
       1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
       1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
       1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
       1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
       1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
       1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
       1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
       1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
       1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
       1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
       1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
       1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
       1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
       1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
       1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
       1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
       1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
       1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
       1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
       1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
       1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
       1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
       1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
       1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
       1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
       1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
       1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
       1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
       1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
       1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
       1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
       1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
       1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
       1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
       1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
       1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
       1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
       1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
       1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
       1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
       1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
       1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
       1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
       1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
       1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
       1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
       1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
       1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
       1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
       1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
       1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
       1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
       1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
       1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
       1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
       1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
       1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
       1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
       1 [<ffffffff810af556>] ? filp_close+0x56/0x90
       1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
       1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
       1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
       1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
       1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
       1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
       1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
       1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
       1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
       1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
       1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
       1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
       1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
       1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
       1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
       1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
       1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
       1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
       1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
       1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
       1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
       1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20
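
The sorted/uniqued list above can be produced with a short pipeline; the exact command used is not shown in the thread, so this is a sketch of one plausible way to do it, run against a three-line stand-in rather than the real sysrq-w.txt:

```shell
# Stand-in for the real sysrq-w capture (frames copied from the traces
# above; the same frame recurs across different processes' stacks).
printf '%s\n' \
  '[76325.608103]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10' \
  '[76325.608222]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30' \
  '[76307.285130]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30' \
  > /tmp/sysrq-sample.txt

# Pull out just the "[<address>] ? function+offset/size" frame from each
# line (dropping the timestamp), then count identical frames, most
# frequent first -- the same shape as the list above.
grep -oE '\[<[0-9a-f]+>\] \? [^ ]+' /tmp/sysrq-sample.txt \
  | sort | uniq -c | sort -rn
```

On the real capture, replacing `/tmp/sysrq-sample.txt` with `sysrq-w.txt` (or `sysrq-t.txt`) should yield the kind of histogram shown above.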

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-18 20:17   ` Justin Piszcz
@ 2009-10-19  3:04     ` Dave Chinner
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-19  3:04 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again, all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095]  [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
> [76325.608099]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [76325.608106]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [76325.608108]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216]  [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
> [76325.608220]  [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
> [76325.608225]  [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
> [76325.608228]  [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233]  [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:

All pointing to log IO not completing.

That is, all of the D state processes are backed up on locks or
waiting for IO completion processing. A lot of the processes are
waiting for _xfs_log_force to complete, others are waiting for
inodes to be unpinned or are backed up behind locked inodes that are
waiting on log IO to complete before they can complete the
transaction and unlock the inode, and so on.

Unfortunately, the xfslogd and xfsdatad kernel threads are not
present in any of the output given, so I can't tell if these have
deadlocked themselves and caused the problem. However, my experience
with such pile-ups is that an I/O completion has not been run for
some reason and that is the cause of the problem. I don't know if
you can provide enough information to tell us if this happened or
not. Instead, do you have a test case that you can share?
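
[Editor's note: the absence check described above can be scripted; a minimal sketch, where the two sample lines stand in for the real sysrq-t.txt named in the wget list:]

```shell
# Stand-in for the real sysrq-t capture (two task lines from the traces
# above; neither is an xfslogd or xfsdatad thread).
printf '%s\n' \
  '[76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000' \
  '[76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000' \
  > /tmp/sysrq-t.txt

# Report whether the XFS I/O completion threads appear anywhere in the
# capture; on the real file this confirms whether they were recorded.
if ! grep -qE 'xfslogd|xfsdatad' /tmp/sysrq-t.txt; then
  echo 'no xfslogd/xfsdatad entries in capture'
fi
```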

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-19  3:04     ` Dave Chinner
@ 2009-10-19 10:18       ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-19 10:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Mon, 19 Oct 2009, Dave Chinner wrote:

> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again, all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Hello,

So far I do not have a reproducible test case. The only other thing not posted was the output of ps auxww at the time of the lockup; I am not sure if it will help, but here it is:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2] 
root         2  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/1]
root         6  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/1]
root         7  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/2]
root         8  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/2]
root         9  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/3]
root        10  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/3]
root        11  0.0  0.0      0     0 ?        R<   Oct16   0:00 [events/0]
root        12  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/1]
root        13  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/2]
root        14  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/3]
root        15  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khelper]
root        20  0.0  0.0      0     0 ?        S<   Oct16   0:00 [async/mgr]
root       180  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/0]
root       181  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/1]
root       182  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/2]
root       183  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/3]
root       185  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpid]
root       186  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpi_notify]
root       187  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpi_hotplug]
root       271  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/0]
root       272  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/1]
root       273  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/2]
root       274  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/3]
root       275  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata_aux]
root       276  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksuspend_usbd]
root       280  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khubd]
root       283  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kseriod]
root       318  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khpsbpkt]
root       361  0.0  0.0      0     0 ?        S    Oct16   0:00 [pdflush]
root       362  0.0  0.0      0     0 ?        D    Oct16   0:43 [pdflush]
root       363  0.0  0.0      0     0 ?        S<   Oct16   0:21 [kswapd0]
root       364  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/0]
root       365  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/1]
root       366  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/2]
root       367  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/3]
root       368  0.0  0.0      0     0 ?        S<   Oct16   0:00 [nfsiod]
root       369  0.0  0.0      0     0 ?        S<   Oct16   0:00 [cifsoplockd]
root       370  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfs_mru_cache]
root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
root       518  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_0]
root       521  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_1]
root       524  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_2]
root       527  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_3]
root       530  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_4]
root       533  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_5]
root       542  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_6]
root       545  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_7]
root       551  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_8]
root       554  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_9]
root       558  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_10]
root       562  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_11]
root       568  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_12]
root       571  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_13]
root       584  0.0  0.0      0     0 ?        S<   Oct16   0:00 [knodemgrd_0]
root       616  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kpsmoused]
root       666  0.0  0.0      0     0 ?        S<   Oct16   0:00 [usbhid_resumer]
root       683  0.0  0.0      0     0 ?        S<   Oct16   0:00 [hd-audio0]
root       703  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/0]
root       704  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/1]
root       705  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/2]
root       706  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/3]
root       811  4.0  0.0      0     0 ?        S<   Oct16  81:54 [md3_raid5]
root       817  0.3  0.0      0     0 ?        S<   Oct16   6:30 [md2_raid1]
root       823  0.0  0.0      0     0 ?        S<   Oct16   0:00 [md1_raid1]
root       827  0.0  0.0      0     0 ?        S<   Oct16   0:50 [md0_raid1]
root       829  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsbufd]
root       830  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsaild]
root       831  0.0  0.0      0     0 ?        D<   Oct16   0:00 [xfssyncd]
root       884  0.0  0.0  16736   740 ?        S<s  Oct16   0:00 udevd --daemon
postfix   1649  0.0  0.0  39124  2468 ?        S    05:00   0:00 qmgr -l -t fifo -u -c
www-data  1877  0.0  0.0 146612  5248 ?        S    05:01   0:00 /usr/sbin/apache2 -k start
root      3182  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kjournald]
root      3183  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsbufd]
root      3184  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsaild]
root      3185  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfssyncd]
root      3337  0.0  0.0   8072   132 ?        Ss   Oct16   0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d
root      3339  0.0  0.0   8072   424 ?        Ds   Oct16   0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d
daemon    3526  0.0  0.0   8016   564 ?        Ss   Oct16   0:00 /sbin/portmap
statd     3538  0.0  0.0  10148   776 ?        Ss   Oct16   0:00 /sbin/rpc.statd
root      3547  0.0  0.0  26956   572 ?        Ss   Oct16   0:00 /usr/sbin/rpc.idmapd
root      3732  0.0  0.0   5900   676 ?        Ds   Oct16   0:00 /sbin/syslogd
root      3741  0.0  0.0   3796   452 ?        Ss   Oct16   0:00 /sbin/klogd -x
root      3750  0.0  0.0   3804   640 ?        Ss   Oct16   0:00 /usr/sbin/acpid
110       3760  0.0  0.0  23560  1476 ?        Ss   Oct16   0:02 /usr/bin/dbus-daemon --system
bind      3773  0.0  0.8 264364 70004 ?        Ssl  Oct16   0:05 /usr/sbin/named -u bind -S 1024
root      3796  0.0  0.0  49028  1260 ?        Ss   Oct16   0:00 /usr/sbin/sshd
root      3827  0.0  0.0 104804  7452 ?        Ssl  Oct16   0:08 /usr/sbin/console-kit-daemon
amavis    3909  0.0  1.1 217948 91004 ?        Ss   Oct16   0:00 amavisd (master)
nobody    3913  0.0  0.2 118612 21632 ?        Sl   Oct16   0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid
polw      3932  0.0  0.1  51868 12084 ?        Ss   Oct16   0:00 policyd-weight (master)
polw      3933  0.0  0.1  51868 11744 ?        Ss   Oct16   0:00 policyd-weight (cache)
postfw    3937  0.0  0.1  55248 14064 ?        Ss   Oct16   0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid
postgrey  3940  0.0  0.1  57900 13232 ?        Ss   Oct16   0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421 
clamav    4166  0.0  1.3 215576 112772 ?       Ssl  Oct16   0:04 /usr/sbin/clamd
clamav    4262  0.0  0.0  25776  1328 ?        Ss   Oct16   0:00 /usr/bin/freshclam -d --quiet
amavis    4307  0.0  1.1 224072 96732 ?        S    Oct16   0:02 amavisd (ch14-avail)
amavis    4308  0.0  1.2 228168 100820 ?       S    Oct16   0:05 amavisd (ch14-avail)
root      4353  0.0  0.0   7224  3944 ?        S    Oct16   0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
root      4362  0.0  0.0   8864   520 ?        Ss   Oct16   0:00 /usr/sbin/irqbalance
daemon    4377  0.0  0.0  38488   840 ?        Ss   Oct16   0:00 lpd Waiting 
root      4412  0.0  0.0      0     0 ?        S<   Oct16   0:00 [lockd]
root      4413  0.0  0.0      0     0 ?        S<   Oct16   0:10 [nfsd]
root      4414  0.0  0.0      0     0 ?        D<   Oct16   0:03 [nfsd]
root      4415  0.0  0.0      0     0 ?        D<   Oct16   0:13 [nfsd]
root      4416  0.0  0.0      0     0 ?        S<   Oct16   0:07 [nfsd]
root      4417  0.0  0.0      0     0 ?        D<   Oct16   0:11 [nfsd]
root      4418  0.0  0.0      0     0 ?        D<   Oct16   0:01 [nfsd]
root      4419  0.0  0.0      0     0 ?        D<   Oct16   0:04 [nfsd]
root      4420  0.0  0.0      0     0 ?        D<   Oct16   0:02 [nfsd]
root      4424  0.0  0.0  18812  1208 ?        Ss   Oct16   0:00 /usr/sbin/rpc.mountd --manage-gids
oident    4432  0.0  0.0  12232   576 ?        Ss   Oct16   0:00 /usr/sbin/oidentd -m -u oident -g oident
root      4444  0.0  0.0  10124   660 ?        Ss   Oct16   0:00 /usr/sbin/inetd
root      4521  0.0  0.0  37020  2388 ?        Ss   Oct16   0:00 /usr/lib/postfix/master
root      4545  0.0  0.0  58084  1796 ?        Ds   Oct16   0:00 /usr/sbin/nmbd -D
root      4547  0.0  0.0  93724  3044 ?        Ss   Oct16   0:00 /usr/sbin/smbd -D
root      4568  0.0  0.0  93752  1776 ?        S    Oct16   0:00 /usr/sbin/smbd -D
root      4700  0.0  0.0  18768  1348 ?        S    Oct16   0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid
asterisk  4714  1.3  0.2 666972 24460 ?        Ssl  Oct16  26:36 /usr/sbin/asterisk -p -U asterisk
asterisk  4715  0.0  0.0  13656   880 ?        S    Oct16   0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
ntp       4734  0.0  0.0  23428  1400 ?        Ss   Oct16   0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g
root      4789  0.0  0.0  16644   780 ?        Ss   Oct16   0:00 /usr/sbin/dovecot
root      4790  0.0  0.0  74664  3196 ?        S    Oct16   0:00 dovecot-auth
root      4792  0.0  0.0  74664  3240 ?        S    Oct16   0:00 dovecot-auth -w
111       4797  0.0  0.0  38084  4616 ?        Ss   Oct16   0:07 /usr/sbin/hald
root      4798  0.0  0.0  20040  1432 ?        S    Oct16   0:01 hald-runner
root      4895  0.0  0.0  22028  1240 ?        S    Oct16   0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3
root      4899  0.0  0.0  22032  1232 ?        S    Oct16   0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled
root      4900  0.0  0.0  22028  1228 ?        S    Oct16   0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled
111       4902  0.0  0.0  25920  1204 ?        S    Oct16   0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      4911  0.0  0.0  12484   656 ?        Ss   Oct16   0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root      4929  0.0  0.0  24020   584 ?        Ss   Oct16   0:00 /usr/sbin/squid -D -YC
proxy     4931  0.0  0.0  26900  5136 ?        S    Oct16   0:00 (squid) -D -YC
root      4942  0.0  0.0  28924  3548 ?        Ss   Oct16   0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345
root      4943  0.0  0.0  17392  1380 ?        S    Oct16   0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1
daemon    4945  0.3  0.0  16596  3172 ?        S    Oct16   7:01 p0f -u daemon -i eth1 -l tcp dst port 25
root      4947  0.0  0.0  93604  5832 ?        SNs  Oct16   0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph
root      4951  0.0  0.0   9296  3716 ?        Ss   Oct16   0:00 /usr/sbin/dhcpd3 -q eth0
nut       4964  0.0  0.0  14384   744 ?        Ss   Oct16   0:15 /lib/nut/usbhid-ups -a apc
nut       4966  0.0  0.0  14316   600 ?        Ss   Oct16   0:01 /sbin/upsd
root      4968  0.0  0.0  14284   728 ?        Ss   Oct16   0:00 /sbin/upsmon
nut       4970  0.0  0.0  14284   712 ?        S    Oct16   0:01 /sbin/upsmon
daemon    5007  0.0  0.0  16356   416 ?        Ss   Oct16   0:00 /usr/sbin/atd
root      5027  0.0  0.0  20988  1060 ?        Ss   Oct16   0:00 /usr/sbin/cron
dovecot   5887  0.0  0.0  18504  2108 ?        S    15:31   0:00 imap-login
root      6184  0.0  0.0  73496  6380 ?        Sl   Oct16   0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      6205  0.0  0.0  26292   720 ?        Ss   Oct16   0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc
root      6213  0.1  1.3 192020 112852 tty7    Ss+  Oct16   2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6
root      6284  0.0  0.0   5852   600 tty1     Ss+  Oct16   0:00 /sbin/getty 38400 tty1
root      6285  0.0  0.0   5852   596 tty2     Ss+  Oct16   0:00 /sbin/getty 38400 tty2
root      6286  0.0  0.0   5852   596 tty3     Ss+  Oct16   0:00 /sbin/getty 38400 tty3
root      6287  0.0  0.0   5852   600 tty4     Ss+  Oct16   0:00 /sbin/getty 38400 tty4
root      6288  0.0  0.0   5852   596 tty5     Ss+  Oct16   0:00 /sbin/getty 38400 tty5
root      6289  0.0  0.0   5852   600 tty6     Ss+  Oct16   0:00 /sbin/getty 38400 tty6
root      6297  0.0  0.0  55716  1864 ?        S    Oct16   0:00 -:0 
postfix   6362  0.0  0.0  39076  2308 ?        S    15:36   0:00 anvil -l -t unix -u -o max_idle=3600s
root      7003  0.0  0.0  80904  3412 ?        Ss   Oct16   0:00 sshd: ap [priv] 
polw      7443  0.0  0.1  52000 12516 ?        S    Oct16   0:00 policyd-weight (child)
dovecot   8922  0.0  0.0  18504  2108 ?        S    16:01   0:00 imap-login
dovecot   8923  0.0  0.0  18504  2112 ?        S    16:01   0:00 imap-login
postfix  18407  0.0  0.0  39076  2352 ?        D    17:50   0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks
asterisk 18424  0.0  0.0  36996  2260 ?        D    17:54   0:00 /usr/sbin/postdrop -r
root     18425  0.0  0.0  46372  1440 ?        S    17:55   0:00 /USR/SBIN/CRON
root     18459  0.0  0.0  11404  1328 ?        Ss   17:55   0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi
root     18460  0.0  0.0  21888  1432 ?        D    17:55   0:00 /usr/lib/hal/hal-acl-tool --reconfigure
root     18465  0.0  0.2  43416 16892 ?        S    17:55   0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg
root     18479  0.0  0.0   4140   612 ?        S    17:55   0:00 tee -a /var/log/mrtg/mrtg.log
root     18538  0.0  0.0  38476  1976 ?        D    17:55   0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0
postfix  18539  0.0  0.1  50740  9836 ?        S    17:55   0:00 cleanup -z -t unix -u -c
root     18555  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18556  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18557  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18558  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18559  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18560  0.0  0.0  46368  1172 ?        S    18:02   0:00 /USR/SBIN/CRON
root     18561  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18562  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18563  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18564  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18565  0.0  0.0  46368  1172 ?        S    18:09   0:00 /USR/SBIN/CRON
root     18566  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18567  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18568  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18578  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18579  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18580  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18581  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18582  0.0  0.0  46368  1172 ?        S    18:17   0:00 /USR/SBIN/CRON
root     18583  0.0  0.0  46368  1172 ?        S    18:18   0:00 /USR/SBIN/CRON
postfix  18584  0.0  0.0  71452  8104 ?        S    18:18   0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings
postfix  18585  0.0  0.0  39076  2300 ?        S    18:18   0:00 proxymap -t unix -u
sshd     18597  0.0  0.0      0     0 ?        Z    18:19   0:00 [sshd] <defunct>
root     18601  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     18602  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     18603  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     30866  0.0  0.0  54392   944 ?        Ss   04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30867  0.0  0.0  54392   652 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30868  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30869  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30870  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     31018  0.0  0.0 146612  7856 ?        Ss   04:31   0:00 /usr/sbin/apache2 -k start
www-data 31025  0.0  0.0 146612  5248 ?        S    04:31   0:00 /usr/sbin/apache2 -k start
www-data 31044  0.0  0.0 146612  5248 ?        S    04:31   0:00 /usr/sbin/apache2 -k start
root     31076  0.0  0.0  23128   756 ?        Ss   04:31   0:00 pure-ftpd (SERVER)


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
@ 2009-10-19 10:18       ` Justin Piszcz
  0 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-19 10:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs



On Mon, 19 Oct 2009, Dave Chinner wrote:

> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again, all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Hello,

So far I do not have a reproducible test case. The only other thing not posted
was the output of ps auxww at the time of the lockup; I am not sure if it will
help, but here it is:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2] 
root         2  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/1]
root         6  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/1]
root         7  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/2]
root         8  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/2]
root         9  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/3]
root        10  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/3]
root        11  0.0  0.0      0     0 ?        R<   Oct16   0:00 [events/0]
root        12  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/1]
root        13  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/2]
root        14  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/3]
root        15  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khelper]
root        20  0.0  0.0      0     0 ?        S<   Oct16   0:00 [async/mgr]
root       180  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/0]
root       181  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/1]
root       182  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/2]
root       183  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/3]
root       185  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpid]
root       186  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpi_notify]
root       187  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpi_hotplug]
root       271  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/0]
root       272  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/1]
root       273  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/2]
root       274  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/3]
root       275  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata_aux]
root       276  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksuspend_usbd]
root       280  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khubd]
root       283  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kseriod]
root       318  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khpsbpkt]
root       361  0.0  0.0      0     0 ?        S    Oct16   0:00 [pdflush]
root       362  0.0  0.0      0     0 ?        D    Oct16   0:43 [pdflush]
root       363  0.0  0.0      0     0 ?        S<   Oct16   0:21 [kswapd0]
root       364  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/0]
root       365  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/1]
root       366  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/2]
root       367  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/3]
root       368  0.0  0.0      0     0 ?        S<   Oct16   0:00 [nfsiod]
root       369  0.0  0.0      0     0 ?        S<   Oct16   0:00 [cifsoplockd]
root       370  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfs_mru_cache]
root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
root       518  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_0]
root       521  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_1]
root       524  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_2]
root       527  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_3]
root       530  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_4]
root       533  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_5]
root       542  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_6]
root       545  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_7]
root       551  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_8]
root       554  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_9]
root       558  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_10]
root       562  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_11]
root       568  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_12]
root       571  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_13]
root       584  0.0  0.0      0     0 ?        S<   Oct16   0:00 [knodemgrd_0]
root       616  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kpsmoused]
root       666  0.0  0.0      0     0 ?        S<   Oct16   0:00 [usbhid_resumer]
root       683  0.0  0.0      0     0 ?        S<   Oct16   0:00 [hd-audio0]
root       703  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/0]
root       704  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/1]
root       705  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/2]
root       706  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/3]
root       811  4.0  0.0      0     0 ?        S<   Oct16  81:54 [md3_raid5]
root       817  0.3  0.0      0     0 ?        S<   Oct16   6:30 [md2_raid1]
root       823  0.0  0.0      0     0 ?        S<   Oct16   0:00 [md1_raid1]
root       827  0.0  0.0      0     0 ?        S<   Oct16   0:50 [md0_raid1]
root       829  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsbufd]
root       830  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsaild]
root       831  0.0  0.0      0     0 ?        D<   Oct16   0:00 [xfssyncd]
root       884  0.0  0.0  16736   740 ?        S<s  Oct16   0:00 udevd --daemon
postfix   1649  0.0  0.0  39124  2468 ?        S    05:00   0:00 qmgr -l -t fifo -u -c
www-data  1877  0.0  0.0 146612  5248 ?        S    05:01   0:00 /usr/sbin/apache2 -k start
root      3182  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kjournald]
root      3183  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsbufd]
root      3184  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsaild]
root      3185  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfssyncd]
root      3337  0.0  0.0   8072   132 ?        Ss   Oct16   0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d
root      3339  0.0  0.0   8072   424 ?        Ds   Oct16   0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d
daemon    3526  0.0  0.0   8016   564 ?        Ss   Oct16   0:00 /sbin/portmap
statd     3538  0.0  0.0  10148   776 ?        Ss   Oct16   0:00 /sbin/rpc.statd
root      3547  0.0  0.0  26956   572 ?        Ss   Oct16   0:00 /usr/sbin/rpc.idmapd
root      3732  0.0  0.0   5900   676 ?        Ds   Oct16   0:00 /sbin/syslogd
root      3741  0.0  0.0   3796   452 ?        Ss   Oct16   0:00 /sbin/klogd -x
root      3750  0.0  0.0   3804   640 ?        Ss   Oct16   0:00 /usr/sbin/acpid
110       3760  0.0  0.0  23560  1476 ?        Ss   Oct16   0:02 /usr/bin/dbus-daemon --system
bind      3773  0.0  0.8 264364 70004 ?        Ssl  Oct16   0:05 /usr/sbin/named -u bind -S 1024
root      3796  0.0  0.0  49028  1260 ?        Ss   Oct16   0:00 /usr/sbin/sshd
root      3827  0.0  0.0 104804  7452 ?        Ssl  Oct16   0:08 /usr/sbin/console-kit-daemon
amavis    3909  0.0  1.1 217948 91004 ?        Ss   Oct16   0:00 amavisd (master)
nobody    3913  0.0  0.2 118612 21632 ?        Sl   Oct16   0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid
polw      3932  0.0  0.1  51868 12084 ?        Ss   Oct16   0:00 policyd-weight (master)
polw      3933  0.0  0.1  51868 11744 ?        Ss   Oct16   0:00 policyd-weight (cache)
postfw    3937  0.0  0.1  55248 14064 ?        Ss   Oct16   0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid
postgrey  3940  0.0  0.1  57900 13232 ?        Ss   Oct16   0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421 
clamav    4166  0.0  1.3 215576 112772 ?       Ssl  Oct16   0:04 /usr/sbin/clamd
clamav    4262  0.0  0.0  25776  1328 ?        Ss   Oct16   0:00 /usr/bin/freshclam -d --quiet
amavis    4307  0.0  1.1 224072 96732 ?        S    Oct16   0:02 amavisd (ch14-avail)
amavis    4308  0.0  1.2 228168 100820 ?       S    Oct16   0:05 amavisd (ch14-avail)
root      4353  0.0  0.0   7224  3944 ?        S    Oct16   0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
root      4362  0.0  0.0   8864   520 ?        Ss   Oct16   0:00 /usr/sbin/irqbalance
daemon    4377  0.0  0.0  38488   840 ?        Ss   Oct16   0:00 lpd Waiting 
root      4412  0.0  0.0      0     0 ?        S<   Oct16   0:00 [lockd]
root      4413  0.0  0.0      0     0 ?        S<   Oct16   0:10 [nfsd]
root      4414  0.0  0.0      0     0 ?        D<   Oct16   0:03 [nfsd]
root      4415  0.0  0.0      0     0 ?        D<   Oct16   0:13 [nfsd]
root      4416  0.0  0.0      0     0 ?        S<   Oct16   0:07 [nfsd]
root      4417  0.0  0.0      0     0 ?        D<   Oct16   0:11 [nfsd]
root      4418  0.0  0.0      0     0 ?        D<   Oct16   0:01 [nfsd]
root      4419  0.0  0.0      0     0 ?        D<   Oct16   0:04 [nfsd]
root      4420  0.0  0.0      0     0 ?        D<   Oct16   0:02 [nfsd]
root      4424  0.0  0.0  18812  1208 ?        Ss   Oct16   0:00 /usr/sbin/rpc.mountd --manage-gids
oident    4432  0.0  0.0  12232   576 ?        Ss   Oct16   0:00 /usr/sbin/oidentd -m -u oident -g oident
root      4444  0.0  0.0  10124   660 ?        Ss   Oct16   0:00 /usr/sbin/inetd
root      4521  0.0  0.0  37020  2388 ?        Ss   Oct16   0:00 /usr/lib/postfix/master
root      4545  0.0  0.0  58084  1796 ?        Ds   Oct16   0:00 /usr/sbin/nmbd -D
root      4547  0.0  0.0  93724  3044 ?        Ss   Oct16   0:00 /usr/sbin/smbd -D
root      4568  0.0  0.0  93752  1776 ?        S    Oct16   0:00 /usr/sbin/smbd -D
root      4700  0.0  0.0  18768  1348 ?        S    Oct16   0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid
asterisk  4714  1.3  0.2 666972 24460 ?        Ssl  Oct16  26:36 /usr/sbin/asterisk -p -U asterisk
asterisk  4715  0.0  0.0  13656   880 ?        S    Oct16   0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
ntp       4734  0.0  0.0  23428  1400 ?        Ss   Oct16   0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g
root      4789  0.0  0.0  16644   780 ?        Ss   Oct16   0:00 /usr/sbin/dovecot
root      4790  0.0  0.0  74664  3196 ?        S    Oct16   0:00 dovecot-auth
root      4792  0.0  0.0  74664  3240 ?        S    Oct16   0:00 dovecot-auth -w
111       4797  0.0  0.0  38084  4616 ?        Ss   Oct16   0:07 /usr/sbin/hald
root      4798  0.0  0.0  20040  1432 ?        S    Oct16   0:01 hald-runner
root      4895  0.0  0.0  22028  1240 ?        S    Oct16   0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3
root      4899  0.0  0.0  22032  1232 ?        S    Oct16   0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled
root      4900  0.0  0.0  22028  1228 ?        S    Oct16   0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled
111       4902  0.0  0.0  25920  1204 ?        S    Oct16   0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      4911  0.0  0.0  12484   656 ?        Ss   Oct16   0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root      4929  0.0  0.0  24020   584 ?        Ss   Oct16   0:00 /usr/sbin/squid -D -YC
proxy     4931  0.0  0.0  26900  5136 ?        S    Oct16   0:00 (squid) -D -YC
root      4942  0.0  0.0  28924  3548 ?        Ss   Oct16   0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345
root      4943  0.0  0.0  17392  1380 ?        S    Oct16   0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1
daemon    4945  0.3  0.0  16596  3172 ?        S    Oct16   7:01 p0f -u daemon -i eth1 -l tcp dst port 25
root      4947  0.0  0.0  93604  5832 ?        SNs  Oct16   0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph
root      4951  0.0  0.0   9296  3716 ?        Ss   Oct16   0:00 /usr/sbin/dhcpd3 -q eth0
nut       4964  0.0  0.0  14384   744 ?        Ss   Oct16   0:15 /lib/nut/usbhid-ups -a apc
nut       4966  0.0  0.0  14316   600 ?        Ss   Oct16   0:01 /sbin/upsd
root      4968  0.0  0.0  14284   728 ?        Ss   Oct16   0:00 /sbin/upsmon
nut       4970  0.0  0.0  14284   712 ?        S    Oct16   0:01 /sbin/upsmon
daemon    5007  0.0  0.0  16356   416 ?        Ss   Oct16   0:00 /usr/sbin/atd
root      5027  0.0  0.0  20988  1060 ?        Ss   Oct16   0:00 /usr/sbin/cron
dovecot   5887  0.0  0.0  18504  2108 ?        S    15:31   0:00 imap-login
root      6184  0.0  0.0  73496  6380 ?        Sl   Oct16   0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      6205  0.0  0.0  26292   720 ?        Ss   Oct16   0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc
root      6213  0.1  1.3 192020 112852 tty7    Ss+  Oct16   2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6
root      6284  0.0  0.0   5852   600 tty1     Ss+  Oct16   0:00 /sbin/getty 38400 tty1
root      6285  0.0  0.0   5852   596 tty2     Ss+  Oct16   0:00 /sbin/getty 38400 tty2
root      6286  0.0  0.0   5852   596 tty3     Ss+  Oct16   0:00 /sbin/getty 38400 tty3
root      6287  0.0  0.0   5852   600 tty4     Ss+  Oct16   0:00 /sbin/getty 38400 tty4
root      6288  0.0  0.0   5852   596 tty5     Ss+  Oct16   0:00 /sbin/getty 38400 tty5
root      6289  0.0  0.0   5852   600 tty6     Ss+  Oct16   0:00 /sbin/getty 38400 tty6
root      6297  0.0  0.0  55716  1864 ?        S    Oct16   0:00 -:0 
postfix   6362  0.0  0.0  39076  2308 ?        S    15:36   0:00 anvil -l -t unix -u -o max_idle=3600s
root      7003  0.0  0.0  80904  3412 ?        Ss   Oct16   0:00 sshd: ap [priv] 
polw      7443  0.0  0.1  52000 12516 ?        S    Oct16   0:00 policyd-weight (child)
dovecot   8922  0.0  0.0  18504  2108 ?        S    16:01   0:00 imap-login
dovecot   8923  0.0  0.0  18504  2112 ?        S    16:01   0:00 imap-login
postfix  18407  0.0  0.0  39076  2352 ?        D    17:50   0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks
asterisk 18424  0.0  0.0  36996  2260 ?        D    17:54   0:00 /usr/sbin/postdrop -r
root     18425  0.0  0.0  46372  1440 ?        S    17:55   0:00 /USR/SBIN/CRON
root     18459  0.0  0.0  11404  1328 ?        Ss   17:55   0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi
root     18460  0.0  0.0  21888  1432 ?        D    17:55   0:00 /usr/lib/hal/hal-acl-tool --reconfigure
root     18465  0.0  0.2  43416 16892 ?        S    17:55   0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg
root     18479  0.0  0.0   4140   612 ?        S    17:55   0:00 tee -a /var/log/mrtg/mrtg.log
root     18538  0.0  0.0  38476  1976 ?        D    17:55   0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0
postfix  18539  0.0  0.1  50740  9836 ?        S    17:55   0:00 cleanup -z -t unix -u -c
root     18555  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18556  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18557  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18558  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18559  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18560  0.0  0.0  46368  1172 ?        S    18:02   0:00 /USR/SBIN/CRON
root     18561  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18562  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18563  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18564  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18565  0.0  0.0  46368  1172 ?        S    18:09   0:00 /USR/SBIN/CRON
root     18566  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18567  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18568  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18578  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18579  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18580  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18581  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18582  0.0  0.0  46368  1172 ?        S    18:17   0:00 /USR/SBIN/CRON
root     18583  0.0  0.0  46368  1172 ?        S    18:18   0:00 /USR/SBIN/CRON
postfix  18584  0.0  0.0  71452  8104 ?        S    18:18   0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings
postfix  18585  0.0  0.0  39076  2300 ?        S    18:18   0:00 proxymap -t unix -u
sshd     18597  0.0  0.0      0     0 ?        Z    18:19   0:00 [sshd] <defunct>
root     18601  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     18602  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     18603  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     30866  0.0  0.0  54392   944 ?        Ss   04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30867  0.0  0.0  54392   652 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30868  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30869  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30870  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     31018  0.0  0.0 146612  7856 ?        Ss   04:31   0:00 /usr/sbin/apache2 -k start
www-data 31025  0.0  0.0 146612  5248 ?        S    04:31   0:00 /usr/sbin/apache2 -k start
www-data 31044  0.0  0.0 146612  5248 ?        S    04:31   0:00 /usr/sbin/apache2 -k start
root     31076  0.0  0.0  23128   756 ?        Ss   04:31   0:00 pure-ftpd (SERVER)
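
The D-state pile-up visible throughout the listing above can be pulled out of a
`ps` dump with a small filter. A minimal sketch, not part of the original
report (the helper name `filter_dstate` is hypothetical):

```shell
# filter_dstate reads `ps axo pid,stat,comm`-style lines on stdin and
# prints only tasks in uninterruptible sleep. The helper name is
# hypothetical; field 2 is STAT, and D, D<, Ds, etc. all start with D.
filter_dstate() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Typical use on a live system:
#   ps axo pid=,stat=,comm= | filter_dstate
```

Run against a hung box, this shows at a glance whether the stuck set is the
same mix of nfsd, pdflush, and postfix helpers seen here.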

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-19 10:18       ` Justin Piszcz
@ 2009-10-20  0:33         ` Dave Chinner
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-20  0:33 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
> On Mon, 19 Oct 2009, Dave Chinner wrote:
>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>> It has happened again, all sysrq-X output was saved this time.
>> .....
>>
>> All pointing to log IO not completing.
>>
....
> So far I do not have a reproducible test case,

Ok. What sort of load is being placed on the machine?

> the only other thing not posted was the output of ps auxww at
> the time of the lockup; I am not sure if it will help, but here it is:
>
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2] 
....
> root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
> root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
> root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
> root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
> root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
> root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
> root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
> root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
> root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
> root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
> root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
> root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
.....

It appears that both the xfslogd and the xfsdatad on CPU 0 are in
the running state but don't appear to be consuming any significant
CPU time. If they remain like this then I think that means they are
stuck waiting on the run queue.  Do these XFS threads always appear
like this when the hang occurs? If so, is there something else that
is hogging CPU 0 preventing these threads from getting the CPU?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-20  0:33         ` Dave Chinner
@ 2009-10-20  8:33           ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-20  8:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Tue, 20 Oct 2009, Dave Chinner wrote:

> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>> It has happened again, all sysrq-X output was saved this time.
>>> .....
>>>
>>> All pointing to log IO not completing.
>>>
> ....
>> So far I do not have a reproducible test case,
>
> Ok. What sort of load is being placed on the machine?
Hello, generally the load is low; it mainly serves out some Samba shares.

>
>> the only other thing not posted was the output of ps auxww during
>> the time of the lockup, not sure if it will help, but here it is:
>>
>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>> root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2]
> ....
>> root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
>> root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
>> root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
>> root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
>> root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
>> root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
>> root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
>> root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
>> root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
>> root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
>> root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
>> root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
> .....
>
> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
> the running state but don't appear to be consuming any significant
> CPU time. If they remain like this then I think that means they are
> stuck waiting on the run queue.  Do these XFS threads always appear
> like this when the hang occurs? If so, is there something else that
> is hogging CPU 0 preventing these threads from getting the CPU?
Yes, the XFS threads show up like this each time the kernel hangs.  So far,
after ~48 hours on 2.6.30.9, it has not hung, so the issue appears to have been
introduced between 2.6.30.9 and 2.6.31.x.  Any recommendations on how to catch
this bug, e.g. with certain debug options enabled?


>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-20  8:33           ` Justin Piszcz
@ 2009-10-21 10:19             ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-21 10:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Tue, 20 Oct 2009, Justin Piszcz wrote:


>
>
> On Tue, 20 Oct 2009, Dave Chinner wrote:
>
>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>> It has happened again, all sysrq-X output was saved this time.
>>>> .....
>>>> 
>>>> All pointing to log IO not completing.
>>>> 
>> ....
>>> So far I do not have a reproducible test case,
>> 
>> Ok. What sort of load is being placed on the machine?
> Hello, generally the load is low, it mainly serves out some samba shares.
>
>> 
>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>> the running state but don't appear to be consuming any significant
>> CPU time. If they remain like this then I think that means they are
>> stuck waiting on the run queue.  Do these XFS threads always appear
>> like this when the hang occurs? If so, is there something else that
>> is hogging CPU 0 preventing these threads from getting the CPU?
> Yes, the XFS threads show up like this on each time the kernel crashed.  So 
> far
> with 2.6.30.9 after ~48hrs+ it has not crashed.  So it appears to be some 
> issue
> between 2.6.30.9 and 2.6.31.x when this began happening.  Any recommendations
> on how to catch this bug w/certain options enabled/etc?
>
>
>> 
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com
>> 
>

Uptime with 2.6.30.9:

  06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07

No issues yet, so the problem first appeared in 2.6.31.x.

Any further recommendations on how to debug this issue?  Also, based on the
logs/output so far, do you view this as an XFS bug or an MD/VFS-layer issue?

Justin.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* mdadm --detail showing annoying device
  2009-10-21 10:19             ` Justin Piszcz
  (?)
@ 2009-10-21 14:17             ` Stephane Bunel
  2009-10-21 21:46               ` Neil Brown
  -1 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-21 14:17 UTC (permalink / raw)
  To: linux-raid

Hi,

I'm a newbie in the mdadm world. I defined some udev rules to give disks
static names according to their bus/host/target, i.e. /dev/sda becomes
/dev/raid_disk0. Nothing very special; it's just a convenient way to tie a
disk's name to its physical location.

#ls -la /dev/raid*
brw-rw---- 1 root disk 8,  0 2009-10-16 18:12 /dev/raid_disk0
brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1

A RAID1 array (/dev/md0) is assembled over these two disks.
When querying detailed information, mdadm shows an annoying device name in
place of /dev/raid_disk*:

---8x---
#mdadm --detail /dev/md0
/dev/md0:
         Version : 0.90
   Creation Time : Tue Oct 13 12:53:54 2009
      Raid Level : raid1
      Array Size : 488386496 (465.76 GiB 500.11 GB)
   Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Wed Oct 21 15:16:09 2009
           State : clean
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
          Events : 0.1526

     Number   Major   Minor   RaidDevice State
        0       8        0        0      active sync   /dev/char/21:0
        1       8       16        1      active sync   /dev/char/21:1
---8x---


Looking at the mdadm source code, I found that the device-name selection rule
is (too) simple: when multiple names match, select the shortest one. So, since
'/dev/char/21:0' is shorter than '/dev/raid_disk0', mdadm displays
'/dev/char/...'. It's annoying to see a char device path (/dev/char/...)
representing a hard disk (which is of course a block device). The purpose of
the following patch is to take the directory part into account and always
prefer the name that is closest to /dev. In case of equal depth, the shorter
name wins.


---8x---
#./mdadm --detail /dev/md0
/dev/md0:
         Version : 0.90
   Creation Time : Tue Oct 13 12:53:54 2009
      Raid Level : raid1
      Array Size : 488386496 (465.76 GiB 500.11 GB)
   Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Wed Oct 21 15:30:34 2009
           State : clean
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
          Events : 0.1526

     Number   Major   Minor   RaidDevice State
        0       8        0        0      active sync   /dev/raid_disk0
        1       8       16        1      active sync   /dev/raid_disk1
---8x---


Feel free to refactor the code. The last time I wrote C code I still had hair
on my head ;-)


---8x---
--- /var/tmp/mdadm_snapshot/util.c    2009-10-20 04:50:23.000000000 +0200
+++ /var/tmp/mdadm_dirlevel/util.c    2009-10-20 13:13:32.000000000 +0200
@@ -507,6 +507,38 @@
  #endif /* HAVE_FTW */
  #endif /* HAVE_NFTW */

+char *select_by_directory_level(char *current, char *registered)
+{
+        unsigned int current_level = 0;
+        unsigned int registered_level = 0;
+
+        unsigned int level(char *pathname)
+        {
+                unsigned int count = 0;
+                char *p = pathname;
+
+                while ( (p = strchr( p, '/' )) ) {
+                        count++;
+                        p++;
+                }
+
+                return( count );
+        }
+
+        current_level = level( current );
+        registered_level = level( registered );
+
+        if (  current_level < registered_level )
+                return current;
+
+        if ( current_level == registered_level )
+                if ( strlen( strrchr( current, '/')) <
+                     strlen( strrchr( registered,'/')) )
+                        return current;
+
+        return registered;
+}
+
  /*
   * Find a block device with the right major/minor number.
   * If we find multiple names, choose the shortest.
@@ -544,14 +576,18 @@
          if (p->major == major &&
              p->minor == minor) {
              if (strncmp(p->name, "/dev/md/",8) == 0) {
-                if (preferred == NULL ||
-                    strlen(p->name) < strlen(preferred))
-                    preferred = p->name;
-            } else {
-                if (regular == NULL ||
-                    strlen(p->name) < strlen(regular))
-                    regular = p->name;
-            }
+                if (preferred == NULL)
+                        preferred = p->name;
+                else
+                        preferred = select_by_directory_level(
+                                        p->name, preferred );
+            } else {
+                if (regular == NULL )
+                        regular = p->name;
+                else
+                        regular = select_by_directory_level(
+                                        p->name, regular );
+            }
          }
      if (!regular && !preferred && !did_check) {
          devlist_ready = 0;
---8x---



Stéphane Bunel.


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: mdadm --detail showing annoying device
  2009-10-21 14:17             ` mdadm --detail showing annoying device Stephane Bunel
@ 2009-10-21 21:46               ` Neil Brown
  2009-10-22 11:22                 ` Stephane Bunel
  2009-10-22 11:29                 ` Mario 'BitKoenig' Holbe
  0 siblings, 2 replies; 49+ messages in thread
From: Neil Brown @ 2009-10-21 21:46 UTC (permalink / raw)
  To: Stephane Bunel; +Cc: linux-raid

On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
> Hi,
> 
> I'm a newbie in the mdadm world. I defined some udev rules to make disk 
> staticly named according to the bus/host/target. I.e. /dev/sda become 
> /dev/raid_disk0. So nothing very special with that, it's just a convenient way 
> to assign disk name to it's physical location.
> 
> #ls -la /dev/raid*
> brw-rw---- 1 root disk 8,  0 2009-10-16 18:12 /dev/raid_disk0
> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
> 
> A RAID1 (/dev/md0) is assembled over this two disk.
> When looking for detailed information, mdadm show annoying device name in 
> place of /dev/raid_disk*:
> 
....
>      Number   Major   Minor   RaidDevice State
>         0       8        0        0      active sync   /dev/char/21:0
>         1       8       16        1      active sync   /dev/char/21:1

What is a block device doing in /dev/char ???  There should only be
character devices in there.

If these are actually block devices, then I think there is something
wrong with your udev rules.
If these are char devices, then mdadm is doing the wrong thing, but I
cannot see that from the code.

Your proposal of choosing the highest rather than the shortest name
has some merit, but your current situation doesn't seem to justify
it, and I particularly like the simplicity of the current heuristic.

So for now I think I'll leave it as it is and encourage you to fix
your udev rule.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: mdadm --detail showing annoying device
  2009-10-21 21:46               ` Neil Brown
@ 2009-10-22 11:22                 ` Stephane Bunel
  2009-10-29  3:44                   ` Neil Brown
  2009-10-22 11:29                 ` Mario 'BitKoenig' Holbe
  1 sibling, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 11:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>> Hi,
>>
>> I'm a newbie in the mdadm world. I defined some udev rules to make disk 
>> staticly named according to the bus/host/target. I.e. /dev/sda become 
>> /dev/raid_disk0. So nothing very special with that, it's just a convenient way 
>> to assign disk name to it's physical location.
>>
>> #ls -la /dev/raid*
>> brw-rw---- 1 root disk 8,  0 2009-10-16 18:12 /dev/raid_disk0
>> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
>>
>> A RAID1 (/dev/md0) is assembled over this two disk.
>> When looking for detailed information, mdadm show annoying device name in 
>> place of /dev/raid_disk*:
>>
> ....
>>      Number   Major   Minor   RaidDevice State
>>         0       8        0        0      active sync   /dev/char/21:0
>>         1       8       16        1      active sync   /dev/char/21:1
> 
> What is a block device doing in /dev/char ???  There should only be
> character devices in there.
> 
> If these are actually block device, then I think there is something
> wrong with your udev rules.

   I don't think my udev rules are the cause, because they just rename
/dev/sd* to /dev/raid_disk*. For udev, /dev/char/21:0 seems to correspond to
the SCSI generic driver (sg), which is bound to SCSI devices.

> If these are char devices, then mdadm is doing the wrong thing, but I
> cannot see that from the code.

   By choosing the shortest name without distinguishing the path (/dev/.../)
from the name (sda), mdadm picks /dev/char/21:0 simply because it is shorter
than /dev/raid_disk0.

> Your proposal of choosing the highest rather than the shortest name
> has some merit, but I your current situation doesn't seem to justify
> it, and I particularly like the simplicity of the current heuristic.

  My proposal doesn't choose the highest name; it selects based on the
shortest directory path to the device name. That is, my proposal chooses:

/dev/raid_disk0 (1 directory level) over /dev/char/21:0 (2 directory levels)
/dev/raid_disk0 (1 directory level) over /dev/block/8:0 (2 directory levels)
/dev/sda1 (1 directory level) over /dev/disk/by-label/BOOT (3 directory levels)

   So in fact, my proposal doesn't change the current behaviour; it merely
adjusts the heuristic so that an annoying device name (a char device) is not
shown as a RAID member just because its full name is shorter than a
"semantically better" name.

   Adding a printf in map_dev() shows that my proposal helps make the
heuristic more robust with "advanced" device naming, without changing
existing behaviour:


#./mdadm --detail /dev/md0
(/dev/char/21:0)
(/dev/block/8:0)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
(/dev/raid_disk0)
(/dev/char/21:1)
(/dev/block/8:16)
(/dev/raid_disk1)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
/dev/md0:
         Version : 0.90
   Creation Time : Tue Oct 13 12:53:54 2009
      Raid Level : raid1
      Array Size : 488386496 (465.76 GiB 500.11 GB)
   Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Thu Oct 22 12:42:44 2009
           State : clean
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
          Events : 0.1526

     Number   Major   Minor   RaidDevice State
        0       8        0        0      active sync(/dev/char/21:0)
(/dev/block/8:0)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
(/dev/raid_disk0)
    /dev/raid_disk0
        1       8       16        1      active sync(/dev/char/21:1)
(/dev/block/8:16)
(/dev/raid_disk1)
(/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
(/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
    /dev/raid_disk1



Thanks,
Stéphane Bunel.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: mdadm --detail showing annoying device
  2009-10-21 21:46               ` Neil Brown
  2009-10-22 11:22                 ` Stephane Bunel
@ 2009-10-22 11:29                 ` Mario 'BitKoenig' Holbe
  2009-10-22 14:17                   ` Stephane Bunel
  1 sibling, 1 reply; 49+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-10-22 11:29 UTC (permalink / raw)
  To: linux-raid

Neil Brown <neilb@suse.de> wrote:
> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>         0       8        0        0      active sync   /dev/char/21:0
>>         1       8       16        1      active sync   /dev/char/21:1
> What is a block device doing in /dev/char ???  There should only be
> character devices in there.

major 21 are usually SCSI generic devices (/dev/sg) and they are char...
crw-rw----  1 root root  21, 0 Oct 10 21:48 /dev/sg0
crw-rw----  1 root root  21, 1 Oct 10 21:48 /dev/sg1

The question is why they appear in the mdadm --detail output.


regards
   Mario
-- 
There are trivial truths and the great truths.
The opposite of a trivial truth is plainly false.
The opposite of a great truth is also true.
                                    -- Niels Bohr


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: mdadm --detail showing annoying device
  2009-10-22 11:29                 ` Mario 'BitKoenig' Holbe
@ 2009-10-22 14:17                   ` Stephane Bunel
  2009-10-22 16:00                     ` Stephane Bunel
  0 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 14:17 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

Mario 'BitKoenig' Holbe wrote:
> Neil Brown <neilb@suse.de> wrote:
>> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>>         0       8        0        0      active sync   /dev/char/21:0
>>>         1       8       16        1      active sync   /dev/char/21:1
>> What is a block device doing in /dev/char ???  There should only be
>> character devices in there.
> 
> major 21 are usually SCSI generic devices (/dev/sg) and they are char...
> crw-rw----  1 root root  21, 0 Oct 10 21:48 /dev/sg0
> crw-rw----  1 root root  21, 1 Oct 10 21:48 /dev/sg1
> 
> The question is, why do they appear at mdadm --detail

   mdadm performs a physical walk so as not to follow symbolic links (cf.
nftw(FTW_PHYS) in map_dev()). But by then calling stat(), mdadm follows the
symbolic link anyway, and so gets the same type/major/minor as the link's
target.

   lstat() is identical to stat(), except that if the path is a symbolic
link, the link itself is stat-ed, not the file that it refers to.


#ls -la /dev/char/21:0
lrwxrwxrwx 1 root root 13 2009-10-16 18:12 /dev/char/21:0 -> ../raid_disk0

Tested from Python:
 >>> import os, stat

Using stat:
 >>> mode = os.stat( '/dev/char/21:0' )[ stat.ST_MODE ]
 >>> stat.S_ISBLK( mode )
True

Using lstat():
 >>> mode = os.lstat( '/dev/char/21:0' )[ stat.ST_MODE ]
 >>> stat.S_ISBLK( mode )
False



Stéphane Bunel.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: mdadm --detail showing annoying device
  2009-10-22 14:17                   ` Stephane Bunel
@ 2009-10-22 16:00                     ` Stephane Bunel
  0 siblings, 0 replies; 49+ messages in thread
From: Stephane Bunel @ 2009-10-22 16:00 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

Stephane Bunel wrote:
> Mario 'BitKoenig' Holbe wrote:
>> Neil Brown <neilb@suse.de> wrote:
>>> On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
>>>>         0       8        0        0      active sync   /dev/char/21:0
>>>>         1       8       16        1      active sync   /dev/char/21:1
>>> What is a block device doing in /dev/char ???  There should only be
>>> character devices in there.
>>
>> major 21 are usually SCSI generic devices (/dev/sg) and they are char...
>> crw-rw----  1 root root  21, 0 Oct 10 21:48 /dev/sg0
>> crw-rw----  1 root root  21, 1 Oct 10 21:48 /dev/sg1
>>
>> The question is, why do they appear at mdadm --detail
> 
>   mdadm performs a physical walk to not follow symbolic links (cf nftw( 
> FTW_PHYS ) in map_dev() ). But using stat() mdadm finaly follow the 
> symbolic link and so returns the same type/major/minor as the targeted 
> link.
> 
>   lstat() is identical to stat(), except that if path is a symbolic 
> link, then the link itself is stat-ed, not the file that it refers to.
> 
> 
> #ls -la /dev/char/21:0
> lrwxrwxrwx 1 root root 13 2009-10-16 18:12 /dev/char/21:0 -> ../raid_disk0
> 
> Tested from Python:
>  >>> import os, stat
> 
> Using stat:
>  >>> mode = os.stat( '/dev/char/21:0' )[ stat.ST_MODE ]
>  >>> stat.S_ISBLK( mode )
> True
> 
> using lstat():
>  >>> mode = os.lstat( '/dev/char/21:0' )[ stat.ST_MODE ]
>  >>> stat.S_ISBLK( mode )
> False


Just for fun ;-)

--- util.c.orig	2009-10-22 17:54:11.000000000 +0200
+++ util.c	2009-10-22 17:55:09.000000000 +0200
@@ -468,7 +468,7 @@
  	struct stat st;

  	if (S_ISLNK(stb->st_mode)) {
-		if (stat(name, &st) != 0)
+		if (lstat(name, &st) != 0)
  			return 0;
  		stb = &st;
  	}



Stéphane Bunel.



* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-21 10:19             ` Justin Piszcz
@ 2009-10-22 22:49               ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-22 22:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Wed, 21 Oct 2009, Justin Piszcz wrote:

>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>> 
>> 
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>> 
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>> 
>>>>> All pointing to log IO not completing.
>>>>> 
>>> ....
>>>> So far I do not have a reproducible test case,
>>> 
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low; it mainly serves out some Samba shares.
>> 
>>> 
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue.  Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this each time the kernel crashed.  So far
>> with 2.6.30.9 after ~48hrs+ it has not crashed.  So it appears to be some
>> issue introduced between 2.6.30.9 and 2.6.31.x.  Any recommendations
>> on how to catch this bug with certain debug options enabled, etc.?
>> 
>> 
>>> 
>>> Cheers,
>>> 
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@fromorbit.com
>>> 
>> 
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.31.x.
>
> Any further recommendations on how to debug this issue?  BTW: Do you view 
> this
> as an XFS bug or MD/VFS layer issue based on the logs/output thus far?
>
> Justin.
>
>

Any other ideas?

Currently stuck on 2.6.30.9 (no issues, no lockups). The box normally has
no load at all either. Has anyone else reported similar problems?

Justin.


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-22 22:49               ` Justin Piszcz
@ 2009-10-22 23:00                 ` Dave Chinner
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2009-10-22 23:00 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Thu, Oct 22, 2009 at 06:49:46PM -0400, Justin Piszcz wrote:
> On Wed, 21 Oct 2009, Justin Piszcz wrote:
>> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>>> the running state but don't appear to be consuming any significant
>>>> CPU time. If they remain like this then I think that means they are
>>>> stuck waiting on the run queue.  Do these XFS threads always appear
>>>> like this when the hang occurs? If so, is there something else that
>>>> is hogging CPU 0 preventing these threads from getting the CPU?
>>> Yes, the XFS threads show up like this on each time the kernel 
>>> crashed.  So far
>>> with 2.6.30.9 after ~48hrs+ it has not crashed.  So it appears to be 
>>> some issue
>>> between 2.6.30.9 and 2.6.31.x when this began happening.  Any  
>>> recommendations
>>> on how to catch this bug w/certain options enabled/etc?
>>
>> Uptime with 2.6.30.9:
>>
>> 06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07
>>
>> No issues yet, so it first started happening in 2.6.(31).(x).

Ok.

>> Any further recommendations on how to debug this issue?  BTW: Do
>> you view this as an XFS bug or MD/VFS layer issue based on the
>> logs/output thus far?

Could be either. Nothing so far points at a cause.

> Any other ideas?

If it is relatively quick to reproduce, you could run a git bisect
to try to find the offending commit. Or, once it has locked up, run
oprofile with callgraph sampling so we can get an idea of what
is actually running when XFS appears to hang.
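Dave's bisect suggestion can be sketched end to end on a throwaway repository.
This is only an illustration of the workflow, not the real kernel bisect: the
toy repo stands in for the kernel tree, and a grep for the string "HANG" stands
in for the 48-hour "does this kernel lock up?" test. All file and commit names
below are made up.

```shell
# Toy 'git bisect' walkthrough: find the commit that introduced "HANG".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

echo ok > driver.c
git add driver.c
git commit -qm 'v2.6.30: known good'
git tag good                               # last release known stable

for i in 1 2 3 4; do
    echo "harmless change $i" >> driver.c
    git commit -qam "cleanup $i"
done
echo 'HANG' >> driver.c                    # the offending change
git commit -qam 'regression: introduces the hang'
echo 'later change' >> driver.c
git commit -qam 'unrelated follow-up'
git tag bad                                # first release showing the hang

# 'git bisect run' marks each checked-out midpoint good or bad automatically:
# the test command exits 0 for good, non-zero for bad.
git bisect start bad good >/dev/null
git bisect run sh -c '! grep -q HANG driver.c' > bisect.out 2>&1
grep 'is the first bad commit' bisect.out  # the regression commit
git bisect reset >/dev/null                # return to the original branch
```

For the real case, the test command would be replaced by manual `git bisect
good`/`git bisect bad` calls after each boot-and-wait cycle, between
`v2.6.31-rc4` (apparently good) and the first bad 2.6.31 release.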

> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has  
> no load at all either.. Has anyone else reported similar problems?

Not that I know of.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-22 22:49               ` Justin Piszcz
@ 2009-10-26 11:24                 ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-10-26 11:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Thu, 22 Oct 2009, Justin Piszcz wrote:

>
> Any other ideas?
>
> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no 
> load at all either.. Has anyone else reported similar problems?
>
> Justin.
>

--

Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2 
later today and wait another 48 hours.

Justin.



* Re: mdadm --detail showing annoying device
  2009-10-22 11:22                 ` Stephane Bunel
@ 2009-10-29  3:44                   ` Neil Brown
  2009-11-03  9:37                     ` Stephane Bunel
  0 siblings, 1 reply; 49+ messages in thread
From: Neil Brown @ 2009-10-29  3:44 UTC (permalink / raw)
  To: Stephane Bunel; +Cc: linux-raid

On Thursday October 22, stephane.bunel@forumdesimages.fr wrote:
> Neil Brown wrote:
> > On Wednesday October 21, stephane.bunel@forumdesimages.fr wrote:
> >> Hi,
> >>
> >> I'm a newbie in the mdadm world. I defined some udev rules to name disks
> >> statically according to the bus/host/target, i.e. /dev/sda becomes
> >> /dev/raid_disk0. Nothing very special about that; it's just a convenient
> >> way to tie a disk name to its physical location.
> >>
> >> #ls -la /dev/raid*
> >> brw-rw---- 1 root disk 8,  0 2009-10-16 18:12 /dev/raid_disk0
> >> brw-rw---- 1 root disk 8, 16 2009-10-16 18:12 /dev/raid_disk1
> >>
> >> A RAID1 (/dev/md0) is assembled over these two disks.
> >> When looking for detailed information, mdadm shows annoying device names
> >> in place of /dev/raid_disk*:
> >>
> > ....
> >>      Number   Major   Minor   RaidDevice State
> >>         0       8        0        0      active sync   /dev/char/21:0
> >>         1       8       16        1      active sync   /dev/char/21:1
> > 
> > What is a block device doing in /dev/char ???  There should only be
> > character devices in there.
> > 
> > If these are actually block device, then I think there is something
> > wrong with your udev rules.
> 
>    I don't think my udev rules are at fault because they just rename /dev/sd*
> to /dev/raid_disk*. For udev, /dev/char/21:0 seems to correspond to the
> generic SCSI device driver (sg), which is bound to the SCSI devices.

That doesn't answer the question of why a block device is appearing in
/dev/char/.
My guess (which is quite possibly wrong, but it is the best I can do)
is that whatever change you made to udev.rules to get /dev/sdXX
renamed to /dev/raid_diskXX also renamed the scsi-generic
devices to /dev/raid_diskXX.
I think that would have the effect that you are seeing.


> 
> > If these are char devices, then mdadm is doing the wrong thing, but I
> > cannot see that from the code.
> 
>    mdadm, by choosing the shortest name without distinguishing the path
> (/dev/.../) from the name (sda), picks /dev/char/21:0 simply because it is
> shorter than /dev/raid_disk0.
> 
> > Your proposal of choosing the highest rather than the shortest name
> > has some merit, but your current situation doesn't seem to justify
> > it, and I particularly like the simplicity of the current heuristic.
> 
>   My proposal doesn't choose the highest name but selects based on the
> shortest path to the device name, i.e. my proposal chooses:
> 
> /dev/raid_disk0 (1 directory level) over /dev/char/21:0 (2 directory levels)
> /dev/raid_disk0 (1 directory level) over /dev/block/8:0 (2 directory levels)
> /dev/sda1 (1 directory level) over /dev/disk/by-label/BOOT (3 directory levels)

This is what I meant by 'highest': highest up in the directory tree,
closest to the root (I guess some people might think that being close
to the root is 'low', not 'high' :-)
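The two heuristics under discussion can be sketched in a few lines of Python.
This is only an illustration, not mdadm's actual map_dev() code; the candidate
list is taken from the instrumented output quoted further down in this message.

```python
# Contrast mdadm's current "shortest full name" choice with the proposed
# "fewest directory levels first" choice, over the names map_dev() collects
# for the first disk.
candidates = [
    "/dev/char/21:0",
    "/dev/block/8:0",
    "/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA",
    "/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA",
    "/dev/raid_disk0",
]

def current_heuristic(names):
    # Shortest full pathname wins (ties resolved by scan order).
    return min(names, key=len)

def proposed_heuristic(names):
    # Fewer directory levels wins; pathname length only breaks ties.
    return min(names, key=lambda n: (n.count("/"), len(n)))

print(current_heuristic(candidates))   # /dev/char/21:0 (ties with /dev/block/8:0)
print(proposed_heuristic(candidates))  # /dev/raid_disk0
```

Note that /dev/char/21:0 and /dev/block/8:0 are the same length, so under the
current heuristic whichever nftw() visits first is reported, which matches the
/dev/char/21:0 seen in the --detail output.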

> 
>    So in fact, my proposal doesn't change the current situation but "adjusts"
> the heuristic to avoid presenting an annoying device name (a char device) as a
> member of the raid just because its full name is shorter than a "semantically
> better" name.

I appreciate that.  I am not against changing the heuristic if I find
a good reason ... particularly if I can find something that actually
measures "semantic goodness"!
But I don't think you have provided a good reason.
/dev/char/21:0 should be a char device, not a block device.  So mdadm
should ignore it.
On your system, /dev/char/21:0 is a block device (or a link to a block
device) so there is clearly some sort of configuration error.

If you still cannot find it, maybe you could show us the change you
made to udev.rules, and an 'ls -l' of '/dev/char'.  That might help
shed some light on your situation.

NeilBrown


> 
>    Adding a printf in map_dev() shows that my proposal seems to help the
> heuristic be more robust in the case of "advanced" device naming, without
> changing current behaviour.
> 
> 
> #./mdadm --detail /dev/md0
> (/dev/char/21:0)
> (/dev/block/8:0)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
> (/dev/raid_disk0)
> (/dev/char/21:1)
> (/dev/block/8:16)
> (/dev/raid_disk1)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
> /dev/md0:
>          Version : 0.90
>    Creation Time : Tue Oct 13 12:53:54 2009
>       Raid Level : raid1
>       Array Size : 488386496 (465.76 GiB 500.11 GB)
>    Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
>     Raid Devices : 2
>    Total Devices : 2
> Preferred Minor : 0
>      Persistence : Superblock is persistent
> 
>      Update Time : Thu Oct 22 12:42:44 2009
>            State : clean
>   Active Devices : 2
> Working Devices : 2
>   Failed Devices : 0
>    Spare Devices : 0
> 
>             UUID : 3fea95a0:e1b8a341:3b119117:e416f62b
>           Events : 0.1526
> 
>      Number   Major   Minor   RaidDevice State
>         0       8        0        0      active sync(/dev/char/21:0)
> (/dev/block/8:0)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ37D3RA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ37D3RA)
> (/dev/raid_disk0)
>     /dev/raid_disk0
>         1       8       16        1      active sync(/dev/char/21:1)
> (/dev/block/8:16)
> (/dev/raid_disk1)
> (/dev/disk/by-id/ata-Hitachi_HDP725050GLA360_GEA534RJ36T9VA)
> (/dev/disk/by-id/scsi-SATA_Hitachi_HDP7250_GEA534RJ36T9VA)
>     /dev/raid_disk1
> 
> 
> 
> Thanks,
> Stéphane Bunel.
> 


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-26 11:24                 ` Justin Piszcz
@ 2009-11-02 21:46                   ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-02 21:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Mon, 26 Oct 2009, Justin Piszcz wrote:

>
>
> On Thu, 22 Oct 2009, Justin Piszcz wrote:
>
>> 
>> Any other ideas?
>> 
>> Currently stuck on 2.6.30.9.. (no issues, no lockups)-- Box normally has no 
>> load at all either.. Has anyone else reported similar problems?
>> 
>> Justin.
>> 
>
> --
>
> Currently running 2.6.31-rc1 for 2 days now, no crashes, will go to -rc2 
> later today and wait another 48 hours.
>
> Justin.
>
>

Kernel report:

2.6.31-rc1: no crash - uptime 2+ days
2.6.31-rc2: no crash - uptime 2+ days
2.6.31-rc3: no crash - uptime 2+ days
2.6.31-rc4: no crash but network kept dropping out
2.6.31-rc5: cannot test, service owner needs host available
2.6.31-rc6: cannot test, service owner needs host available
2.6.31-rc7: cannot test, service owner needs host available
2.6.31-rc8: cannot test, service owner needs host available
2.6.31-rc9: cannot test, service owner needs host available
   2.6.31.x: locks up in D-state

The regression therefore lies somewhere between -rc4 and the 2.6.31.x releases.

Justin.



* Re: mdadm --detail showing annoying device
  2009-10-29  3:44                   ` Neil Brown
@ 2009-11-03  9:37                     ` Stephane Bunel
  2009-11-03 10:09                       ` Beolach
  0 siblings, 1 reply; 49+ messages in thread
From: Stephane Bunel @ 2009-11-03  9:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
(...)

> On your system, /dev/char/21:0 is a block device (or a link to a block
> device) so there is clearly some sort of configuration error.

All files in /dev/char are symlinks (see below). The rules come from Gentoo.

> If you still cannot find it, maybe you could show us the change you
> made to udev.rules, and an 'ls -l' of '/dev/char'.  That might help
> shed some light on your situation.

Considering mdadm is only interested in "real" block device files, why not
just skip symlinks?

  o Udev rules used to rename /dev/sd[ab]:

#cat 65-persistent-block.rules
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0", 
NAME="raid_disk0"

ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0", 
NAME="raid_disk1"

  o System is Gentoo.
  o Content of /dev/char:

#ls -la /dev/char/
total 0
drwxr-xr-x  2 root root 2880 2009-11-02 16:26 .
drwxr-xr-x 16 root root 3480 2009-11-02 16:26 ..
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 10:1 -> ../psaux
lrwxrwxrwx  1 root root    9 2009-10-16 18:12 10:227 -> ../mcelog
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 10:229 -> ../fuse
lrwxrwxrwx  1 root root   11 2009-10-16 18:12 10:231 -> ../snapshot
lrwxrwxrwx  1 root root   13 2009-10-16 18:12 10:252 -> ../dac960_gam
lrwxrwxrwx  1 root root   11 2009-10-16 18:12 10:58 -> ../megadev0
lrwxrwxrwx  1 root root    6 2009-10-16 18:12 10:59 -> ../tgt
lrwxrwxrwx  1 root root   21 2009-10-16 18:12 10:60 -> ../network_throughput
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 10:61 -> ../network_latency
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 10:62 -> ../cpu_dma_latency
lrwxrwxrwx  1 root root   17 2009-10-16 18:12 10:63 -> ../mapper/control
lrwxrwxrwx  1 root root    6 2009-10-16 18:12 1:1 -> ../mem
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 1:11 -> ../kmsg
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 1:2 -> ../kmem
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 1:3 -> ../null
lrwxrwxrwx  1 root root   12 2009-10-16 18:12 13:0 -> ../input/js0
lrwxrwxrwx  1 root root   15 2009-10-16 18:12 13:32 -> ../input/mouse0
lrwxrwxrwx  1 root root   13 2009-10-16 18:12 13:63 -> ../input/mice
lrwxrwxrwx  1 root root   15 2009-10-16 18:12 13:64 -> ../input/event0
lrwxrwxrwx  1 root root   15 2009-10-16 18:12 13:65 -> ../input/event1
lrwxrwxrwx  1 root root   15 2009-10-16 18:12 13:67 -> ../input/event3
lrwxrwxrwx  1 root root   15 2009-10-16 18:12 13:68 -> ../input/event4
lrwxrwxrwx  1 root root   15 2009-10-16 18:12 13:69 -> ../input/event5
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 1:4 -> ../port
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 1:5 -> ../zero
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 1:7 -> ../full
lrwxrwxrwx  1 root root    9 2009-10-16 18:12 1:8 -> ../random
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:0 -> ../bus/usb/001/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:128 -> ../bus/usb/002/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:129 -> ../bus/usb/002/002
lrwxrwxrwx  1 root root   18 2009-11-02 16:25 189:133 -> ../bus/usb/002/006
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:256 -> ../bus/usb/003/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:384 -> ../bus/usb/004/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:385 -> ../bus/usb/004/002
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:512 -> ../bus/usb/005/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:640 -> ../bus/usb/006/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:768 -> ../bus/usb/007/001
lrwxrwxrwx  1 root root   18 2009-10-16 18:12 189:896 -> ../bus/usb/008/001
lrwxrwxrwx  1 root root   10 2009-10-16 18:12 1:9 -> ../urandom
lrwxrwxrwx  1 root root   13 2009-10-16 18:12 21:0 -> ../raid_disk0
lrwxrwxrwx  1 root root   13 2009-10-16 18:12 21:1 -> ../raid_disk1
lrwxrwxrwx  1 root root    6 2009-11-02 16:26 21:2 -> ../sg2
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 248:0 -> ../rtc0
lrwxrwxrwx  1 root root   10 2009-10-16 18:12 254:0 -> ../hidraw0
lrwxrwxrwx  1 root root   10 2009-10-16 18:12 254:1 -> ../hidraw1
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:0 -> ../tty0
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:1 -> ../tty1
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:10 -> ../tty10
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:11 -> ../tty11
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:12 -> ../tty12
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:13 -> ../tty13
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:14 -> ../tty14
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:15 -> ../tty15
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:16 -> ../tty16
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:17 -> ../tty17
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:18 -> ../tty18
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:19 -> ../tty19
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:2 -> ../tty2
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:20 -> ../tty20
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:21 -> ../tty21
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:22 -> ../tty22
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:23 -> ../tty23
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:24 -> ../tty24
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:25 -> ../tty25
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:26 -> ../tty26
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:27 -> ../tty27
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:28 -> ../tty28
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:29 -> ../tty29
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:3 -> ../tty3
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:30 -> ../tty30
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:31 -> ../tty31
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:32 -> ../tty32
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:33 -> ../tty33
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:34 -> ../tty34
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:35 -> ../tty35
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:36 -> ../tty36
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:37 -> ../tty37
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:38 -> ../tty38
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:39 -> ../tty39
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:4 -> ../tty4
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:40 -> ../tty40
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:41 -> ../tty41
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:42 -> ../tty42
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:43 -> ../tty43
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:44 -> ../tty44
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:45 -> ../tty45
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:46 -> ../tty46
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:47 -> ../tty47
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:48 -> ../tty48
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:49 -> ../tty49
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:5 -> ../tty5
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:50 -> ../tty50
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:51 -> ../tty51
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:52 -> ../tty52
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:53 -> ../tty53
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:54 -> ../tty54
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:55 -> ../tty55
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:56 -> ../tty56
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:57 -> ../tty57
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:58 -> ../tty58
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:59 -> ../tty59
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:6 -> ../tty6
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:60 -> ../tty60
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:61 -> ../tty61
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:62 -> ../tty62
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:63 -> ../tty63
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:64 -> ../ttyS0
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:65 -> ../ttyS1
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:66 -> ../ttyS2
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 4:67 -> ../ttyS3
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:7 -> ../tty7
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:8 -> ../tty8
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 4:9 -> ../tty9
lrwxrwxrwx  1 root root    6 2009-10-16 18:12 5:0 -> ../tty
lrwxrwxrwx  1 root root   10 2009-10-16 18:12 5:1 -> ../console
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 5:2 -> ../ptmx
lrwxrwxrwx  1 root root    6 2009-10-16 18:12 7:0 -> ../vcs
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:1 -> ../vcs1
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:10 -> ../vcs10
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:11 -> ../vcs11
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:12 -> ../vcs12
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:128 -> ../vcsa
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:129 -> ../vcsa1
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:130 -> ../vcsa2
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:131 -> ../vcsa3
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:132 -> ../vcsa4
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:133 -> ../vcsa5
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:134 -> ../vcsa6
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:135 -> ../vcsa7
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:136 -> ../vcsa8
lrwxrwxrwx  1 root root    8 2009-10-16 18:12 7:137 -> ../vcsa9
lrwxrwxrwx  1 root root    9 2009-10-16 18:12 7:138 -> ../vcsa10
lrwxrwxrwx  1 root root    9 2009-10-16 18:12 7:139 -> ../vcsa11
lrwxrwxrwx  1 root root    9 2009-10-16 18:12 7:140 -> ../vcsa12
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:2 -> ../vcs2
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:3 -> ../vcs3
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:4 -> ../vcs4
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:5 -> ../vcs5
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:6 -> ../vcs6
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:7 -> ../vcs7
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:8 -> ../vcs8
lrwxrwxrwx  1 root root    7 2009-10-16 18:12 7:9 -> ../vcs9

Stéphane Bunel.



* Re: mdadm --detail showing annoying device
  2009-11-03  9:37                     ` Stephane Bunel
@ 2009-11-03 10:09                       ` Beolach
  2009-11-03 12:16                         ` Stephane Bunel
  0 siblings, 1 reply; 49+ messages in thread
From: Beolach @ 2009-11-03 10:09 UTC (permalink / raw)
  To: Stephane Bunel; +Cc: Neil Brown, linux-raid

On Tue, Nov 3, 2009 at 02:37, Stephane Bunel
<stephane.bunel@forumdesimages.fr> wrote:
> Neil Brown a écrit :
> (...)
>
>> On your system, /dev/char/21:0 is a block device (or a link to a block
>> device) so there is clearly some sort of configuration error.
>
> All files in /dev/char are symlinks (see below). Rules are coming from
> Gentoo.
>
>> If you still cannot find it, maybe you could show us the change you
>> made to udev.rules, and an 'ls -l' of '/dev/char'.  That might help
>> shed some light on your situation.
>
> Considering mdadm only deals with "real" block device files, why not just
> skip symlinks?
>
>  o Udev rules used to rename /dev/sd[ab]:
>
> #cat 65-persistent-block.rules
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
> NAME="raid_disk0"
>
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
> NAME="raid_disk1"
>

Try prepending SUBSYSTEM=="block" to those, so they'll only match
the (block) sd* devices and not the (char) sg? devices:
SUBSYSTEM=="block",
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
NAME="raid_disk0"
SUBSYSTEM=="block",
ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
NAME="raid_disk1"
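
For anyone wanting to sanity-check why the SUBSYSTEM=="block" match excludes
the sg nodes: sd* entries are block devices while sg* entries are character
devices, and the same distinction is visible from the shell. A generic sketch
(not part of the original rules; `/dev/null` stands in for a char node, since
the actual sd*/sg* nodes vary per system):

```shell
# Classify a /dev node the same way udev's SUBSYSTEM match does:
# block device, character device, or neither.
kind() {
  if [ -b "$1" ]; then echo block
  elif [ -c "$1" ]; then echo char
  else echo other
  fi
}

kind /dev/null   # -> char on any Linux system
```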


Good luck,
Conway S. Smith

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: mdadm --detail showing annoying device
  2009-11-03 10:09                       ` Beolach
@ 2009-11-03 12:16                         ` Stephane Bunel
  0 siblings, 0 replies; 49+ messages in thread
From: Stephane Bunel @ 2009-11-03 12:16 UTC (permalink / raw)
  To: Beolach; +Cc: Neil Brown, linux-raid

Beolach a écrit :
> On Tue, Nov 3, 2009 at 02:37, Stephane Bunel
> <stephane.bunel@forumdesimages.fr> wrote:
>> Neil Brown a écrit :
>> (...)
>>
>>> On your system, /dev/char/21:0 is a block device (or a link to a block
>>> device) so there is clearly some sort of configuration error.
>> All files in /dev/char are symlinks (see below). Rules are coming from
>> Gentoo.
>>
>>> If you still cannot find it, maybe you could show us the change you
>>> made to udev.rules, and an 'ls -l' of '/dev/char'.  That might help
>>> shed some light on your situation.
>> Considering mdadm only deals with "real" block device files, why not just
>> skip symlinks?
>>
>>  o Udev rules used to rename /dev/sd[ab]:
>>
>> #cat 65-persistent-block.rules
>> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
>> NAME="raid_disk0"
>>
>> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
>> NAME="raid_disk1"
>>
> 
> Try prepending SUBSYSTEM=="block" to those, so they'll only match
> the (block) sd* devices and not the (char) sg? devices:
> SUBSYSTEM=="block",
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0",
> NAME="raid_disk0"
> SUBSYSTEM=="block",
> ENV{PHYSDEVPATH}=="/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0",
> NAME="raid_disk1"

Well done!! This corrects the sg* link issue.

#mdadm --detail /dev/md0
(...)
     Number   Major   Minor   RaidDevice State
        0       8        0        0      active sync   /dev/block/8:0
        1       8       16        1      active sync   /dev/block/8:16
(...)

But we loop back to the original problem: mdadm shows '/dev/block/8:0' instead 
of '/dev/raid_disk0' (as I wish). This is because the symlink '/dev/block/8:0' 
is shorter than the real block device name '/dev/raid_disk0'.

The current heuristic prevents the renamed real block device from being shown 
as desired, at the price of showing symlinks from /dev/block/* instead.

Maybe mdadm could change this behavior by:

  o always preferring the real block device file over symlinks.
  o simply skipping symlinks.
  o changing the heuristic as proposed by my patch (prefer the name that is 
closer to /dev).
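
The "closer to /dev" heuristic could be sketched like this (a hypothetical
helper, not mdadm's actual code): among candidate names for one device,
prefer the one with the fewest path components rather than the fewest
characters.

```shell
# Hypothetical sketch of the "closer to /dev" heuristic: pick the
# candidate name with the smallest number of path components.
pick_name() {
  best=""; best_depth=999
  for name in "$@"; do
    # Count '/' characters as a proxy for path depth.
    depth=$(printf '%s' "$name" | tr -cd '/' | wc -c | tr -d ' ')
    if [ "$depth" -lt "$best_depth" ]; then
      best=$name
      best_depth=$depth
    fi
  done
  printf '%s\n' "$best"
}

pick_name /dev/block/8:0 /dev/raid_disk0   # -> /dev/raid_disk0
```

With this ordering, /dev/raid_disk0 (two components below /) wins over
/dev/block/8:0 (three), even though it is the longer string.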


Stéphane Bunel.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-10-21 10:19             ` Justin Piszcz
@ 2009-11-20 20:39               ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-20 20:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-raid, xfs, Alan Piszcz, asterisk-users, submit

Package: asterisk
Version: 1.6.2.0~dfsg~rc1-1

See below for issue:

On Wed, 21 Oct 2009, Justin Piszcz wrote:

>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>> 
>> 
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>> 
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>> 
>>>>> All pointing to log IO not completing.
>>>>> 
>>> ....
>>>> So far I do not have a reproducible test case,
>>> 
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low, it mainly serves out some samba shares.
>> 
>>> 
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue.  Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this on each time the kernel crashed.  So 
>> far
>> with 2.6.30.9 after ~48hrs+ it has not crashed.  So it appears to be some 
>> issue
>> between 2.6.30.9 and 2.6.31.x when this began happening.  Any 
>> recommendations
>> on how to catch this bug w/certain options enabled/etc?
>> 
>> 
>>> 
>>> Cheers,
>>> 
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@fromorbit.com
>>> 
>> 
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.(31).(x).
>
> Any further recommendations on how to debug this issue?  BTW: Do you view 
> this
> as an XFS bug or MD/VFS layer issue based on the logs/output thus far?
>
> Justin.
>
>

Found the root cause: the asterisk PBX software.  I use an SPA3102.
When someone called me, they accidentally dropped the connection, and I
called them back shortly afterwards.  It was during this window (and the
last time this happened) that the box froze, under multiple(!) kernels,
always while someone was calling.

I have removed asterisk but this is the version I was running:
~$ dpkg -l | grep -i asterisk
rc  asterisk                             1:1.6.2.0~dfsg~rc1-1             Open S

I don't know what asterisk is doing, but top did run before the crash:
asterisk was using 100% CPU and, as I noted before, all other processes
were in D-state.

When this bug occurs, it freezes I/O to all devices and the only way to recover
is to reboot the system.
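
As a generic aid for anyone hitting the same symptom (this is not from the
original report): the stuck tasks show up as uninterruptible sleep (D state)
in the process table, which can be filtered with a one-liner:

```shell
# Keep the header line plus any task whose state field starts with D.
dstate() { awk 'NR == 1 || $2 ~ /^D/'; }

# Live use would be:  ps -eo pid,stat,comm | dstate
# Demonstrated here on a captured sample (PIDs are made up):
printf 'PID STAT COMMAND\n4521 Ss master\n18407 D pickup\n' | dstate
```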

Just FYI if anyone else out there has their system crash when running asterisk.

Just out of curiosity, has anyone else running asterisk had such an issue? 
I was not running any special VoIP PCI cards/etc.

Justin.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-11-20 20:39               ` Justin Piszcz
  (?)
@ 2009-11-20 23:44                 ` Faidon Liambotis
  -1 siblings, 0 replies; 49+ messages in thread
From: Faidon Liambotis @ 2009-11-20 23:44 UTC (permalink / raw)
  To: Justin Piszcz, 557262
  Cc: linux-raid, Dave Chinner, linux-kernel, xfs, submit,
	asterisk-users, Alan Piszcz

Justin Piszcz wrote:
 > Found root cause-- root cause is asterisk PBX software.  I use an
SPA3102.
> When someone called me, they accidentally dropped the connection, I called
> them back in a short period.  It is during this time (and the last time)
> this happened that the box froze under multiple(!) kernels, always when
> someone was calling.
<snip>
> I don't know what asterisk is doing but top did run before the crash
> and asterisk was using 100% CPU and as I noted before all other processes
> were in D-state.
> 
> When this bug occurs, it freezes I/O to all devices and the only way to
> recover
> is to reboot the system.
That's obviously *not* the root cause.

It's not normal for an application that isn't even privileged to hang
all I/O and, subsequently, everything on the system.

This is almost certainly a kernel issue, and asterisk just does something
that triggers the bug.

Regards,
Faidon

_______________________________________________
-- Bandwidth and Colocation Provided by http://www.api-digital.com --

asterisk-users mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-users

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-11-20 23:44                 ` Faidon Liambotis
@ 2009-11-20 23:51                   ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-20 23:51 UTC (permalink / raw)
  To: Faidon Liambotis
  Cc: 557262, Dave Chinner, submit, linux-kernel, xfs, linux-raid,
	asterisk-users, Alan Piszcz



On Sat, 21 Nov 2009, Faidon Liambotis wrote:

> Justin Piszcz wrote:
> > Found root cause-- root cause is asterisk PBX software.  I use an
> SPA3102.
>> When someone called me, they accidentally dropped the connection, I called
>> them back in a short period.  It is during this time (and the last time)
>> this happened that the box froze under multiple(!) kernels, always when
>> someone was calling.
> <snip>
>> I don't know what asterisk is doing but top did run before the crash
>> and asterisk was using 100% CPU and as I noted before all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices and the only way to
>> recover
>> is to reboot the system.
> That's obviously *not* the root cause.
>
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently everything on a system.
>
> This is almost probably a kernel issue and asterisk just does something
> that triggers this bug.
>
> Regards,
> Faidon
>

It is possible, although I tried with several kernels (2.6.30.[0-9] &
2.6.31+) and never had a crash with earlier versions.  I installed asterisk
long ago, but it was always 1.4.x until recently.  Nasty bug :\

Justin.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-11-20 23:44                 ` Faidon Liambotis
@ 2009-11-21 14:29                   ` Roger Heflin
  -1 siblings, 0 replies; 49+ messages in thread
From: Roger Heflin @ 2009-11-21 14:29 UTC (permalink / raw)
  To: Faidon Liambotis
  Cc: Justin Piszcz, 557262, Dave Chinner, submit, linux-kernel, xfs,
	linux-raid, asterisk-users, Alan Piszcz

Faidon Liambotis wrote:
> Justin Piszcz wrote:
>  > Found root cause-- root cause is asterisk PBX software.  I use an
> SPA3102.
>> When someone called me, they accidentally dropped the connection, I called
>> them back in a short period.  It is during this time (and the last time)
>> this happened that the box froze under multiple(!) kernels, always when
>> someone was calling.
> <snip>
>> I don't know what asterisk is doing but top did run before the crash
>> and asterisk was using 100% CPU and as I noted before all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices and the only way to
>> recover
>> is to reboot the system.
> That's obviously *not* the root cause.
> 
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently everything on a system.
> 
> This is almost probably a kernel issue and asterisk just does something
> that triggers this bug.
> 
> Regards,
> Faidon


I had an application on 2.6.5 (SLES9) that would hang XFS.

The underlying application was multi-threaded, and both threads were 
doing full disk syncs every so often; sometimes, during the full disk 
sync, the XFS subsystem would deadlock.  It appeared to me that one sync 
held a lock and was waiting for another, while the other process held 
the second lock and was waiting for the first.  We were able to disable 
the full disk sync in the application, and the deadlock went away.  All 
non-XFS filesystems still worked and could still be accessed.  I did 
report the bug with some traces, but I don't believe anyone ever 
determined what the underlying issue was.
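
The minimal shape of that workload (an assumption on my part, not Roger's
actual application) is simply two processes issuing full syncs concurrently:

```shell
# Two concurrent full syncs, the pattern described as triggering the
# SLES9 XFS deadlock.  On a healthy system both return promptly.
sync &
sync &
wait
echo "both syncs returned"
```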



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Which kernel options should be enabled to find the root cause of this bug?
  2009-10-17 22:34 ` Justin Piszcz
@ 2009-11-24 13:08   ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-24 13:08 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz



On Sat, 17 Oct 2009, Justin Piszcz wrote:

> Hello,
>
> I have a system I recently upgraded from 2.6.30.x and after approximately 
> 24-48 hours--sometimes longer, the system cannot write any more files to disk 
> (luckily though I can still write to /dev/shm) -- to which I have
> saved the sysrq-t and sysrq-w output:
>
> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Configuration:
>
> $ cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md1 : 
> active raid1 sdb2[1] sda2[0]
>      136448 blocks [2/2] [UU]
>
> md2 : active raid1 sdb3[1] sda3[0]
>      129596288 blocks [2/2] [UU]
>
> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] 
> sdc1[0]
>      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>
> md0 : active raid1 sdb1[1] sda1[0]
>      16787776 blocks [2/2] [UU]
>
> $ mount
> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> udev on /dev type tmpfs (rw,mode=0755)
> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> /dev/md1 on /boot type ext3 (rw,noatime)
> /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Distribution: Debian Testing
> Arch: x86_64
>
> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
> persists.
>
> Here is a snippet of two processes in D-state, the first was not doing 
> anything, the second was mrtg.
>
> [121444.684000] pickup        D 0000000000000003     0 18407   4521 
> 0x00000000
> [121444.684000]  ffff880231dd2290 0000000000000086 0000000000000000 
> 0000000000000000
> [121444.684000]  000000000000ff40 000000000000c8c8 ffff880176794d10 
> ffff880176794f90
> [121444.684000]  000000032266dd08 ffff8801407a87f0 ffff8800280878d8 
> ffff880176794f90
> [121444.684000] Call Trace:
> [121444.684000]  [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
> [121444.684000]  [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
> [121444.684000]  [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
> [121444.684000]  [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
> [121444.684000]  [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
> [121444.684000]  [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
> [121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> [121444.684000] rateup        D 0000000000000000     0 18538  18465 
> 0x00000000
> [121444.684000]  ffff88023f8a8c10 0000000000000082 0000000000000000 
> ffff88023ea09ec8
> [121444.684000]  000000000000ff40 000000000000c8c8 ffff88023faace50 
> ffff88023faad0d0
> [121444.684000]  0000000300003e00 000000010720cc78 0000000000003e00 
> ffff88023faad0d0
> [121444.684000] Call Trace:
> [121444.684000]  [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
> [121444.684000]  [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
> [121444.684000]  [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
> [121444.684000]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [121444.684000]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [121444.684000]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
> [121444.684000]  [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
> [121444.684000]  [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
> [121444.684000]  [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
> [121444.684000]  [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
> [121444.684000]  [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
> [121444.684000]  [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
> [121444.684000]  [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
> [121444.684000]  [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
> [121444.684000]  [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
> [121444.684000]  [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
> [121444.684000]  [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
> [121444.684000]  [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
> [121444.684000]  [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
> [121444.684000]  [<ffffffff8108b21a>] ? __writepage+0xa/0x40
> [121444.684000]  [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
> [121444.684000]  [<ffffffff8108b210>] ? __writepage+0x0/0x40
> [121444.684000]  [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
> [121444.684000]  [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
> [121444.684000]  [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
> [121444.684000]  [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
> [121444.684000]  [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
> [121444.684000]  [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
> [121444.684000]  [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
> [121444.684000]  [<ffffffff810af556>] ? filp_close+0x56/0x90
> [121444.684000]  [<ffffffff810af636>] ? sys_close+0xa6/0x100
> [121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> Anyone know what is going on here?
>
> Justin.
>

In addition to using netconsole, which kernel options should be enabled
to better diagnose this issue?

Should I enable these to help track down this bug?

[ ]   XFS Debugging support (EXPERIMENTAL)
[ ] Compile the kernel with frame pointers

Are there any other options that will help determine the root cause of this
bug that are recommended?
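
A generic way to see which debug-related options a kernel build tree already
enables (the path and the option list are illustrative, not taken from any
reply in this thread):

```shell
# Inspect a kernel .config for options commonly used to diagnose hangs.
# The path is an example; point it at the actual build tree.
CONFIG="${KCONFIG:-.config}"
grep -E 'CONFIG_XFS_DEBUG|CONFIG_FRAME_POINTER|CONFIG_DETECT_HUNG_TASK' "$CONFIG" \
  || echo "none of the debug options found in $CONFIG"
```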

Justin.



^ permalink raw reply	[flat|nested] 49+ messages in thread

> [121444.684000]  [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
> [121444.684000]  [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
> [121444.684000]  [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
> [121444.684000]  [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
> [121444.684000]  [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
> [121444.684000]  [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
> [121444.684000]  [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
> [121444.684000]  [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
> [121444.684000]  [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
> [121444.684000]  [<ffffffff8108b21a>] ? __writepage+0xa/0x40
> [121444.684000]  [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
> [121444.684000]  [<ffffffff8108b210>] ? __writepage+0x0/0x40
> [121444.684000]  [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
> [121444.684000]  [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
> [121444.684000]  [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
> [121444.684000]  [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
> [121444.684000]  [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
> [121444.684000]  [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
> [121444.684000]  [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
> [121444.684000]  [<ffffffff810af556>] ? filp_close+0x56/0x90
> [121444.684000]  [<ffffffff810af636>] ? sys_close+0xa6/0x100
> [121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> Anyone know what is going on here?
>
> Justin.
>

In addition to using netconsole, which kernel options should be enabled
to better diagnose this issue?

Should I enable these to help track down this bug?

[ ]   XFS Debugging support (EXPERIMENTAL)
[ ] Compile the kernel with frame pointers

Are there any other recommended options that would help determine the root
cause of this bug?
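
For reference, whether those options are already set in the running kernel
can be checked from userspace; a minimal sketch, assuming the usual
Debian-style config location under /boot (the check_opt helper is
illustrative, not an existing tool):

```shell
#!/bin/sh
# Report whether a kernel config option is enabled; tries the Debian-style
# /boot/config-$(uname -r) first, then /proc/config.gz (CONFIG_IKCONFIG_PROC).
check_opt() {
    cfg="/boot/config-$(uname -r)"
    if grep -qs "^$1=y" "$cfg"; then
        echo "$1: enabled"
    elif [ -r /proc/config.gz ] && zcat /proc/config.gz | grep -q "^$1=y"; then
        echo "$1: enabled (via /proc/config.gz)"
    else
        echo "$1: not enabled (or config not found)"
    fi
}

for o in CONFIG_XFS_DEBUG CONFIG_FRAME_POINTER CONFIG_MAGIC_SYSRQ; do
    check_opt "$o"
done
```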

Justin.


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Which kernel options should be enabled to find the root cause of this bug?
  2009-11-24 13:08   ` Justin Piszcz
@ 2009-11-24 15:14     ` Eric Sandeen
  -1 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-11-24 15:14 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs

Justin Piszcz wrote:
> 
> 
> On Sat, 17 Oct 2009, Justin Piszcz wrote:
> 
>> Hello,
>>
>> I have a system I recently upgraded from 2.6.30.x and after
>> approximately 24-48 hours--sometimes longer, the system cannot write
>> any more files to disk (luckily though I can still write to /dev/shm)
>> -- to which I have
>> saved the sysrq-t and sysrq-w output:
>>
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt

Unfortunately it looks like a lot of the sysrq-t, at least, was lost.

The sysrq-w trace has the "show blocked state" start a ways down the file,
for anyone playing along at home ;)

Other things you might try are a sysrq-m to get memory state...

>> Configuration:
>>
>> $ cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md1 : active raid1 sdb2[1] sda2[0]
>>      136448 blocks [2/2] [UU]
>>
>> md2 : active raid1 sdb3[1] sda3[0]
>>      129596288 blocks [2/2] [UU]
>>
>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2]
>> sdd1[1] sdc1[0]
>>      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>
>> md0 : active raid1 sdb1[1] sda1[0]
>>      16787776 blocks [2/2] [UU]
>>
>> $ mount
>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>> udev on /dev type tmpfs (rw,mode=0755)
>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>> /dev/md1 on /boot type ext3 (rw,noatime)
>> /dev/md3 on /r/1 type xfs
>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>> nfsd on /proc/fs/nfsd type nfsd (rw)

Do you get the same behavior if you don't add the log options at mount time?
Kind of grasping at straws here for now ...

>> Distribution: Debian Testing
>> Arch: x86_64
>>
>> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
>> persists.
>>

...

> In addition to using netconsole, which kernel options should be enabled
> to better diagnose this issue?
> 
> Should I enable these to help track down this bug?
> 
> [ ]   XFS Debugging support (EXPERIMENTAL)
> [ ] Compile the kernel with frame pointers

The former probably won't hurt; the latter might give us better backtraces.

> Are there any other options that will help determine the root cause of this
> bug that are recommended?

Not that I can think of off hand ...

-Eric

> Justin.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: Which kernel options should be enabled to find the root cause of this bug?
  2009-11-24 15:14     ` Eric Sandeen
@ 2009-11-24 16:20       ` Justin Piszcz
  -1 siblings, 0 replies; 49+ messages in thread
From: Justin Piszcz @ 2009-11-24 16:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs



On Tue, 24 Nov 2009, Eric Sandeen wrote:

> Justin Piszcz wrote:
>>
>>
>> On Sat, 17 Oct 2009, Justin Piszcz wrote:
>>
>>> Hello,
>>>
>>> I have a system I recently upgraded from 2.6.30.x and after
>>> approximately 24-48 hours--sometimes longer, the system cannot write
>>> any more files to disk (luckily though I can still write to /dev/shm)
>>> -- to which I have
>>> saved the sysrq-t and sysrq-w output:
>>>
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Unfortunately it looks like a lot of the sysrq-t, at least, was lost.
Yes, when this occurred the first few times I could only grab what was in
dmesg and save it to the ramdisk; trying to access any filesystem other
than the tmpfs ramdisk (/dev/shm) would cause the process to lock up.
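
As a quick userspace cross-check of what sysrq-w reports, the tasks stuck
in uninterruptible sleep and the kernel function they are blocked in can be
listed with ps; a sketch assuming procps is installed:

```shell
# Print the header plus every task whose state starts with D
# (uninterruptible sleep); WCHAN is the kernel wait channel.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```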

>
> The sysrq-w trace has the "show blocked state" start a ways down the file,
> for anyone playing along at home ;)
>
> Other things you might try are a sysrq-m to get memory state...
I actually performed most of the useful sysrq-commands, please see
the following:

wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
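
Next time it wedges, those dumps could be scripted in one go; a
hypothetical helper (capture_sysrq is my name for it), writing to /dev/shm
since that is the only filesystem still writable, and requiring root plus
CONFIG_MAGIC_SYSRQ:

```shell
#!/bin/sh
# Trigger an informational SysRq via procfs and stash the dmesg tail on
# tmpfs, which stays writable even when all block-device I/O is hung.
capture_sysrq() {
    key=$1    # 'w' = blocked (D-state) tasks, 'm' = memory, 't' = all tasks
    if [ -w /proc/sysrq-trigger ]; then
        printf '%s' "$key" > /proc/sysrq-trigger
        dmesg | tail -n 500 > "/dev/shm/sysrq-${key}.txt"
        echo "saved /dev/shm/sysrq-${key}.txt"
    else
        echo "skipping '$key': need root and CONFIG_MAGIC_SYSRQ=y" >&2
        return 1
    fi
}

for k in w m t; do capture_sysrq "$k" || break; done
```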

>
>>> Configuration:
>>>
>>> $ cat /proc/mdstat
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md1 : active raid1 sdb2[1] sda2[0]
>>>      136448 blocks [2/2] [UU]
>>>
>>> md2 : active raid1 sdb3[1] sda3[0]
>>>      129596288 blocks [2/2] [UU]
>>>
>>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2]
>>> sdd1[1] sdc1[0]
>>>      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>>
>>> md0 : active raid1 sdb1[1] sda1[0]
>>>      16787776 blocks [2/2] [UU]
>>>
>>> $ mount
>>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>> udev on /dev type tmpfs (rw,mode=0755)
>>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>>> /dev/md1 on /boot type ext3 (rw,noatime)
>>> /dev/md3 on /r/1 type xfs
>>> (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>>> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Do you get the same behavior if you don't add the log options at mount time?
I have not tried disabling the log options; the logbufs and logbsize
settings have been in effect for a long time, and nobarrier was added
recently.  Could there be an issue using -o nobarrier on a raid1+xfs setup?

* Re: Which kernel options should be enabled to find the root cause of this bug?
  2009-11-24 16:20       ` Justin Piszcz
@ 2009-11-24 16:23         ` Eric Sandeen
  -1 siblings, 0 replies; 49+ messages in thread
From: Eric Sandeen @ 2009-11-24 16:23 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs

Justin Piszcz wrote:
> 
> On Tue, 24 Nov 2009, Eric Sandeen wrote:

...

>> Do you get the same behavior if you don't add the log options at mount time?
> I have not tried disabling the log options, although they have been in effect
> for a long time, (the logsbufs and bufsize and recently) the nobarrier
> support.  Could there be an issue using -o nobarrier on a raid1+xfs?

nobarrier should not cause problems.

-Eric

end of thread, other threads:[~2009-11-24 16:23 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-17 22:34 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
2009-10-17 22:34 ` Justin Piszcz
2009-10-18 20:17 ` Justin Piszcz
2009-10-18 20:17   ` Justin Piszcz
2009-10-19  3:04   ` Dave Chinner
2009-10-19  3:04     ` Dave Chinner
2009-10-19 10:18     ` Justin Piszcz
2009-10-19 10:18       ` Justin Piszcz
2009-10-20  0:33       ` Dave Chinner
2009-10-20  0:33         ` Dave Chinner
2009-10-20  8:33         ` Justin Piszcz
2009-10-20  8:33           ` Justin Piszcz
2009-10-21 10:19           ` Justin Piszcz
2009-10-21 10:19             ` Justin Piszcz
2009-10-21 14:17             ` mdadm --detail showing annoying device Stephane Bunel
2009-10-21 21:46               ` Neil Brown
2009-10-22 11:22                 ` Stephane Bunel
2009-10-29  3:44                   ` Neil Brown
2009-11-03  9:37                     ` Stephane Bunel
2009-11-03 10:09                       ` Beolach
2009-11-03 12:16                         ` Stephane Bunel
2009-10-22 11:29                 ` Mario 'BitKoenig' Holbe
2009-10-22 14:17                   ` Stephane Bunel
2009-10-22 16:00                     ` Stephane Bunel
2009-10-22 22:49             ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
2009-10-22 22:49               ` Justin Piszcz
2009-10-22 23:00               ` Dave Chinner
2009-10-22 23:00                 ` Dave Chinner
2009-10-26 11:24               ` Justin Piszcz
2009-10-26 11:24                 ` Justin Piszcz
2009-11-02 21:46                 ` Justin Piszcz
2009-11-02 21:46                   ` Justin Piszcz
2009-11-20 20:39             ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz
2009-11-20 20:39               ` Justin Piszcz
2009-11-20 23:44               ` Bug#557262: " Faidon Liambotis
2009-11-20 23:44                 ` Faidon Liambotis
2009-11-20 23:44                 ` Faidon Liambotis
2009-11-20 23:51                 ` Justin Piszcz
2009-11-20 23:51                   ` Justin Piszcz
2009-11-21 14:29                 ` Roger Heflin
2009-11-21 14:29                   ` Roger Heflin
2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? Justin Piszcz
2009-11-24 13:08   ` Justin Piszcz
2009-11-24 15:14   ` Eric Sandeen
2009-11-24 15:14     ` Eric Sandeen
2009-11-24 16:20     ` Justin Piszcz
2009-11-24 16:20       ` Justin Piszcz
2009-11-24 16:23       ` Eric Sandeen
2009-11-24 16:23         ` Eric Sandeen
