linux-kernel.vger.kernel.org archive mirror
* 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
@ 2009-10-17 22:34 Justin Piszcz
  2009-10-18 20:17 ` Justin Piszcz
  2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? Justin Piszcz
  0 siblings, 2 replies; 19+ messages in thread
From: Justin Piszcz @ 2009-10-17 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz

Hello,

I have a system I recently upgraded from 2.6.30.x, and after approximately
24-48 hours (sometimes longer) the system cannot write any more files to
disk. Luckily, I can still write to /dev/shm, which is where I saved the
sysrq-t and sysrq-w output:

http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
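
For anyone wanting to capture the same data when this happens, a minimal
sketch (assuming sysrq is enabled and, as here, tmpfs is still writable;
the output file names are just examples):

$ echo 1 > /proc/sys/kernel/sysrq   # allow all sysrq functions
$ echo w > /proc/sysrq-trigger      # dump blocked (D-state) tasks to the kernel log
$ dmesg > /dev/shm/sysrq-w.txt      # save the ring buffer to tmpfs
$ echo t > /proc/sysrq-trigger      # dump all task states
$ dmesg > /dev/shm/sysrq-t.txt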

Configuration:

$ cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md1 : active raid1 sdb2[1] sda2[0]
       136448 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
       129596288 blocks [2/2] [UU]

md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
       5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]

md0 : active raid1 sdb1[1] sda1[0]
       16787776 blocks [2/2] [UU]

$ mount
/dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw,noatime)
/dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

Distribution: Debian Testing
Arch: x86_64

The problem occurs with 2.6.31; I upgraded to 2.6.31.4 and the problem
persists.

Here is a snippet of two processes in D-state; the first was not doing
anything, the second was mrtg.

[121444.684000] pickup        D 0000000000000003     0 18407   4521 0x00000000
[121444.684000]  ffff880231dd2290 0000000000000086 0000000000000000 0000000000000000
[121444.684000]  000000000000ff40 000000000000c8c8 ffff880176794d10 ffff880176794f90
[121444.684000]  000000032266dd08 ffff8801407a87f0 ffff8800280878d8 ffff880176794f90
[121444.684000] Call Trace:
[121444.684000]  [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
[121444.684000]  [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
[121444.684000]  [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
[121444.684000]  [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
[121444.684000]  [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
[121444.684000]  [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
[121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

[121444.684000] rateup        D 0000000000000000     0 18538  18465 0x00000000
[121444.684000]  ffff88023f8a8c10 0000000000000082 0000000000000000 ffff88023ea09ec8
[121444.684000]  000000000000ff40 000000000000c8c8 ffff88023faace50 ffff88023faad0d0
[121444.684000]  0000000300003e00 000000010720cc78 0000000000003e00 ffff88023faad0d0
[121444.684000] Call Trace:
[121444.684000]  [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
[121444.684000]  [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
[121444.684000]  [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
[121444.684000]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[121444.684000]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[121444.684000]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
[121444.684000]  [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
[121444.684000]  [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
[121444.684000]  [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
[121444.684000]  [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
[121444.684000]  [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
[121444.684000]  [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
[121444.684000]  [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
[121444.684000]  [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
[121444.684000]  [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
[121444.684000]  [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
[121444.684000]  [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
[121444.684000]  [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
[121444.684000]  [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
[121444.684000]  [<ffffffff8108b21a>] ? __writepage+0xa/0x40
[121444.684000]  [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
[121444.684000]  [<ffffffff8108b210>] ? __writepage+0x0/0x40
[121444.684000]  [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
[121444.684000]  [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
[121444.684000]  [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
[121444.684000]  [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
[121444.684000]  [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
[121444.684000]  [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
[121444.684000]  [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
[121444.684000]  [<ffffffff810af556>] ? filp_close+0x56/0x90
[121444.684000]  [<ffffffff810af636>] ? sys_close+0xa6/0x100
[121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b

Anyone know what is going on here?

Justin.



* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-17 22:34 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
@ 2009-10-18 20:17 ` Justin Piszcz
  2009-10-19  3:04   ` Dave Chinner
  2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? Justin Piszcz
  1 sibling, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-10-18 20:17 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz



On Sat, 17 Oct 2009, Justin Piszcz wrote:

> Hello,

It has happened again; all sysrq-X output was saved this time.

wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt

Kernel configuration:

wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.30.9.txt
wget http://home.comcast.net/~jpiszcz/20091018/config-2.6.31.4.txt

Diff of the two configs:

$ diff config-2.6.30.9.txt config-2.6.31.4.txt |grep -v "#"|grep "_"
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_CONSTRUCTORS=y
> CONFIG_HAVE_PERF_COUNTERS=y
> CONFIG_HAVE_DMA_ATTRS=y
> CONFIG_BLK_DEV_BSG=y
> CONFIG_X86_NEW_MCE=y
> CONFIG_X86_THERMAL_VECTOR=y
< CONFIG_UNEVICTABLE_LRU=y
< CONFIG_PHYSICAL_START=0x200000
> CONFIG_PHYSICAL_START=0x1000000
< CONFIG_PHYSICAL_ALIGN=0x200000
> CONFIG_PHYSICAL_ALIGN=0x1000000
< CONFIG_COMPAT_NET_DEV_OPS=y
< CONFIG_SND_JACK=y
> CONFIG_HID_DRAGONRISE=y
> CONFIG_HID_GREENASIA=y
> CONFIG_HID_SMARTJOYPLUS=y
> CONFIG_HID_THRUSTMASTER=y
> CONFIG_HID_ZEROPLUS=y
> CONFIG_FSNOTIFY=y
> CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
> CONFIG_HAVE_ARCH_KMEMCHECK=y
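
If a kernel tree is handy, the same comparison is easier to read with the
diffconfig helper that recent trees ship in scripts/ (a sketch; it prints
added, removed and changed options explicitly):

$ scripts/diffconfig config-2.6.30.9.txt config-2.6.31.4.txt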

I have reverted to 2.6.30.9 to see whether the problem recurs with this
kernel version.

I do not recall seeing this on the older 2.6.30.x kernels:

[    9.276427] md3: detected capacity change from 0 to 5251073572864
[    9.277411] md2: detected capacity change from 0 to 132706598912
[    9.278305] md1: detected capacity change from 0 to 139722752
[    9.278921] md0: detected capacity change from 0 to 17190682624

Again, some more D-state processes:

[76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
[76325.608087] Call Trace:
[76325.608095]  [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
[76325.608099]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
[76325.608103]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
[76325.608106]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
[76325.608108]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40

[76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
[76325.608214] Call Trace:
[76325.608216]  [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
[76325.608220]  [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
[76325.608222]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
[76325.608225]  [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
[76325.608228]  [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
[76325.608231]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
[76325.608233]  [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
[76325.608236]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0

All of the D-state processes:

$ cat sysrq-w.txt  |grep ' D'
[76307.285125] alpine        D 0000000000000000     0  7659  29120 0x00000000
[76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
[76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
[76325.608257] syslogd       D 0000000000000002     0  2438      1 0x00000000
[76325.608318] freshclam     D 0000000000000000     0  2877      1 0x00000000
[76325.608428] asterisk      D 0000000000000001     0  3278      1 0x00000000
[76325.608492] console-kit-d D 0000000000000000     0  3299      1 0x00000000
[76325.608562] dhcpd3        D 0000000000000000     0  3554      1 0x00000000
[76325.608621] plasma-deskto D 0000000000000002     0 32482      1 0x00000000
[76325.608713] kaccess       D 0000000000000001     0 32488      1 0x00000000
[76325.608752] mail          D 0000000000000000     0  7397   7386 0x00000000
[76325.608830] hal-acl-tool  D 0000000000000000     0  7430   3399 0x00000004
[76325.608888] mrtg          D 0000000000000000     0  7444   7433 0x00000000
[76325.608981] cron          D 0000000000000000     0  7500   3630 0x00000000
[76325.609000] alpine        D 0000000000000000     0  7659  29120 0x00000000

List of functions underneath the D-state processes (sorted/uniqued)--

     121 [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
      77 [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
      62 [<ffffffff814543a5>] ? schedule_timeout+0x165/0x1a0
      60 [<ffffffff813bc1f6>] ? __alloc_skb+0x66/0x170
      60 [<ffffffff813b3e59>] ? sys_sendto+0x119/0x180
      59 [<ffffffff81428397>] ? unix_dgram_sendmsg+0x467/0x5c0
      59 [<ffffffff81427ce6>] ? unix_wait_for_peer+0x86/0xd0
      59 [<ffffffff813bd497>] ? memcpy_fromiovec+0x57/0x80
      59 [<ffffffff813b6c29>] ? sock_alloc_send_pskb+0x1d9/0x2f0
      59 [<ffffffff813b3a4b>] ? sock_sendmsg+0xcb/0x100
      59 [<ffffffff813b3062>] ? sockfd_lookup_light+0x22/0x80
      58 [<ffffffff814287ed>] ? unix_dgram_connect+0xad/0x270
      58 [<ffffffff813b3336>] ? sys_connect+0x86/0xe0
      57 [<ffffffff81427ed5>] ? unix_find_other+0x1a5/0x200
      57 [<ffffffff810c9d13>] ? mntput_no_expire+0x23/0xf0
      57 [<ffffffff810a3e74>] ? page_add_new_anon_rmap+0x54/0x90
      57 [<ffffffff8105947e>] ? current_fs_time+0x1e/0x30
      55 [<ffffffff81085445>] ? filemap_fault+0x95/0x3e0
       8 [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
       7 [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
       7 [<ffffffff810af727>] ? do_sys_open+0x97/0x150
       6 [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
       5 [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
       4 [<ffffffff811f5284>] ? xfs_file_fsync+0x54/0x70
       4 [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
       4 [<ffffffff811f0242>] ? kmem_zone_zalloc+0x32/0x50
       4 [<ffffffff811f01d3>] ? kmem_zone_alloc+0x83/0xc0
       4 [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
       4 [<ffffffff810d3a4b>] ? sys_fsync+0xb/0x20
       4 [<ffffffff810d39f6>] ? do_fsync+0x36/0x60
       4 [<ffffffff810d394e>] ? vfs_fsync+0x9e/0x110
       4 [<ffffffff810bbcde>] ? __link_path_walk+0x7e/0x1000
       3 [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
       3 [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
       3 [<ffffffff811f7b82>] ? xfs_vn_mknod+0x82/0x130
       3 [<ffffffff811eeab1>] ? xfs_fsync+0x141/0x190
       3 [<ffffffff811e8f1b>] ? _xfs_trans_commit+0x38b/0x3a0
       3 [<ffffffff811ddfac>] ? xlog_grant_log_space+0x28c/0x3c0
       3 [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
       3 [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
       3 [<ffffffff811dc1b0>] ? xfs_log_release_iclog+0x10/0x40
       3 [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
       3 [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
       3 [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
       3 [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
       3 [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
       3 [<ffffffff810c80ca>] ? alloc_fd+0x4a/0x140
       3 [<ffffffff810c2110>] ? pollwake+0x0/0x60
       3 [<ffffffff810c0b88>] ? poll_freewait+0x48/0xb0
       3 [<ffffffff810be8ee>] ? do_filp_open+0x9ee/0xac0
       3 [<ffffffff810be134>] ? do_filp_open+0x234/0xac0
       3 [<ffffffff810baeb6>] ? vfs_create+0xa6/0xf0
       3 [<ffffffff810b51d7>] ? vfs_fstatat+0x37/0x80
       3 [<ffffffff810ad46d>] ? kmem_cache_alloc+0x6d/0xa0
       3 [<ffffffff8104aca3>] ? __wake_up+0x43/0x70
       2 [<ffffffff81455797>] ? __down_write_nested+0x17/0xb0
       2 [<ffffffff81455151>] ? __down+0x61/0xa0
       2 [<ffffffff81454e85>] ? do_nanosleep+0x95/0xd0
       2 [<ffffffff81454dbd>] ? schedule_hrtimeout_range+0x11d/0x140
       2 [<ffffffff81454359>] ? schedule_timeout+0x119/0x1a0
       2 [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
       2 [<ffffffff811f4b82>] ? xfs_buf_read_flags+0x12/0xa0
       2 [<ffffffff811f4a4e>] ? xfs_buf_get_flags+0x6e/0x190
       2 [<ffffffff811f48f4>] ? _xfs_buf_find+0x134/0x220
       2 [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
       2 [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
       2 [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
       2 [<ffffffff811ed872>] ? xfs_create+0x312/0x530
       2 [<ffffffff811eb6e8>] ? xfs_dir_ialloc+0xa8/0x340
       2 [<ffffffff811ea4a6>] ? xfs_trans_read_buf+0x1e6/0x360
       2 [<ffffffff811dc337>] ? xlog_state_sync+0x157/0x2a0
       2 [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
       2 [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
       2 [<ffffffff810c31dc>] ? dput+0xac/0x160
       2 [<ffffffff810c29d3>] ? d_kill+0x53/0x70
       2 [<ffffffff810b9b38>] ? generic_permission+0x78/0x130
       2 [<ffffffff8109a9a5>] ? handle_mm_fault+0x1b5/0x780
       2 [<ffffffff810987fa>] ? __do_fault+0x3ca/0x4b0
       2 [<ffffffff8108cc30>] ? pdflush+0x0/0x220
       2 [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
       2 [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
       2 [<ffffffff8108b21a>] ? __writepage+0xa/0x40
       2 [<ffffffff8108b210>] ? __writepage+0x0/0x40
       2 [<ffffffff8108ab88>] ? __alloc_pages_nodemask+0x108/0x5f0
       2 [<ffffffff81084b6b>] ? find_get_page+0x1b/0xb0
       2 [<ffffffff8106e016>] ? down+0x46/0x50
       2 [<ffffffff8106d4e0>] ? sys_nanosleep+0x70/0x80
       2 [<ffffffff8106d3e2>] ? hrtimer_nanosleep+0xa2/0x130
       2 [<ffffffff8106d1ab>] ? __hrtimer_start_range_ns+0x12b/0x2a0
       2 [<ffffffff8106c960>] ? hrtimer_wakeup+0x0/0x30
       2 [<ffffffff81069bd8>] ? __wake_up_bit+0x28/0x30
       2 [<ffffffff81069886>] ? kthread+0xa6/0xb0
       2 [<ffffffff810697e0>] ? kthread+0x0/0xb0
       2 [<ffffffff8105efb0>] ? process_timeout+0x0/0x10
       2 [<ffffffff8105ee14>] ? try_to_del_timer_sync+0x54/0x60
       2 [<ffffffff8105eaa4>] ? lock_timer_base+0x34/0x70
       2 [<ffffffff8102d4ba>] ? child_rip+0xa/0x20
       2 [<ffffffff8102d4b0>] ? child_rip+0x0/0x20
       1 [<ffffffff81455b09>] ? _spin_lock_bh+0x9/0x20
       1 [<ffffffff81455857>] ? __down_read+0x17/0xae
       1 [<ffffffff814545d0>] ? __wait_on_bit+0x50/0x80
       1 [<ffffffff81454144>] ? io_schedule+0x34/0x50
       1 [<ffffffff81453741>] ? wait_for_common+0x151/0x180
       1 [<ffffffff81403c26>] ? tcp_write_xmit+0x206/0xa30
       1 [<ffffffff813f73b9>] ? tcp_sendmsg+0x859/0xb10
       1 [<ffffffff813b675f>] ? sk_reset_timer+0xf/0x20
       1 [<ffffffff813b6273>] ? release_sock+0x13/0xa0
       1 [<ffffffff813b270a>] ? sock_aio_write+0x13a/0x150
       1 [<ffffffff81272408>] ? tty_ldisc_try+0x48/0x60
       1 [<ffffffff8126c391>] ? tty_write+0x221/0x270
       1 [<ffffffff81221960>] ? swiotlb_map_page+0x0/0x100
       1 [<ffffffff81219361>] ? __up_read+0x21/0xc0
       1 [<ffffffff811fca29>] ? xfs_sync_worker+0x49/0x80
       1 [<ffffffff811fc993>] ? xfs_inode_ag_iterator+0x63/0xa0
       1 [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
       1 [<ffffffff811fc0ec>] ? xfssyncd+0x13c/0x1c0
       1 [<ffffffff811fbfb0>] ? xfssyncd+0x0/0x1c0
       1 [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
       1 [<ffffffff811f8645>] ? xfs_bdstrat_cb+0x45/0x50
       1 [<ffffffff811f8076>] ? xfs_vn_setattr+0x16/0x20
       1 [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
       1 [<ffffffff811f5423>] ? xfs_wait_on_pages+0x23/0x30
       1 [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
       1 [<ffffffff811f3f8b>] ? xfs_buf_rele+0x3b/0x100
       1 [<ffffffff811f3d65>] ? _xfs_buf_lookup_pages+0x265/0x340
       1 [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
       1 [<ffffffff811ef5e6>] ? xfs_setattr+0x826/0x880
       1 [<ffffffff811ee9c6>] ? xfs_fsync+0x56/0x190
       1 [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
       1 [<ffffffff811edb20>] ? xfs_lookup+0x90/0xe0
       1 [<ffffffff811ed96b>] ? xfs_create+0x40b/0x530
       1 [<ffffffff811eab8a>] ? xfs_trans_iget+0xda/0x100
       1 [<ffffffff811eaa48>] ? xfs_trans_ijoin+0x38/0xa0
       1 [<ffffffff811ea9d7>] ? xfs_trans_log_inode+0x27/0x60
       1 [<ffffffff811ea948>] ? xfs_trans_get_efd+0x28/0x40
       1 [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
       1 [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
       1 [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
       1 [<ffffffff811d50ff>] ? xfs_ialloc+0x52f/0x6f0
       1 [<ffffffff811d4c8e>] ? xfs_ialloc+0xbe/0x6f0
       1 [<ffffffff811d4c4e>] ? xfs_ialloc+0x7e/0x6f0
       1 [<ffffffff811d483a>] ? xfs_itruncate_finish+0x15a/0x320
       1 [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
       1 [<ffffffff811d17dd>] ? xfs_iget+0xfd/0x480
       1 [<ffffffff811d17cb>] ? xfs_iget+0xeb/0x480
       1 [<ffffffff811d0341>] ? xfs_dialloc+0x2e1/0xa70
       1 [<ffffffff811cee12>] ? xfs_ialloc_ag_select+0x222/0x320
       1 [<ffffffff811ceaaf>] ? xfs_ialloc_read_agi+0x1f/0x80
       1 [<ffffffff811ce9f1>] ? xfs_read_agi+0x71/0x110
       1 [<ffffffff811cbf90>] ? xfs_dir2_sf_addname+0x430/0x5c0
       1 [<ffffffff811c3a4f>] ? xfs_dir2_sf_to_block+0x9f/0x5c0
       1 [<ffffffff811c388a>] ? xfs_dir_createname+0x17a/0x1d0
       1 [<ffffffff811c2bda>] ? xfs_dir2_grow_inode+0x15a/0x3f0
       1 [<ffffffff811b4bf4>] ? xfs_bmap_finish+0x164/0x1b0
       1 [<ffffffff811a76fe>] ? xfs_free_extent+0x7e/0xc0
       1 [<ffffffff811a75a9>] ? xfs_alloc_fix_freelist+0x379/0x450
       1 [<ffffffff811a5450>] ? xfs_alloc_read_agf+0x30/0xd0
       1 [<ffffffff811a52f8>] ? xfs_read_agf+0x68/0x190
       1 [<ffffffff810e38cf>] ? sys_epoll_wait+0x22f/0x2e0
       1 [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
       1 [<ffffffff810d00f6>] ? writeback_inodes+0x46/0xe0
       1 [<ffffffff810cfe46>] ? generic_sync_sb_inodes+0x2e6/0x4b0
       1 [<ffffffff810cf6a9>] ? writeback_single_inode+0x1e9/0x460
       1 [<ffffffff810c7341>] ? notify_change+0x101/0x2f0
       1 [<ffffffff810c47da>] ? __d_lookup+0xaa/0x140
       1 [<ffffffff810c1ff0>] ? __pollwait+0x0/0x120
       1 [<ffffffff810c1f31>] ? sys_select+0x51/0x110
       1 [<ffffffff810c1b9f>] ? core_sys_select+0x1ff/0x310
       1 [<ffffffff810c182f>] ? do_select+0x4ff/0x670
       1 [<ffffffff810c0b1c>] ? poll_schedule_timeout+0x2c/0x50
       1 [<ffffffff810be5a0>] ? do_filp_open+0x6a0/0xac0
       1 [<ffffffff810bb851>] ? may_open+0x1c1/0x1f0
       1 [<ffffffff810b9e50>] ? get_write_access+0x20/0x60
       1 [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
       1 [<ffffffff810b2233>] ? sys_write+0x53/0xa0
       1 [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
       1 [<ffffffff810b060e>] ? do_truncate+0x5e/0x80
       1 [<ffffffff810af636>] ? sys_close+0xa6/0x100
       1 [<ffffffff810af556>] ? filp_close+0x56/0x90
       1 [<ffffffff810ace06>] ? cache_alloc_refill+0x96/0x590
       1 [<ffffffff8108d71a>] ? pagevec_lookup_tag+0x1a/0x30
       1 [<ffffffff8108cd40>] ? pdflush+0x110/0x220
       1 [<ffffffff8108beb6>] ? wb_kupdate+0xb6/0x140
       1 [<ffffffff8108be00>] ? wb_kupdate+0x0/0x140
       1 [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
       1 [<ffffffff810859d3>] ? wait_on_page_writeback_range+0xc3/0x140
       1 [<ffffffff81084fac>] ? wait_on_page_bit+0x6c/0x80
       1 [<ffffffff81084e83>] ? find_lock_page+0x23/0x80
       1 [<ffffffff81084d95>] ? sync_page+0x35/0x60
       1 [<ffffffff81084d60>] ? sync_page+0x0/0x60
       1 [<ffffffff8106ee8e>] ? sched_clock_cpu+0x6e/0x250
       1 [<ffffffff81069c50>] ? wake_bit_function+0x0/0x30
       1 [<ffffffff81069c29>] ? autoremove_wake_function+0x9/0x30
       1 [<ffffffff81064e09>] ? sys_setpriority+0x89/0x240
       1 [<ffffffff8105444e>] ? do_fork+0x16e/0x360
       1 [<ffffffff810512bf>] ? try_to_wake_up+0xaf/0x1d0
       1 [<ffffffff8104ad17>] ? task_rq_lock+0x47/0x90
       1 [<ffffffff8104a99b>] ? __wake_up_common+0x5b/0x90
       1 [<ffffffff81049bcf>] ? sched_slice+0x5f/0x90
       1 [<ffffffff81034200>] ? sys_vfork+0x20/0x30
       1 [<ffffffff8102c853>] ? stub_vfork+0x13/0x20
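
For reference, a counted list like the one above can be regenerated from the
captures posted earlier with something along these lines (a sketch, using
sysrq-t.txt as the example input; grep -o keeps only the address/symbol part
of each trace line):

$ grep -o '\[<ffffffff[0-9a-f]*>\] ? [^ ]*' sysrq-t.txt | sort | uniq -c | sort -rn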


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-18 20:17 ` Justin Piszcz
@ 2009-10-19  3:04   ` Dave Chinner
  2009-10-19 10:18     ` Justin Piszcz
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2009-10-19  3:04 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again; all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095]  [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
> [76325.608099]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [76325.608106]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [76325.608108]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216]  [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
> [76325.608220]  [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222]  [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
> [76325.608225]  [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
> [76325.608228]  [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233]  [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236]  [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:

All pointing to log IO not completing.

That is, all of the D state processes are backed up on locks or
waiting for IO completion processing. A lot of the processes are
waiting for _xfs_log_force to complete, others are waiting for
inodes to be unpinned or are backed up behind locked inodes that are
waiting on log IO to complete before they can complete the
transaction and unlock the inode, and so on.

Unfortunately, the xfslogd and xfsdatad kernel threads are not
present in any of the output given, so I can't tell if these have
deadlocked themselves and caused the problem. However, my experience
with such pile-ups is that an I/O completion has not been run for
some reason and that is the cause of the problem. I don't know if
you can provide enough information to tell us if this happened or
not. Instead, do you have a test case that you can share?
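
One quick way to confirm whether those threads were captured at all is a
grep over the files posted above (a sketch):

$ grep -E 'xfslogd|xfsdatad' sysrq-t.txt sysrq-w.txt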

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-19  3:04   ` Dave Chinner
@ 2009-10-19 10:18     ` Justin Piszcz
  2009-10-20  0:33       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-10-19 10:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Mon, 19 Oct 2009, Dave Chinner wrote:

> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>> It has happened again; all sysrq-X output was saved this time.
>>
>> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
>> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
> .....
>
> All pointing to log IO not completing.
>
> That is, all of the D state processes are backed up on locks or
> waiting for IO completion processing. A lot of the processes are
> waiting for _xfs_log_force to complete, others are waiting for
> inodes to be unpinned or are backed up behind locked inodes that are
> waiting on log IO to complete before they can complete the
> transaction and unlock the inode, and so on.
>
> Unfortunately, the xfslogd and xfsdatad kernel threads are not
> present in any of the output given, so I can't tell if these have
> deadlocked themselves and caused the problem. However, my experience
> with such pile-ups is that an I/O completion has not been run for
> some reason and that is the cause of the problem. I don't know if
> you can provide enough information to tell us if this happened or
> not. Instead, do you have a test case that you can share?
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Hello,

So far I do not have a reproducible test case. The only other thing not
posted was the output of ps auxww at the time of the lockup; I'm not sure
if it will help, but here it is:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2] 
root         2  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/1]
root         6  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/1]
root         7  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/2]
root         8  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/2]
root         9  0.0  0.0      0     0 ?        S<   Oct16   0:00 [migration/3]
root        10  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksoftirqd/3]
root        11  0.0  0.0      0     0 ?        R<   Oct16   0:00 [events/0]
root        12  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/1]
root        13  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/2]
root        14  0.0  0.0      0     0 ?        S<   Oct16   0:00 [events/3]
root        15  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khelper]
root        20  0.0  0.0      0     0 ?        S<   Oct16   0:00 [async/mgr]
root       180  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/0]
root       181  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/1]
root       182  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/2]
root       183  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kblockd/3]
root       185  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpid]
root       186  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpi_notify]
root       187  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kacpi_hotplug]
root       271  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/0]
root       272  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/1]
root       273  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/2]
root       274  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata/3]
root       275  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ata_aux]
root       276  0.0  0.0      0     0 ?        S<   Oct16   0:00 [ksuspend_usbd]
root       280  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khubd]
root       283  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kseriod]
root       318  0.0  0.0      0     0 ?        S<   Oct16   0:00 [khpsbpkt]
root       361  0.0  0.0      0     0 ?        S    Oct16   0:00 [pdflush]
root       362  0.0  0.0      0     0 ?        D    Oct16   0:43 [pdflush]
root       363  0.0  0.0      0     0 ?        S<   Oct16   0:21 [kswapd0]
root       364  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/0]
root       365  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/1]
root       366  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/2]
root       367  0.0  0.0      0     0 ?        S<   Oct16   0:00 [aio/3]
root       368  0.0  0.0      0     0 ?        S<   Oct16   0:00 [nfsiod]
root       369  0.0  0.0      0     0 ?        S<   Oct16   0:00 [cifsoplockd]
root       370  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfs_mru_cache]
root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
root       518  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_0]
root       521  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_1]
root       524  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_2]
root       527  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_3]
root       530  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_4]
root       533  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_5]
root       542  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_6]
root       545  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_7]
root       551  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_8]
root       554  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_9]
root       558  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_10]
root       562  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_11]
root       568  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_12]
root       571  0.0  0.0      0     0 ?        S<   Oct16   0:00 [scsi_eh_13]
root       584  0.0  0.0      0     0 ?        S<   Oct16   0:00 [knodemgrd_0]
root       616  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kpsmoused]
root       666  0.0  0.0      0     0 ?        S<   Oct16   0:00 [usbhid_resumer]
root       683  0.0  0.0      0     0 ?        S<   Oct16   0:00 [hd-audio0]
root       703  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/0]
root       704  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/1]
root       705  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/2]
root       706  0.0  0.0      0     0 ?        S<   Oct16   0:00 [rpciod/3]
root       811  4.0  0.0      0     0 ?        S<   Oct16  81:54 [md3_raid5]
root       817  0.3  0.0      0     0 ?        S<   Oct16   6:30 [md2_raid1]
root       823  0.0  0.0      0     0 ?        S<   Oct16   0:00 [md1_raid1]
root       827  0.0  0.0      0     0 ?        S<   Oct16   0:50 [md0_raid1]
root       829  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsbufd]
root       830  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsaild]
root       831  0.0  0.0      0     0 ?        D<   Oct16   0:00 [xfssyncd]
root       884  0.0  0.0  16736   740 ?        S<s  Oct16   0:00 udevd --daemon
postfix   1649  0.0  0.0  39124  2468 ?        S    05:00   0:00 qmgr -l -t fifo -u -c
www-data  1877  0.0  0.0 146612  5248 ?        S    05:01   0:00 /usr/sbin/apache2 -k start
root      3182  0.0  0.0      0     0 ?        S<   Oct16   0:00 [kjournald]
root      3183  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsbufd]
root      3184  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsaild]
root      3185  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfssyncd]
root      3337  0.0  0.0   8072   132 ?        Ss   Oct16   0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth0.conf -d
root      3339  0.0  0.0   8072   424 ?        Ds   Oct16   0:00 /app/ulogd-1.24-x86_64/sbin/ulogd -c /app/ulogd-1.24-x86_64/etc/ulogd-eth1.conf -d
daemon    3526  0.0  0.0   8016   564 ?        Ss   Oct16   0:00 /sbin/portmap
statd     3538  0.0  0.0  10148   776 ?        Ss   Oct16   0:00 /sbin/rpc.statd
root      3547  0.0  0.0  26956   572 ?        Ss   Oct16   0:00 /usr/sbin/rpc.idmapd
root      3732  0.0  0.0   5900   676 ?        Ds   Oct16   0:00 /sbin/syslogd
root      3741  0.0  0.0   3796   452 ?        Ss   Oct16   0:00 /sbin/klogd -x
root      3750  0.0  0.0   3804   640 ?        Ss   Oct16   0:00 /usr/sbin/acpid
110       3760  0.0  0.0  23560  1476 ?        Ss   Oct16   0:02 /usr/bin/dbus-daemon --system
bind      3773  0.0  0.8 264364 70004 ?        Ssl  Oct16   0:05 /usr/sbin/named -u bind -S 1024
root      3796  0.0  0.0  49028  1260 ?        Ss   Oct16   0:00 /usr/sbin/sshd
root      3827  0.0  0.0 104804  7452 ?        Ssl  Oct16   0:08 /usr/sbin/console-kit-daemon
amavis    3909  0.0  1.1 217948 91004 ?        Ss   Oct16   0:00 amavisd (master)
nobody    3913  0.0  0.2 118612 21632 ?        Sl   Oct16   0:02 /app/gross-1.0.1-x86_64/sbin/grossd -f /etc/grossd/grossd.conf -p /var/run/grossd/grossd.pid
polw      3932  0.0  0.1  51868 12084 ?        Ss   Oct16   0:00 policyd-weight (master)
polw      3933  0.0  0.1  51868 11744 ?        Ss   Oct16   0:00 policyd-weight (cache)
postfw    3937  0.0  0.1  55248 14064 ?        Ss   Oct16   0:00 /usr/sbin/postfwd --summary=0 --cache=600 --cache-rdomain-only --cache-no-size --daemon --file=/etc/postfwd/postfwd.cf --interface=127.0.0.1 --port=10040 --user=postfw --group=postfw --pidfile=/var/run/postfwd.pid
postgrey  3940  0.0  0.1  57900 13232 ?        Ss   Oct16   0:00 /usr/sbin/postgrey --pidfile=/var/run/postgrey.pid --daemonize --inet=127.0.0.1:60000 --greylist-action=421 
clamav    4166  0.0  1.3 215576 112772 ?       Ssl  Oct16   0:04 /usr/sbin/clamd
clamav    4262  0.0  0.0  25776  1328 ?        Ss   Oct16   0:00 /usr/bin/freshclam -d --quiet
amavis    4307  0.0  1.1 224072 96732 ?        S    Oct16   0:02 amavisd (ch14-avail)
amavis    4308  0.0  1.2 228168 100820 ?       S    Oct16   0:05 amavisd (ch14-avail)
root      4353  0.0  0.0   7224  3944 ?        S    Oct16   0:02 /usr/sbin/hddtemp -d -l 127.0.0.1 -p 7634 -s | /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
root      4362  0.0  0.0   8864   520 ?        Ss   Oct16   0:00 /usr/sbin/irqbalance
daemon    4377  0.0  0.0  38488   840 ?        Ss   Oct16   0:00 lpd Waiting 
root      4412  0.0  0.0      0     0 ?        S<   Oct16   0:00 [lockd]
root      4413  0.0  0.0      0     0 ?        S<   Oct16   0:10 [nfsd]
root      4414  0.0  0.0      0     0 ?        D<   Oct16   0:03 [nfsd]
root      4415  0.0  0.0      0     0 ?        D<   Oct16   0:13 [nfsd]
root      4416  0.0  0.0      0     0 ?        S<   Oct16   0:07 [nfsd]
root      4417  0.0  0.0      0     0 ?        D<   Oct16   0:11 [nfsd]
root      4418  0.0  0.0      0     0 ?        D<   Oct16   0:01 [nfsd]
root      4419  0.0  0.0      0     0 ?        D<   Oct16   0:04 [nfsd]
root      4420  0.0  0.0      0     0 ?        D<   Oct16   0:02 [nfsd]
root      4424  0.0  0.0  18812  1208 ?        Ss   Oct16   0:00 /usr/sbin/rpc.mountd --manage-gids
oident    4432  0.0  0.0  12232   576 ?        Ss   Oct16   0:00 /usr/sbin/oidentd -m -u oident -g oident
root      4444  0.0  0.0  10124   660 ?        Ss   Oct16   0:00 /usr/sbin/inetd
root      4521  0.0  0.0  37020  2388 ?        Ss   Oct16   0:00 /usr/lib/postfix/master
root      4545  0.0  0.0  58084  1796 ?        Ds   Oct16   0:00 /usr/sbin/nmbd -D
root      4547  0.0  0.0  93724  3044 ?        Ss   Oct16   0:00 /usr/sbin/smbd -D
root      4568  0.0  0.0  93752  1776 ?        S    Oct16   0:00 /usr/sbin/smbd -D
root      4700  0.0  0.0  18768  1348 ?        S    Oct16   0:00 /usr/sbin/smartd --pidfile /var/run/smartd.pid
asterisk  4714  1.3  0.2 666972 24460 ?        Ssl  Oct16  26:36 /usr/sbin/asterisk -p -U asterisk
asterisk  4715  0.0  0.0  13656   880 ?        S    Oct16   0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
ntp       4734  0.0  0.0  23428  1400 ?        Ss   Oct16   0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 105:106 -g
root      4789  0.0  0.0  16644   780 ?        Ss   Oct16   0:00 /usr/sbin/dovecot
root      4790  0.0  0.0  74664  3196 ?        S    Oct16   0:00 dovecot-auth
root      4792  0.0  0.0  74664  3240 ?        S    Oct16   0:00 dovecot-auth -w
111       4797  0.0  0.0  38084  4616 ?        Ss   Oct16   0:07 /usr/sbin/hald
root      4798  0.0  0.0  20040  1432 ?        S    Oct16   0:01 hald-runner
root      4895  0.0  0.0  22028  1240 ?        S    Oct16   0:00 hald-addon-input: Listening on /dev/input/event1 /dev/input/event0 /dev/input/event3
root      4899  0.0  0.0  22032  1232 ?        S    Oct16   0:00 hald-addon-storage: no polling on /dev/sr0 because it is explicitly disabled
root      4900  0.0  0.0  22028  1228 ?        S    Oct16   0:00 hald-addon-storage: no polling on /dev/fd0 because it is explicitly disabled
111       4902  0.0  0.0  25920  1204 ?        S    Oct16   0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
root      4911  0.0  0.0  12484   656 ?        Ss   Oct16   0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root      4929  0.0  0.0  24020   584 ?        Ss   Oct16   0:00 /usr/sbin/squid -D -YC
proxy     4931  0.0  0.0  26900  5136 ?        S    Oct16   0:00 (squid) -D -YC
root      4942  0.0  0.0  28924  3548 ?        Ss   Oct16   0:00 /usr/bin/perl -T /usr/lib/postfix/p0f-analyzer.pl 2345
root      4943  0.0  0.0  17392  1380 ?        S    Oct16   0:00 sh -c p0f -u daemon -i eth1 -l 'tcp dst port 25' 2>&1
daemon    4945  0.3  0.0  16596  3172 ?        S    Oct16   7:01 p0f -u daemon -i eth1 -l tcp dst port 25
root      4947  0.0  0.0  93604  5832 ?        SNs  Oct16   0:10 /usr/bin/perl -w /app/mailgraph-1.14/bin/mailgraph.pl -l /var/log/mail.log -d --ignore-localhost --rbl-is-spam --daemon-pid=/var/run/mailgraph.pid --daemon-rrd=/var/lib/mailgraph
root      4951  0.0  0.0   9296  3716 ?        Ss   Oct16   0:00 /usr/sbin/dhcpd3 -q eth0
nut       4964  0.0  0.0  14384   744 ?        Ss   Oct16   0:15 /lib/nut/usbhid-ups -a apc
nut       4966  0.0  0.0  14316   600 ?        Ss   Oct16   0:01 /sbin/upsd
root      4968  0.0  0.0  14284   728 ?        Ss   Oct16   0:00 /sbin/upsmon
nut       4970  0.0  0.0  14284   712 ?        S    Oct16   0:01 /sbin/upsmon
daemon    5007  0.0  0.0  16356   416 ?        Ss   Oct16   0:00 /usr/sbin/atd
root      5027  0.0  0.0  20988  1060 ?        Ss   Oct16   0:00 /usr/sbin/cron
dovecot   5887  0.0  0.0  18504  2108 ?        S    15:31   0:00 imap-login
root      6184  0.0  0.0  73496  6380 ?        Sl   Oct16   0:00 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock
root      6205  0.0  0.0  26292   720 ?        Ss   Oct16   0:00 /usr/bin/kdm -config /var/run/kdm/kdmrc
root      6213  0.1  1.3 192020 112852 tty7    Ss+  Oct16   2:34 /usr/bin/X -br -nolisten tcp :0 vt7 -auth /var/run/xauth/A:0-y1Rrr6
root      6284  0.0  0.0   5852   600 tty1     Ss+  Oct16   0:00 /sbin/getty 38400 tty1
root      6285  0.0  0.0   5852   596 tty2     Ss+  Oct16   0:00 /sbin/getty 38400 tty2
root      6286  0.0  0.0   5852   596 tty3     Ss+  Oct16   0:00 /sbin/getty 38400 tty3
root      6287  0.0  0.0   5852   600 tty4     Ss+  Oct16   0:00 /sbin/getty 38400 tty4
root      6288  0.0  0.0   5852   596 tty5     Ss+  Oct16   0:00 /sbin/getty 38400 tty5
root      6289  0.0  0.0   5852   600 tty6     Ss+  Oct16   0:00 /sbin/getty 38400 tty6
root      6297  0.0  0.0  55716  1864 ?        S    Oct16   0:00 -:0 
postfix   6362  0.0  0.0  39076  2308 ?        S    15:36   0:00 anvil -l -t unix -u -o max_idle=3600s
root      7003  0.0  0.0  80904  3412 ?        Ss   Oct16   0:00 sshd: ap [priv] 
polw      7443  0.0  0.1  52000 12516 ?        S    Oct16   0:00 policyd-weight (child)
dovecot   8922  0.0  0.0  18504  2108 ?        S    16:01   0:00 imap-login
dovecot   8923  0.0  0.0  18504  2112 ?        S    16:01   0:00 imap-login
postfix  18407  0.0  0.0  39076  2352 ?        D    17:50   0:00 pickup -l -t fifo -u -c -o receive_override_options=no_header_body_checks
asterisk 18424  0.0  0.0  36996  2260 ?        D    17:54   0:00 /usr/sbin/postdrop -r
root     18425  0.0  0.0  46372  1440 ?        S    17:55   0:00 /USR/SBIN/CRON
root     18459  0.0  0.0  11404  1328 ?        Ss   17:55   0:00 /bin/sh -c if [ -x /usr/bin/mrtg ] && [ -r /etc/mrtg.cfg ]; then mkdir -p /var/log/mrtg ; env LANG=C /usr/bin/mrtg /etc/mrtg.cfg 2>&1 | tee -a /var/log/mrtg/mrtg.log ; fi
root     18460  0.0  0.0  21888  1432 ?        D    17:55   0:00 /usr/lib/hal/hal-acl-tool --reconfigure
root     18465  0.0  0.2  43416 16892 ?        S    17:55   0:00 /usr/bin/perl -w /usr/bin/mrtg /etc/mrtg.cfg
root     18479  0.0  0.0   4140   612 ?        S    17:55   0:00 tee -a /var/log/mrtg/mrtg.log
root     18538  0.0  0.0  38476  1976 ?        D    17:55   0:00 /usr/bin/rateup /var/www/monitor/mrtg/ eth0 1255816502 -Z u 21723041186 43737397048 125000000 c #00cc00 #0000ff #006600 #ff00ff k 1000 i /var/www/monitor/mrtg/eth0-day.png -125000000 -125000000 400 100 1 1 1 300 0 4 1 %Y-%m-%d %H:%M 0 i /var/www/monitor/mrtg/eth0-week.png -125000000 -125000000 400 100 1 1 1 1800 0 4 1 %Y-%m-%d %H:%M 0
postfix  18539  0.0  0.1  50740  9836 ?        S    17:55   0:00 cleanup -z -t unix -u -c
root     18555  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18556  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18557  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18558  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18559  0.0  0.0  46368  1172 ?        S    18:00   0:00 /USR/SBIN/CRON
root     18560  0.0  0.0  46368  1172 ?        S    18:02   0:00 /USR/SBIN/CRON
root     18561  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18562  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18563  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18564  0.0  0.0  46368  1172 ?        S    18:05   0:00 /USR/SBIN/CRON
root     18565  0.0  0.0  46368  1172 ?        S    18:09   0:00 /USR/SBIN/CRON
root     18566  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18567  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18568  0.0  0.0  46368  1172 ?        S    18:10   0:00 /USR/SBIN/CRON
root     18578  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18579  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18580  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18581  0.0  0.0  46368  1172 ?        S    18:15   0:00 /USR/SBIN/CRON
root     18582  0.0  0.0  46368  1172 ?        S    18:17   0:00 /USR/SBIN/CRON
root     18583  0.0  0.0  46368  1172 ?        S    18:18   0:00 /USR/SBIN/CRON
postfix  18584  0.0  0.0  71452  8104 ?        S    18:18   0:00 smtpd -n 75.144.35.65:smtp -t inet -u -c -o stress= -o stress=yes -o content_filter=amavisfeed:[127.0.0.1]:10024 -o receive_override_options=no_address_mappings
postfix  18585  0.0  0.0  39076  2300 ?        S    18:18   0:00 proxymap -t unix -u
sshd     18597  0.0  0.0      0     0 ?        Z    18:19   0:00 [sshd] <defunct>
root     18601  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     18602  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     18603  0.0  0.0  46368  1172 ?        S    18:20   0:00 /USR/SBIN/CRON
root     30866  0.0  0.0  54392   944 ?        Ss   04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30867  0.0  0.0  54392   652 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30868  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30869  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     30870  0.0  0.0  54392   540 ?        S    04:31   0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 5
root     31018  0.0  0.0 146612  7856 ?        Ss   04:31   0:00 /usr/sbin/apache2 -k start
www-data 31025  0.0  0.0 146612  5248 ?        S    04:31   0:00 /usr/sbin/apache2 -k start
www-data 31044  0.0  0.0 146612  5248 ?        S    04:31   0:00 /usr/sbin/apache2 -k start
root     31076  0.0  0.0  23128   756 ?        Ss   04:31   0:00 pure-ftpd (SERVER)
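
For what it's worth, the D-state rows in a listing like this can be filtered
out with something like the following sketch (STAT is the 8th column of
ps aux output):

$ ps auxww | awk '$8 ~ /^D/'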



* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-19 10:18     ` Justin Piszcz
@ 2009-10-20  0:33       ` Dave Chinner
  2009-10-20  8:33         ` Justin Piszcz
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2009-10-20  0:33 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
> On Mon, 19 Oct 2009, Dave Chinner wrote:
>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>> It has happened again; all sysrq-X output was saved this time.
>> .....
>>
>> All pointing to log IO not completing.
>>
....
> So far I do not have a reproducible test case.

Ok. What sort of load is being placed on the machine?

> The only other thing not posted was the output of ps auxww at the time of
> the lockup; I'm not sure if it will help, but here it is:
>
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2] 
....
> root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
> root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
> root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
> root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
> root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
> root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
> root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
> root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
> root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
> root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
> root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
> root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
.....

It appears that both the xfslogd and the xfsdatad on CPU 0 are in
the running state but don't appear to be consuming any significant
CPU time. If they remain like this then I think that means they are
stuck waiting on the run queue.  Do these XFS threads always appear
like this when the hang occurs? If so, is there something else that
is hogging CPU 0 preventing these threads from getting the CPU?
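
A sketch of one way to look for a CPU 0 hog the next time it hangs (psr is
the CPU a task last ran on; pid, psr, stat, pcpu and comm are standard ps
output format specifiers):

$ ps -eLo pid,psr,stat,pcpu,comm --sort=-pcpu | awk 'NR > 1 && $2 == 0'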

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-20  0:33       ` Dave Chinner
@ 2009-10-20  8:33         ` Justin Piszcz
  2009-10-21 10:19           ` Justin Piszcz
  0 siblings, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-10-20  8:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Tue, 20 Oct 2009, Dave Chinner wrote:

> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>> It has happened again; all sysrq-X output was saved this time.
>>> .....
>>>
>>> All pointing to log IO not completing.
>>>
> ....
>> So far I do not have a reproducible test case,
>
> Ok. What sort of load is being placed on the machine?
Hello, generally the load is low; it mainly serves out some samba shares.

>
>> The only other thing not posted was the output of ps auxww at the time of
>> the lockup; I'm not sure if it will help, but here it is:
>>
>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>> root         1  0.0  0.0  10320   684 ?        Ss   Oct16   0:00 init [2]
> ....
>> root       371  0.0  0.0      0     0 ?        R<   Oct16   0:01 [xfslogd/0]
>> root       372  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/1]
>> root       373  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/2]
>> root       374  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfslogd/3]
>> root       375  0.0  0.0      0     0 ?        R<   Oct16   0:00 [xfsdatad/0]
>> root       376  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsdatad/1]
>> root       377  0.0  0.0      0     0 ?        S<   Oct16   0:03 [xfsdatad/2]
>> root       378  0.0  0.0      0     0 ?        S<   Oct16   0:01 [xfsdatad/3]
>> root       379  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/0]
>> root       380  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/1]
>> root       381  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/2]
>> root       382  0.0  0.0      0     0 ?        S<   Oct16   0:00 [xfsconvertd/3]
> .....
>
> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
> the running state but don't appear to be consuming any significant
> CPU time. If they remain like this then I think that means they are
> stuck waiting on the run queue.  Do these XFS threads always appear
> like this when the hang occurs? If so, is there something else that
> is hogging CPU 0 preventing these threads from getting the CPU?
Yes, the XFS threads show up like this each time the kernel crashed. So far,
with 2.6.30.9, it has not crashed after 48+ hours, so it appears to be an
issue introduced between 2.6.30.9 and 2.6.31.x. Any recommendations on how
to catch this bug with certain debug options enabled, etc.?
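
For example, would the usual hang/lock debugging help? A sketch of what
could be checked in the test kernel's config (standard Kconfig symbols,
not options anyone in this thread has suggested; the /boot path assumes
the config was installed there):

$ grep -E 'CONFIG_(DETECT_HUNG_TASK|PROVE_LOCKING|DEBUG_SPINLOCK|XFS_DEBUG)' /boot/config-$(uname -r)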


>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>


* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-20  8:33         ` Justin Piszcz
@ 2009-10-21 10:19           ` Justin Piszcz
  2009-10-22 22:49             ` Justin Piszcz
  2009-11-20 20:39             ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz
  0 siblings, 2 replies; 19+ messages in thread
From: Justin Piszcz @ 2009-10-21 10:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Tue, 20 Oct 2009, Justin Piszcz wrote:


>
>
> On Tue, 20 Oct 2009, Dave Chinner wrote:
>
>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>> It has happened again; all sysrq-X output was saved this time.
>>>> .....
>>>> 
>>>> All pointing to log IO not completing.
>>>> 
>> ....
>>> So far I do not have a reproducible test case.
>> 
>> Ok. What sort of load is being placed on the machine?
> Hello, generally the load is low; it mainly serves out some samba shares.
>
>> 
>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>> the running state but don't appear to be consuming any significant
>> CPU time. If they remain like this then I think that means they are
>> stuck waiting on the run queue.  Do these XFS threads always appear
>> like this when the hang occurs? If so, is there something else that
>> is hogging CPU 0 preventing these threads from getting the CPU?
> Yes, the XFS threads show up like this each time the kernel crashed. So far,
> with 2.6.30.9, it has not crashed after 48+ hours, so it appears to be an
> issue introduced between 2.6.30.9 and 2.6.31.x. Any recommendations on how
> to catch this bug with certain debug options enabled, etc.?
>
>
>> 
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com
>> 
>

Uptime with 2.6.30.9:

  06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07

No issues yet, so it first started happening in 2.6.31.x.

Any further recommendations on how to debug this issue?  BTW: Do you view this
as an XFS bug or MD/VFS layer issue based on the logs/output thus far?

Justin.



* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-21 10:19           ` Justin Piszcz
@ 2009-10-22 22:49             ` Justin Piszcz
  2009-10-22 23:00               ` Dave Chinner
  2009-10-26 11:24               ` Justin Piszcz
  2009-11-20 20:39             ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz
  1 sibling, 2 replies; 19+ messages in thread
From: Justin Piszcz @ 2009-10-22 22:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Wed, 21 Oct 2009, Justin Piszcz wrote:

>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>> 
>> 
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>> 
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again; all sysrq-X output was saved this time.
>>>>> .....
>>>>> 
>>>>> All pointing to log IO not completing.
>>>>> 
>>> ....
>>>> So far I do not have a reproducible test case,
>>> 
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low; it mainly serves out some samba shares.
>> 
>>> 
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue.  Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this each time the kernel crashed.  So
>> far, with 2.6.30.9 running for ~48hrs+, it has not crashed.  So it appears
>> to be some issue introduced between 2.6.30.9 and 2.6.31.x.  Any
>> recommendations on how to catch this bug with certain options enabled,
>> etc.?
>> 
>> 
>>> 
>>> Cheers,
>>> 
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@fromorbit.com
>>> 
>> 
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.31.x.
>
> Any further recommendations on how to debug this issue?  BTW: Do you view
> this as an XFS bug or an MD/VFS layer issue, based on the logs/output thus
> far?
>
> Justin.
>
>

Any other ideas?

Currently stuck on 2.6.30.9 (no issues, no lockups).  The box normally has
no load at all either.  Has anyone else reported similar problems?

Justin.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-22 22:49             ` Justin Piszcz
@ 2009-10-22 23:00               ` Dave Chinner
  2009-10-26 11:24               ` Justin Piszcz
  1 sibling, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2009-10-22 23:00 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

On Thu, Oct 22, 2009 at 06:49:46PM -0400, Justin Piszcz wrote:
> On Wed, 21 Oct 2009, Justin Piszcz wrote:
>> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>>> the running state but don't appear to be consuming any significant
>>>> CPU time. If they remain like this then I think that means they are
>>>> stuck waiting on the run queue.  Do these XFS threads always appear
>>>> like this when the hang occurs? If so, is there something else that
>>>> is hogging CPU 0 preventing these threads from getting the CPU?
>>> Yes, the XFS threads show up like this each time the kernel crashed.
>>> So far, with 2.6.30.9 running for ~48hrs+, it has not crashed.  So it
>>> appears to be some issue introduced between 2.6.30.9 and 2.6.31.x.  Any
>>> recommendations on how to catch this bug with certain options enabled,
>>> etc.?
>>
>> Uptime with 2.6.30.9:
>>
>> 06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07
>>
>> No issues yet, so it first started happening in 2.6.31.x.

Ok.

>> Any further recommendations on how to debug this issue?  BTW: Do
>> you view this as an XFS bug or an MD/VFS layer issue, based on the
>> logs/output thus far?

Could be either. Nothing so far points at a cause.

> Any other ideas?

If it is relatively quick to reproduce, you could run a git bisect
to try to find the offending commit.  Or, when it has locked up, run
oprofile with callgraph sampling so we can get an idea of what is
actually running when XFS appears to hang.
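
A sketch of the oprofile side, assuming the opcontrol-based userspace of
that era (the vmlinux path is illustrative):

  # opcontrol --init
  # opcontrol --setup --vmlinux=/usr/src/linux/vmlinux --callgraph=16
  # opcontrol --start
  ... reproduce the hang ...
  # opcontrol --dump
  # opreport --callgraph > callgraph.txt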

> Currently stuck on 2.6.30.9 (no issues, no lockups).  The box normally has
> no load at all either.  Has anyone else reported similar problems?

Not that I know of.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-22 22:49             ` Justin Piszcz
  2009-10-22 23:00               ` Dave Chinner
@ 2009-10-26 11:24               ` Justin Piszcz
  2009-11-02 21:46                 ` Justin Piszcz
  1 sibling, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-10-26 11:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Thu, 22 Oct 2009, Justin Piszcz wrote:

>
> Any other ideas?
>
> Currently stuck on 2.6.30.9 (no issues, no lockups).  The box normally has
> no load at all either.  Has anyone else reported similar problems?
>
> Justin.
>

--

Currently running 2.6.31-rc1 for 2 days now with no crashes; I will go to
-rc2 later today and wait another 48 hours.

Justin.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
  2009-10-26 11:24               ` Justin Piszcz
@ 2009-11-02 21:46                 ` Justin Piszcz
  0 siblings, 0 replies; 19+ messages in thread
From: Justin Piszcz @ 2009-11-02 21:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz



On Mon, 26 Oct 2009, Justin Piszcz wrote:

>
>
> On Thu, 22 Oct 2009, Justin Piszcz wrote:
>
>> 
>> Any other ideas?
>> 
>> Currently stuck on 2.6.30.9 (no issues, no lockups).  The box normally has
>> no load at all either.  Has anyone else reported similar problems?
>> 
>> Justin.
>> 
>
> --
>
> Currently running 2.6.31-rc1 for 2 days now with no crashes; I will go to
> -rc2 later today and wait another 48 hours.
>
> Justin.
>
>

Kernel report:

2.6.31-rc1: no crash - uptime 2+ days
2.6.31-rc2: no crash - uptime 2+ days
2.6.31-rc3: no crash - uptime 2+ days
2.6.31-rc4: no crash but network kept dropping out
2.6.31-rc5: cannot test, service owner needs host available
2.6.31-rc6: cannot test, service owner needs host available
2.6.31-rc7: cannot test, service owner needs host available
2.6.31-rc8: cannot test, service owner needs host available
2.6.31-rc9: cannot test, service owner needs host available
   2.6.31.x: locks up in D-state

So the offending change would be somewhere between -rc4 and 2.6.31.x.
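
A git bisect over that range would look something like this (a sketch,
assuming a clone of the mainline tree; each step needs a build, a boot,
and a ~48-hour soak):

  $ git bisect start
  $ git bisect bad v2.6.31
  $ git bisect good v2.6.31-rc4
  ... build, boot, and test the kernel git checks out ...
  $ git bisect good    # or: git bisect bad
  $ git bisect reset   # when finished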

Justin.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-10-21 10:19           ` Justin Piszcz
  2009-10-22 22:49             ` Justin Piszcz
@ 2009-11-20 20:39             ` Justin Piszcz
  2009-11-20 23:44               ` Bug#557262: " Faidon Liambotis
  1 sibling, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-11-20 20:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-raid, xfs, Alan Piszcz, asterisk-users, submit

Package: asterisk
Version: 1.6.2.0~dfsg~rc1-1

See below for issue:

On Wed, 21 Oct 2009, Justin Piszcz wrote:

>
>
> On Tue, 20 Oct 2009, Justin Piszcz wrote:
>
>
>> 
>> 
>> On Tue, 20 Oct 2009, Dave Chinner wrote:
>> 
>>> On Mon, Oct 19, 2009 at 06:18:58AM -0400, Justin Piszcz wrote:
>>>> On Mon, 19 Oct 2009, Dave Chinner wrote:
>>>>> On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
>>>>>> It has happened again, all sysrq-X output was saved this time.
>>>>> .....
>>>>> 
>>>>> All pointing to log IO not completing.
>>>>> 
>>> ....
>>>> So far I do not have a reproducible test case,
>>> 
>>> Ok. What sort of load is being placed on the machine?
>> Hello, generally the load is low; it mainly serves out some samba shares.
>> 
>>> 
>>> It appears that both the xfslogd and the xfsdatad on CPU 0 are in
>>> the running state but don't appear to be consuming any significant
>>> CPU time. If they remain like this then I think that means they are
>>> stuck waiting on the run queue.  Do these XFS threads always appear
>>> like this when the hang occurs? If so, is there something else that
>>> is hogging CPU 0 preventing these threads from getting the CPU?
>> Yes, the XFS threads show up like this each time the kernel crashed.  So
>> far, with 2.6.30.9 running for ~48hrs+, it has not crashed.  So it appears
>> to be some issue introduced between 2.6.30.9 and 2.6.31.x.  Any
>> recommendations on how to catch this bug with certain options enabled,
>> etc.?
>> 
>> 
>>> 
>>> Cheers,
>>> 
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@fromorbit.com
>>> 
>> 
>
> Uptime with 2.6.30.9:
>
> 06:18:41 up 2 days, 14:10, 14 users,  load average: 0.41, 0.21, 0.07
>
> No issues yet, so it first started happening in 2.6.31.x.
>
> Any further recommendations on how to debug this issue?  BTW: Do you view
> this as an XFS bug or an MD/VFS layer issue, based on the logs/output thus
> far?
>
> Justin.
>
>

Found the root cause: the asterisk PBX software.  I use an SPA3102.
When someone called me, the connection was accidentally dropped, and I
called them back a short time later.  It was during this window (this
time and the last time) that the box froze, under multiple(!) kernels,
always while someone was calling.

I have removed asterisk, but this is the version I was running:
~$ dpkg -l | grep -i asterisk
rc  asterisk                             1:1.6.2.0~dfsg~rc1-1             Open S

I don't know what asterisk is doing, but top did run before the crash:
asterisk was using 100% CPU and, as I noted before, all other processes
were in D-state.

When this bug occurs, it freezes I/O to all devices, and the only way to
recover is to reboot the system.

Just an FYI in case anyone else out there has their system crash when
running asterisk.

Just out of curiosity, has anyone else running asterisk had such an issue?
I was not running any special VoIP PCI cards, etc.

Justin.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-11-20 20:39             ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz
@ 2009-11-20 23:44               ` Faidon Liambotis
  2009-11-20 23:51                 ` Justin Piszcz
  2009-11-21 14:29                 ` Roger Heflin
  0 siblings, 2 replies; 19+ messages in thread
From: Faidon Liambotis @ 2009-11-20 23:44 UTC (permalink / raw)
  To: Justin Piszcz, 557262
  Cc: Dave Chinner, submit, linux-kernel, xfs, linux-raid,
	asterisk-users, Alan Piszcz

Justin Piszcz wrote:
> Found the root cause: the asterisk PBX software.  I use an SPA3102.
> When someone called me, the connection was accidentally dropped, and I
> called them back a short time later.  It was during this window (this
> time and the last time) that the box froze, under multiple(!) kernels,
> always while someone was calling.
<snip>
> I don't know what asterisk is doing, but top did run before the crash:
> asterisk was using 100% CPU and, as I noted before, all other processes
> were in D-state.
>
> When this bug occurs, it freezes I/O to all devices, and the only way
> to recover is to reboot the system.
That's obviously *not* the root cause.

It's not normal for an application that isn't even privileged to hang
all I/O and, subsequently, everything on the system.

This is almost certainly a kernel issue; asterisk just does something
that triggers the bug.

Regards,
Faidon

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-11-20 23:44               ` Bug#557262: " Faidon Liambotis
@ 2009-11-20 23:51                 ` Justin Piszcz
  2009-11-21 14:29                 ` Roger Heflin
  1 sibling, 0 replies; 19+ messages in thread
From: Justin Piszcz @ 2009-11-20 23:51 UTC (permalink / raw)
  To: Faidon Liambotis
  Cc: 557262, Dave Chinner, submit, linux-kernel, xfs, linux-raid,
	asterisk-users, Alan Piszcz



On Sat, 21 Nov 2009, Faidon Liambotis wrote:

> Justin Piszcz wrote:
>> Found the root cause: the asterisk PBX software.  I use an SPA3102.
>> When someone called me, the connection was accidentally dropped, and I
>> called them back a short time later.  It was during this window (this
>> time and the last time) that the box froze, under multiple(!) kernels,
>> always while someone was calling.
> <snip>
>> I don't know what asterisk is doing, but top did run before the crash:
>> asterisk was using 100% CPU and, as I noted before, all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices, and the only way
>> to recover is to reboot the system.
> That's obviously *not* the root cause.
>
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently, everything on the system.
>
> This is almost certainly a kernel issue; asterisk just does something
> that triggers the bug.
>
> Regards,
> Faidon
>

It is possible, although I tried with several kernels (2.6.30.[0-9] and
2.6.31+) and never had a crash with earlier versions; I installed asterisk
long ago, but it was always 1.4.x until recently.  Nasty bug :\

Justin.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Bug#557262: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk
  2009-11-20 23:44               ` Bug#557262: " Faidon Liambotis
  2009-11-20 23:51                 ` Justin Piszcz
@ 2009-11-21 14:29                 ` Roger Heflin
  1 sibling, 0 replies; 19+ messages in thread
From: Roger Heflin @ 2009-11-21 14:29 UTC (permalink / raw)
  To: Faidon Liambotis
  Cc: Justin Piszcz, 557262, Dave Chinner, submit, linux-kernel, xfs,
	linux-raid, asterisk-users, Alan Piszcz

Faidon Liambotis wrote:
> Justin Piszcz wrote:
>> Found the root cause: the asterisk PBX software.  I use an SPA3102.
>> When someone called me, the connection was accidentally dropped, and I
>> called them back a short time later.  It was during this window (this
>> time and the last time) that the box froze, under multiple(!) kernels,
>> always while someone was calling.
> <snip>
>> I don't know what asterisk is doing, but top did run before the crash:
>> asterisk was using 100% CPU and, as I noted before, all other processes
>> were in D-state.
>>
>> When this bug occurs, it freezes I/O to all devices, and the only way
>> to recover is to reboot the system.
> That's obviously *not* the root cause.
> 
> It's not normal for an application that isn't even privileged to hang
> all I/O and, subsequently, everything on the system.
>
> This is almost certainly a kernel issue; asterisk just does something
> that triggers the bug.
> 
> Regards,
> Faidon


I had an application on 2.6.5 (SLES9) that would hang XFS.

The underlying application was multi-threaded, and both threads were
doing full disk syncs every so often.  Sometimes, when doing the full
disk sync, the XFS subsystem would deadlock; it appeared to me that one
sync held a lock and was waiting for another, while the other process
held the second lock and was waiting for the first.  We were able to
disable the full disk sync in the application, and the deadlock went
away.  All non-XFS filesystems still worked and could still be accessed.
I did report the bug with some traces, but I don't believe anyone ever
determined what the underlying issue was.
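
A minimal sketch of that kind of load (assuming, as a guess, that the
syncs were plain sync(2)/sync(1) calls) is just two concurrent sync
loops:

  $ while :; do sync; sleep 5; done &
  $ while :; do sync; sleep 5; done &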



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Which kernel options should be enabled to find the root cause of this bug?
  2009-10-17 22:34 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
  2009-10-18 20:17 ` Justin Piszcz
@ 2009-11-24 13:08 ` Justin Piszcz
  2009-11-24 15:14   ` Eric Sandeen
  1 sibling, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-11-24 13:08 UTC (permalink / raw)
  To: linux-kernel, linux-raid, xfs; +Cc: Alan Piszcz



On Sat, 17 Oct 2009, Justin Piszcz wrote:

> Hello,
>
> I have a system I recently upgraded from 2.6.30.x and after approximately 
> 24-48 hours--sometimes longer, the system cannot write any more files to disk 
> (luckily though I can still write to /dev/shm) -- to which I have
> saved the sysrq-t and sysrq-w output:
>
> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Configuration:
>
> $ cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md1 : active raid1 sdb2[1] sda2[0]
>      136448 blocks [2/2] [UU]
>
> md2 : active raid1 sdb3[1] sda3[0]
>      129596288 blocks [2/2] [UU]
>
> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
>      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>
> md0 : active raid1 sdb1[1] sda1[0]
>      16787776 blocks [2/2] [UU]
>
> $ mount
> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> proc on /proc type proc (rw,noexec,nosuid,nodev)
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> udev on /dev type tmpfs (rw,mode=0755)
> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> /dev/md1 on /boot type ext3 (rw,noatime)
> /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Distribution: Debian Testing
> Arch: x86_64
>
> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
> persists.
>
> Here is a snippet of two processes in D-state, the first was not doing 
> anything, the second was mrtg.
>
> [121444.684000] pickup        D 0000000000000003     0 18407   4521 0x00000000
> [121444.684000]  ffff880231dd2290 0000000000000086 0000000000000000 0000000000000000
> [121444.684000]  000000000000ff40 000000000000c8c8 ffff880176794d10 ffff880176794f90
> [121444.684000]  000000032266dd08 ffff8801407a87f0 ffff8800280878d8 ffff880176794f90
> [121444.684000] Call Trace:
> [121444.684000]  [<ffffffff810a742d>] ? free_pages_and_swap_cache+0x9d/0xc0
> [121444.684000]  [<ffffffff81454866>] ? __mutex_lock_slowpath+0xd6/0x160
> [121444.684000]  [<ffffffff814546ba>] ? mutex_lock+0x1a/0x40
> [121444.684000]  [<ffffffff810b26ef>] ? generic_file_llseek+0x2f/0x70
> [121444.684000]  [<ffffffff810b119e>] ? sys_lseek+0x7e/0x90
> [121444.684000]  [<ffffffff8109ffd2>] ? sys_munmap+0x52/0x80
> [121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> [121444.684000] rateup        D 0000000000000000     0 18538  18465 0x00000000
> [121444.684000]  ffff88023f8a8c10 0000000000000082 0000000000000000 ffff88023ea09ec8
> [121444.684000]  000000000000ff40 000000000000c8c8 ffff88023faace50 ffff88023faad0d0
> [121444.684000]  0000000300003e00 000000010720cc78 0000000000003e00 ffff88023faad0d0
> [121444.684000] Call Trace:
> [121444.684000]  [<ffffffff811f42e2>] ? xfs_buf_iorequest+0x42/0x90
> [121444.684000]  [<ffffffff811dd66d>] ? xlog_bdstrat_cb+0x3d/0x50
> [121444.684000]  [<ffffffff811db05b>] ? xlog_sync+0x20b/0x4e0
> [121444.684000]  [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [121444.684000]  [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [121444.684000]  [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [121444.684000]  [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
> [121444.684000]  [<ffffffff811a7223>] ? xfs_alloc_ag_vextent+0x123/0x130
> [121444.684000]  [<ffffffff811a7aa8>] ? xfs_alloc_vextent+0x368/0x4b0
> [121444.684000]  [<ffffffff811b41e8>] ? xfs_bmap_btalloc+0x598/0xa40
> [121444.684000]  [<ffffffff811b6a42>] ? xfs_bmapi+0x9e2/0x11a0
> [121444.684000]  [<ffffffff811dd7f0>] ? xlog_grant_push_ail+0x30/0xf0
> [121444.684000]  [<ffffffff811e8fd8>] ? xfs_trans_reserve+0xa8/0x220
> [121444.684000]  [<ffffffff811d805e>] ? xfs_iomap_write_allocate+0x23e/0x3b0
> [121444.684000]  [<ffffffff811f0daf>] ? __xfs_get_blocks+0x8f/0x220
> [121444.684000]  [<ffffffff811d8c00>] ? xfs_iomap+0x2c0/0x300
> [121444.684000]  [<ffffffff810d5b76>] ? __set_page_dirty+0x66/0xd0
> [121444.684000]  [<ffffffff811f0d15>] ? xfs_map_blocks+0x25/0x30
> [121444.684000]  [<ffffffff811f1e04>] ? xfs_page_state_convert+0x414/0x6c0
> [121444.684000]  [<ffffffff811f23b7>] ? xfs_vm_writepage+0x77/0x130
> [121444.684000]  [<ffffffff8108b21a>] ? __writepage+0xa/0x40
> [121444.684000]  [<ffffffff8108baff>] ? write_cache_pages+0x1df/0x3c0
> [121444.684000]  [<ffffffff8108b210>] ? __writepage+0x0/0x40
> [121444.684000]  [<ffffffff810b1533>] ? do_sync_write+0xe3/0x130
> [121444.684000]  [<ffffffff8108bd30>] ? do_writepages+0x20/0x40
> [121444.684000]  [<ffffffff81085abd>] ? __filemap_fdatawrite_range+0x4d/0x60
> [121444.684000]  [<ffffffff811f54dd>] ? xfs_flush_pages+0xad/0xc0
> [121444.684000]  [<ffffffff811ee907>] ? xfs_release+0x167/0x1d0
> [121444.684000]  [<ffffffff811f52b0>] ? xfs_file_release+0x10/0x20
> [121444.684000]  [<ffffffff810b2c0d>] ? __fput+0xcd/0x1e0
> [121444.684000]  [<ffffffff810af556>] ? filp_close+0x56/0x90
> [121444.684000]  [<ffffffff810af636>] ? sys_close+0xa6/0x100
> [121444.684000]  [<ffffffff8102c52b>] ? system_call_fastpath+0x16/0x1b
>
> Anyone know what is going on here?
>
> Justin.
>

In addition to using netconsole, which kernel options should be enabled
to better diagnose this issue?
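
For reference, a typical netconsole setup is a boot parameter of the
form below (addresses are illustrative), with the remote side capturing
the log via netcat:

  netconsole=6665@192.168.1.10/eth0,6666@192.168.1.2/00:11:22:33:44:55
  remote$ nc -u -l -p 6666 | tee netconsole.log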

Should I enable these to help track down this bug?

[ ]   XFS Debugging support (EXPERIMENTAL)
[ ] Compile the kernel with frame pointers

Are there any other recommended options that would help determine the root
cause of this bug?
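
For reference, the corresponding .config symbols (names from a
2.6.31-era tree) would be something like:

  CONFIG_XFS_DEBUG=y
  CONFIG_FRAME_POINTER=y
  CONFIG_MAGIC_SYSRQ=y
  CONFIG_NETCONSOLE=y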

Justin.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Which kernel options should be enabled to find the root cause of this bug?
  2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? Justin Piszcz
@ 2009-11-24 15:14   ` Eric Sandeen
  2009-11-24 16:20     ` Justin Piszcz
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2009-11-24 15:14 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz

Justin Piszcz wrote:
> 
> 
> On Sat, 17 Oct 2009, Justin Piszcz wrote:
> 
>> Hello,
>>
>> I have a system I recently upgraded from 2.6.30.x and after
>> approximately 24-48 hours--sometimes longer, the system cannot write
>> any more files to disk (luckily though I can still write to /dev/shm)
>> -- to which I have
>> saved the sysrq-t and sysrq-w output:
>>
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt

Unfortunately it looks like a lot of the sysrq-t, at least, was lost.

In the sysrq-w trace, the "show blocked state" output starts a ways down
the file, for anyone playing along at home ;)

Another thing you might try is sysrq-m to get the memory state...
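
For reference, sysrq-m can also be triggered from a shell, assuming
sysrq is enabled:

  # echo 1 > /proc/sys/kernel/sysrq
  # echo m > /proc/sysrq-trigger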

>> Configuration:
>>
>> $ cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md1 : active raid1 sdb2[1] sda2[0]
>>      136448 blocks [2/2] [UU]
>>
>> md2 : active raid1 sdb3[1] sda3[0]
>>      129596288 blocks [2/2] [UU]
>>
>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
>>      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>
>> md0 : active raid1 sdb1[1] sda1[0]
>>      16787776 blocks [2/2] [UU]
>>
>> $ mount
>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>> udev on /dev type tmpfs (rw,mode=0755)
>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>> /dev/md1 on /boot type ext3 (rw,noatime)
>> /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>> nfsd on /proc/fs/nfsd type nfsd (rw)

Do you get the same behavior if you don't add the log options at mount time?
Kind of grasping at straws here for now ...
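
Concretely, that would mean mounting the data volume with only the basic
options; a sketch, using the device and mount point from the mount
output above:

  # umount /r/1
  # mount -t xfs -o noatime /dev/md3 /r/1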

>> Distribution: Debian Testing
>> Arch: x86_64
>>
>> The problem occurs with 2.6.31 and I upgraded to 2.6.31.4 and the problem
>> persists.
>>

...

> In addition to using netconsole, which kernel options should be enabled
> to better diagnose this issue?
> 
> Should I enable these to help track down this bug?
> 
> [ ]   XFS Debugging support (EXPERIMENTAL)
> [ ] Compile the kernel with frame pointers

The former probably won't hurt; the latter might give us better backtraces.

> Are there any other recommended options that would help determine the root
> cause of this bug?

Not that I can think of off hand ...

-Eric

> Justin.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Which kernel options should be enabled to find the root cause of this bug?
  2009-11-24 15:14   ` Eric Sandeen
@ 2009-11-24 16:20     ` Justin Piszcz
  2009-11-24 16:23       ` Eric Sandeen
  0 siblings, 1 reply; 19+ messages in thread
From: Justin Piszcz @ 2009-11-24 16:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs



On Tue, 24 Nov 2009, Eric Sandeen wrote:

> Justin Piszcz wrote:
>>
>>
>> On Sat, 17 Oct 2009, Justin Piszcz wrote:
>>
>>> Hello,
>>>
>>> I have a system I recently upgraded from 2.6.30.x and after
>>> approximately 24-48 hours--sometimes longer, the system cannot write
>>> any more files to disk (luckily though I can still write to /dev/shm)
>>> -- to which I have
>>> saved the sysrq-t and sysrq-w output:
>>>
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-w.txt
>>> http://home.comcast.net/~jpiszcz/20091017/sysrq-t.txt
>
> Unfortunately it looks like a lot of the sysrq-t, at least, was lost.
Yes, when this occurred the first few times, I could only grab what was in
dmesg to the ramdisk; trying to access any filesystem other than the ramdisk
(tmpfs, /dev/shm) would cause the process to lock up.

>
> In the sysrq-w trace, the "show blocked state" output starts a ways down
> the file, for anyone playing along at home ;)
>
> Another thing you might try is sysrq-m to get the memory state...
I actually performed most of the useful sysrq commands; please see
the following:

wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt

>
>>> Configuration:
>>>
>>> $ cat /proc/mdstat
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md1 : active raid1 sdb2[1] sda2[0]
>>>      136448 blocks [2/2] [UU]
>>>
>>> md2 : active raid1 sdb3[1] sda3[0]
>>>      129596288 blocks [2/2] [UU]
>>>
>>> md3 : active raid5 sdj1[7] sdi1[6] sdh1[5] sdf1[3] sdg1[4] sde1[2] sdd1[1] sdc1[0]
>>>      5128001536 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
>>>
>>> md0 : active raid1 sdb1[1] sda1[0]
>>>      16787776 blocks [2/2] [UU]
>>>
>>> $ mount
>>> /dev/md2 on / type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
>>> proc on /proc type proc (rw,noexec,nosuid,nodev)
>>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>> udev on /dev type tmpfs (rw,mode=0755)
>>> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
>>> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
>>> /dev/md1 on /boot type ext3 (rw,noatime)
>>> /dev/md3 on /r/1 type xfs (rw,noatime,nobarrier,logbufs=8,logbsize=262144)
>>> rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
>>> nfsd on /proc/fs/nfsd type nfsd (rw)
>
> Do you get the same behavior if you don't add the log options at mount time?
I have not tried disabling the log options, although they have been in
effect for a long time (logbufs and logbsize; nobarrier was added more
recently).  Could there be an issue using -o nobarrier on RAID1+XFS?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Which kernel options should be enabled to find the root cause of this bug?
  2009-11-24 16:20     ` Justin Piszcz
@ 2009-11-24 16:23       ` Eric Sandeen
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Sandeen @ 2009-11-24 16:23 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, Alan Piszcz, linux-kernel, xfs

Justin Piszcz wrote:
> 
> On Tue, 24 Nov 2009, Eric Sandeen wrote:

...

>> Do you get the same behavior if you don't add the log options at mount time?
> I have not tried disabling the log options, although they have been in effect
> for a long time, (the logsbufs and bufsize and recently) the nobarrier
> support.  Could there be an issue using -o nobarrier on a raid1+xfs?

nobarrier should not cause problems.

-Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-11-24 16:23 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-17 22:34 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) Justin Piszcz
2009-10-18 20:17 ` Justin Piszcz
2009-10-19  3:04   ` Dave Chinner
2009-10-19 10:18     ` Justin Piszcz
2009-10-20  0:33       ` Dave Chinner
2009-10-20  8:33         ` Justin Piszcz
2009-10-21 10:19           ` Justin Piszcz
2009-10-22 22:49             ` Justin Piszcz
2009-10-22 23:00               ` Dave Chinner
2009-10-26 11:24               ` Justin Piszcz
2009-11-02 21:46                 ` Justin Piszcz
2009-11-20 20:39             ` 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available) - root cause found = asterisk Justin Piszcz
2009-11-20 23:44               ` Bug#557262: " Faidon Liambotis
2009-11-20 23:51                 ` Justin Piszcz
2009-11-21 14:29                 ` Roger Heflin
2009-11-24 13:08 ` Which kernel options should be enabled to find the root cause of this bug? Justin Piszcz
2009-11-24 15:14   ` Eric Sandeen
2009-11-24 16:20     ` Justin Piszcz
2009-11-24 16:23       ` Eric Sandeen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).