* XFS blocked task in xlog_cil_force_lsn
From: Stefan Priebe - Profihost AG @ 2012-04-16 6:48 UTC
To: stable; +Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, xfs

Hi,

today I've seen this call trace / blocked task in XFS while running 3.0.27.

INFO: task xfsaild/sdc3:1818 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfsaild/sdc3    D ffff88041f374dc8     0  1818      2 0x00000000
 ffff88041edd3d30 0000000000000046 ffff8803ecaa8dc0 ffff88041f23fe28
 ffff88041f374860 0000000000010800 ffff88041edd3fd8 ffff88041edd2010
 ffff88041edd3fd8 0000000000010800 ffff88041f15c860 ffff88041f374860
Call Trace:
 [<ffffffff8159c94a>] schedule+0x3a/0x60
 [<ffffffff812763a9>] xlog_cil_force_lsn+0xe9/0x110
 [<ffffffff8104caf0>] ? try_to_wake_up+0x2a0/0x2a0
 [<ffffffff81273922>] _xfs_log_force+0x62/0x250
 [<ffffffff81273d23>] xfs_log_force+0x13/0x40
 [<ffffffff81281ac7>] ? xfs_trans_ail_cursor_next+0x27/0x30
 [<ffffffff81282237>] xfsaild+0x447/0x460
 [<ffffffff81281df0>] ? xfs_trans_ail_init+0xb0/0xb0
 [<ffffffff8106f646>] kthread+0x96/0xa0
 [<ffffffff815a0214>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106f5b0>] ? kthread_worker_fn+0x130/0x130
 [<ffffffff815a0210>] ? gs_change+0xb/0xb

Is there a known fix / patch available?

Stefan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
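[Editorial note: the warning quoted above is informational; the 120-second threshold is a tunable sysctl. A minimal sketch of inspecting and adjusting it follows. The commands are printed rather than executed here, since writing the file needs root on a live system; the path is the one from the log message itself.]

```shell
# The hung-task watchdog threshold (seconds; 0 disables the warning).
# Printed only; run the printed commands as root on a real system.
f=/proc/sys/kernel/hung_task_timeout_secs
echo "cat $f"          # show the current threshold (default 120)
echo "echo 0 > $f"     # disable the warning, as the log message suggests
echo "echo 240 > $f"   # or raise the threshold instead of disabling it
```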
* Re: XFS blocked task in xlog_cil_force_lsn
From: Dave Chinner @ 2012-04-16 23:56 UTC
To: Stefan Priebe - Profihost AG
Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

On Mon, Apr 16, 2012 at 08:48:25AM +0200, Stefan Priebe - Profihost AG wrote:
> Hi,
>
> today I've seen this call trace / blocked task in XFS while running 3.0.27.
>
> [stack trace snipped]
>
> Is there a known fix / patch available?

Your first question should be: is this a known problem?

And the answer to that is: I don't think so.

So, is the system actually hung? Were there IO errors around this
time? Did the system recover and continue to work normally? What's
your storage hardware, application, workload, etc?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS blocked task in xlog_cil_force_lsn
From: Stefan Priebe - Profihost AG @ 2012-04-17 8:19 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

On 17.04.2012 01:56, Dave Chinner wrote:
> Your first question should be: is this a known problem?
> And the answer to that is: I don't think so.

Sorry, absolutely correct.

> So, is the system actually hung?

No. The kernel has ended the stalled request.

> Were there IO errors around this time?

No.

> Did the system recover and continue to work normally?

Yes.

> What's your storage hardware, application, workload, etc?

It's a RAID 10 of 1TB SATA drives behind a 3ware controller, doing
heavy rsync backups.

Greets,
Stefan
* Re: XFS blocked task in xlog_cil_force_lsn
From: Dave Chinner @ 2012-04-17 8:32 UTC
To: Stefan Priebe - Profihost AG
Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

On Tue, Apr 17, 2012 at 10:19:45AM +0200, Stefan Priebe - Profihost AG wrote:
> [...]
> > What's your storage hardware, application, workload, etc?
>
> It's a RAID 10 of 1TB SATA drives behind a 3ware controller, doing
> heavy rsync backups.

Hmmm. It sounds like it is just taking a long time for the IO to
complete, not that there is any other problem. If the IO is really,
really slow, then a log force is definitely a place it could stall
for some time.

Does the RAID controller have a BBWC? What IO scheduler are you
using? Does mounting with nobarrier reduce the stall time? Can you
reduce the IO load so that it doesn't end up in this state?

Cheers,

Dave.

--
Dave Chinner
david@fromorbit.com
* Re: XFS blocked task in xlog_cil_force_lsn
From: Stefan Priebe - Profihost AG @ 2012-04-17 9:19 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

On 17.04.2012 10:32, Dave Chinner wrote:
>>> What's your storage hardware, application, workload, etc?
>> It's a RAID 10 of 1TB SATA drives behind a 3ware controller, doing
>> heavy rsync backups.
>
> Hmmm. It sounds like it is just taking a long time for the IO to
> complete, not that there is any other problem. If the IO is really,
> really slow, then a log force is definitely a place it could stall
> for some time.

OK, as I've seen this just once, it might simply be that the
controller / disks were too slow in this case.

> Does the RAID controller have a BBWC?

No, but that doesn't matter ;-) If it crashes I can just repair the
FS and restart the backup again.

> What IO scheduler are you using?

[root@backup1 ~]# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq

> Does mounting with nobarrier reduce the stall time?

It is already mounted with nobarrier:
/dev/sdc3 on /sdc3 type xfs (rw,noatime,nodiratime,nobarrier,prjquota)

> Can you reduce the IO load so that it doesn't end up in this state?

I can try that by doing fewer backups per time period.

Stefan
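[Editorial note: in the scheduler file shown above, the bracketed entry is the elevator currently in use (noop here). A small sketch of extracting it in a script, operating on a literal sample string rather than a live /sys read:]

```shell
# A sample line as printed by /sys/block/<dev>/queue/scheduler;
# the bracketed name is the active elevator.
line='[noop] deadline cfq'
active=$(printf '%s\n' "$line" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p')
echo "$active"
```

On a live system the same extraction would read the file directly, e.g. `sed -n 's/.*\[\(.*\)\].*/\1/p' /sys/block/sdc/queue/scheduler`.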
* XFS blocked task in xlog_cil_force_lsn
From: Kevin Richter @ 2013-12-18 2:05 UTC
To: xfs

Hi,

around April 2012 there was a similar thread on this list which I have
found via Google, so my mail topic is the same.

I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying
under heavy load, I always get these blocked tasks. At the bottom of
this message I have included some lines from the syslog.

Even a reboot is now not possible anymore, because the whole system
hangs while executing the "sync" command in one of the shutdown scripts.

So... first I thought that my disks were faulty. But with smartmontools
I have started a short and a long test on all of the 5 disks: no errors.

Then I have even recreated the whole array, but no improvement.

Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS
Details about the array: software array with mdadm v3.2.5, no hardware
RAID controller in the server

The scheduler of the raid disks:
> $ cat /sys/block/sd[cdefg]/queue/scheduler
> noop deadline [cfq]
> noop deadline [cfq]
> noop deadline [cfq]
> noop deadline [cfq]
> noop deadline [cfq]

Any ideas what I can do?

Thanks,
Kevin

> Dec 18 02:19:19 kernel: [390470.193560] INFO: task kswapd0:51 blocked for more than 120 seconds.
> Dec 18 02:19:19 kernel: [390470.194541] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 02:19:19 kernel: [390470.195795] kswapd0 D ffffffff81806240 0 51 2 0x00000000 > Dec 18 02:19:19 kernel: [390470.195804] ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78 > Dec 18 02:19:19 kernel: [390470.195813] ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0 > Dec 18 02:19:19 kernel: [390470.195820] ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28 > Dec 18 02:19:19 kernel: [390470.195827] Call Trace: > Dec 18 02:19:19 kernel: [390470.195844] [<ffffffff8165f62f>] schedule+0x3f/0x60 > Dec 18 02:19:19 kernel: [390470.195850] [<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150 > Dec 18 02:19:19 kernel: [390470.195856] [<ffffffff8166004a>] mutex_lock+0x2a/0x50 > Dec 18 02:19:19 kernel: [390470.195904] [<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs] > Dec 18 02:19:19 kernel: [390470.195928] [<ffffffffa029d83a>] ? __xfs_inode_set_reclaim_tag+0x4a/0x140 [xfs] > Dec 18 02:19:19 kernel: [390470.195951] [<ffffffffa029d96d>] ? xfs_inode_set_reclaim_tag+0x3d/0xa0 [xfs] > Dec 18 02:19:19 kernel: [390470.195973] [<ffffffffa029d9b4>] ? xfs_inode_set_reclaim_tag+0x84/0xa0 [xfs] > Dec 18 02:19:19 kernel: [390470.195995] [<ffffffffa029c6a5>] ? xfs_fs_destroy_inode+0x65/0x150 [xfs] > Dec 18 02:19:19 kernel: [390470.196003] [<ffffffff81194dfc>] ? destroy_inode+0x3c/0x70 > Dec 18 02:19:19 kernel: [390470.196008] [<ffffffff81194f5a>] ? evict+0x12a/0x1c0 > Dec 18 02:19:19 kernel: [390470.196016] [<ffffffff8166154e>] ? 
_raw_spin_lock+0xe/0x20 > Dec 18 02:19:19 kernel: [390470.196037] [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] > Dec 18 02:19:19 kernel: [390470.196059] [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] > Dec 18 02:19:19 kernel: [390470.196067] [<ffffffff8117d71e>] prune_super+0x10e/0x1a0 > Dec 18 02:19:19 kernel: [390470.196073] [<ffffffff8112b924>] shrink_slab+0x154/0x300 > Dec 18 02:19:19 kernel: [390470.196079] [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0 > Dec 18 02:19:19 kernel: [390470.196085] [<ffffffff8112ef01>] kswapd+0x121/0x220 > Dec 18 02:19:19 kernel: [390470.196090] [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0 > Dec 18 02:19:19 kernel: [390470.196097] [<ffffffff8108b64c>] kthread+0x8c/0xa0 > Dec 18 02:19:19 kernel: [390470.196105] [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10 > Dec 18 02:19:19 kernel: [390470.196111] [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0 > Dec 18 02:19:19 kernel: [390470.196117] [<ffffffff8166bcf0>] ? gs_change+0x13/0x13 > Dec 18 02:19:19 kernel: [390470.196121] INFO: task kswapd1:52 blocked for more than 120 seconds. > Dec 18 02:19:19 kernel: [390470.197086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Dec 18 02:19:19 kernel: [390470.198427] kswapd1 D 0000000000000007 0 52 2 0x00000000 > Dec 18 02:19:19 kernel: [390470.198434] ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff > Dec 18 02:19:19 kernel: [390470.198441] ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0 > Dec 18 02:19:19 kernel: [390470.198447] ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004 > Dec 18 02:19:19 kernel: [390470.198454] Call Trace: > Dec 18 02:19:19 kernel: [390470.198462] [<ffffffff8165f62f>] schedule+0x3f/0x60 > Dec 18 02:19:19 kernel: [390470.198498] [<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs] > Dec 18 02:19:19 kernel: [390470.198510] [<ffffffff811643af>] ? 
unfreeze_partials+0x24f/0x2b0 > Dec 18 02:19:19 kernel: [390470.198521] [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200 > Dec 18 02:19:19 kernel: [390470.198554] [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs] > Dec 18 02:19:19 kernel: [390470.198587] [<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs] > Dec 18 02:19:19 kernel: [390470.198617] [<ffffffffa02ba398>] ? xfs_bmbt_get_all+0x18/0x20 [xfs] > Dec 18 02:19:19 kernel: [390470.198646] [<ffffffffa02ae972>] ? xfs_bmap_search_multi_extents+0x92/0x110 [xfs] > Dec 18 02:19:19 kernel: [390470.198677] [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs] > Dec 18 02:19:19 kernel: [390470.198710] [<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs] > Dec 18 02:19:19 kernel: [390470.198740] [<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs] > Dec 18 02:19:19 kernel: [390470.198769] [<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs] > Dec 18 02:19:19 kernel: [390470.198780] [<ffffffff8104ef9b>] ? try_wait_for_completion+0x4b/0x60 > Dec 18 02:19:19 kernel: [390470.198806] [<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs] > Dec 18 02:19:19 kernel: [390470.198818] [<ffffffff81314a4c>] ? radix_tree_gang_lookup_tag.part.7+0x6c/0xf0 > Dec 18 02:19:19 kernel: [390470.198844] [<ffffffffa029dc5f>] xfs_reclaim_inodes_ag+0x23f/0x3a0 [xfs] > Dec 18 02:19:19 kernel: [390470.198871] [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] > Dec 18 02:19:19 kernel: [390470.198895] [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] > Dec 18 02:19:19 kernel: [390470.198905] [<ffffffff8117d71e>] prune_super+0x10e/0x1a0 > Dec 18 02:19:19 kernel: [390470.198913] [<ffffffff8112b924>] shrink_slab+0x154/0x300 > Dec 18 02:19:19 kernel: [390470.198922] [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0 > Dec 18 02:19:19 kernel: [390470.198931] [<ffffffff8112ef01>] kswapd+0x121/0x220 > Dec 18 02:19:19 kernel: [390470.198938] [<ffffffff8112ede0>] ? 
balance_pgdat+0x6d0/0x6d0 > Dec 18 02:19:19 kernel: [390470.198947] [<ffffffff8108b64c>] kthread+0x8c/0xa0 > Dec 18 02:19:19 kernel: [390470.198956] [<ffffffff8105f750>] ? __migrate_task+0x120/0x120 > Dec 18 02:19:19 kernel: [390470.198966] [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10 > Dec 18 02:19:19 kernel: [390470.198974] [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0 > Dec 18 02:19:19 kernel: [390470.198984] [<ffffffff8166bcf0>] ? gs_change+0x13/0x13 > Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds. > Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0 D 0000000000000001 0 19192 2 0x00000000 > Dec 18 02:19:19 kernel: [390470.200268] ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000 > Dec 18 02:19:19 kernel: [390470.200274] ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0 > Dec 18 02:19:19 kernel: [390470.200281] ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080 > Dec 18 02:19:19 kernel: [390470.200288] Call Trace: > Dec 18 02:19:19 kernel: [390470.200293] [<ffffffff8165f62f>] schedule+0x3f/0x60 > Dec 18 02:19:19 kernel: [390470.200323] [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs] > Dec 18 02:19:19 kernel: [390470.200330] [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200 > Dec 18 02:19:19 kernel: [390470.200358] [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs] > Dec 18 02:19:19 kernel: [390470.200382] [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs] > Dec 18 02:19:19 kernel: [390470.200410] [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs] > Dec 18 02:19:19 kernel: [390470.200438] [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs] > Dec 18 02:19:19 kernel: [390470.200446] [<ffffffff810136e5>] ? 
__switch_to+0xf5/0x360 > Dec 18 02:19:19 kernel: [390470.200473] [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs] > Dec 18 02:19:19 kernel: [390470.200481] [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130 > Dec 18 02:19:19 kernel: [390470.200507] [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs] > Dec 18 02:19:19 kernel: [390470.200535] [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs] > Dec 18 02:19:19 kernel: [390470.200540] [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320 > Dec 18 02:19:19 kernel: [390470.200568] [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs] > Dec 18 02:19:19 kernel: [390470.200593] [<ffffffffa02e7bb0>] ? xfsaild_push+0x680/0x680 [xfs] > Dec 18 02:19:19 kernel: [390470.200599] [<ffffffff8108b64c>] kthread+0x8c/0xa0 > Dec 18 02:19:19 kernel: [390470.200605] [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10 > Dec 18 02:19:19 kernel: [390470.200611] [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0 > Dec 18 02:19:19 kernel: [390470.200617] [<ffffffff8166bcf0>] ? gs_change+0x13/0x13 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
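[Editorial note: the smartmontools self-tests Kevin mentions are typically started per disk with smartctl. A sketch of the loop over the five array members named in the report (sdc through sdg); the commands are printed rather than executed, since they need root and the real device nodes:]

```shell
# Print the smartctl invocations for each RAID6 member disk.
# '-t short' takes ~2 minutes, '-t long' can take hours;
# results are read back later with 'smartctl -a /dev/<dev>'.
for dev in sdc sdd sde sdf sdg; do
    echo "smartctl -t long /dev/$dev"
done
```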
* Re: XFS blocked task in xlog_cil_force_lsn
From: Stan Hoeppner @ 2013-12-18 3:38 UTC
To: xfs

On 12/17/2013 8:05 PM, Kevin Richter wrote:
> Hi,
>
> around April 2012 there was a similar thread on this list which I have
> found via Google, so my mail topic is the same.
>
> I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying
> under heavy load, I always get these blocked tasks. At the bottom of
> this message I have included some lines from the syslog.
>
> [...]
>
> Any ideas what I can do?

Your workload is seeking the disks to death, which is why you're getting
these timeouts. The actuators simply can't keep up.

1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.

2. Post your chunk size and RAID6 stripe_cache_size value. They may be
suboptimal for your workload. For the latter:

$ cat /sys/block/mdX/md/stripe_cache_size

3. Post 'xfs_info /dev/mdX'

4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.

5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.

> [quoted syslog traces snipped]

--
Stan
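[Editorial note: relocating swap as suggested in point 4 is a short procedure. A sketch follows; the commands are printed rather than run, `/dev/sdh1` is a hypothetical SSD partition (not from the thread), and `$old` stands in for wherever swap currently lives:]

```shell
# Dry-run sketch of moving swap off the array onto an SSD partition.
old=/dev/sda3      # current swap partition (stand-in)
new=/dev/sdh1      # hypothetical SSD partition
echo "swapoff $old"   # stop swapping on the old partition
echo "mkswap $new"    # initialize the SSD partition as swap
echo "swapon $new"    # enable it
echo "# finally, update the swap entry in /etc/fstab"
```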
* Re: XFS blocked task in xlog_cil_force_lsn
From: Kevin Richter @ 2013-12-18 10:27 UTC
To: xfs

Thanks for your mails!

> This is unusual. How long have you waited?

For the reboot? One night.
After the copy process hangs: several hours. But mostly it recovers
after several minutes.

> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not
> suitable for XFS. This may not be a silver bullet but it will help.

Can I switch it while my copy process (from a separate hd to this raid)
is running... without data loss? Otherwise I would wait a bit, because
now it has actually been running for 8 hours without kernel panics.

> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be
> suboptimal for your workload.

$ cat /sys/block/md2/md/stripe_cache_size
256
$ mdadm --detail /dev/md2 | grep Chunk
Chunk Size : 512K

> 3. Post 'xfs_info /dev/mdX'

There is a LUKS volume on top of /dev/md2, named '6tb'.

> $ xfs_info /dev/md2
> xfs_info: /dev/md2 is not a mounted XFS filesystem
> $ xfs_info /dev/mapper/6tb
> meta-data=/dev/mapper/6tb  isize=256    agcount=32, agsize=45631360 blks
>          =                 sectsz=512   attr=2
> data     =                 bsize=4096   blocks=1460203520, imaxpct=5
>          =                 sunit=128    swidth=384 blks
> naming   =version 2        bsize=4096   ascii-ci=0
> log      =internal         bsize=4096   blocks=521728, version=2
>          =                 sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none             extsz=4096   blocks=0, rtextents=0

> 4. You're getting a lot of kswapd timeouts because you have swap and
> the md/RAID6 array on the same disks. Relocate swap to disks that are
> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
> swap on it. Or install more RAM in the machine. Going the SSD route is
> better as it gives flexibility. For instance, you can also relocate
> your syslog files to it and anything else that does IO without eating
> lots of space.
> This decreases the IOPS load on your rust.

No no, swap is not on any of the raid disks.

> # cat /proc/swaps
> Filename     Type       Size     Used  Priority
> /dev/sda3    partition  7812496  0     -1

sda is not in the raid. In the raid there are sd[cdefg].

> 5. Describe in some detail the workload(s) causing the heavy IO, and
> thus these timeouts.

cd /olddharddisk
cp -av . /raid/

/olddharddisk is a mounted 1TB old harddisk, /raid is the 6TB raid
from above.

Heavy workload while this copy process (2 CPUs, each 4 cores):

> top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32
> Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st
> Mem:  32916276k total, 32750240k used, 166036k free, 10076760k buffers
> Swap:  7812496k total, 0k used, 7812496k free, 21221136k cached
>
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  699 root  20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6

Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
consumption does not tell me. All entries are only 0.0% and 0.1%.

Thanks,
Kevin

On 18.12.2013 04:38, Stan Hoeppner wrote:
> On 12/17/2013 8:05 PM, Kevin Richter wrote:
> [...]
>> >> Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS >> Details about the array: soft array with mdadm v3.2.5, no hardware raid >> controller in the server >> >> The scheduler of the raid disks: >>> $ cat /sys/block/sd[cdefg]/queue/scheduler >>> noop deadline [cfq] >>> noop deadline [cfq] >>> noop deadline [cfq] >>> noop deadline [cfq] >>> noop deadline [cfq] >> >> >> Any ideas what I can do? > > Your workload is seeking the disks to death, which is why you're getting > these timeouts. The actuators simply can't keep up. > > 1. Switch to deadline. CFQ is not suitable for RAID storage, and not > suitable for XFS. This may not be a silver bullet but it will help. > > 2. Post your chunk size and RAID6 stripe_cache_size value. They may be > sub optimal for your workload. For the latter > > $ cat /sys/block/mdX/md/stripe_cache_size > > 3. Post 'xfs_info /dev/mdX' > > 4. You're getting a lot of kswapd timeouts because you have swap and > the md/RAID6 array on the same disks. Relocate swap to disks that are > not part of this RAID6. Small SSDs are cheap and fast. Buy one and put > swap on it. Or install more RAM in the machine. Going the SSD route is > better as it gives flexibility. For instance, you can also relocate > your syslog files to it and anything else that does IO without eating > lots of space. This decreases the IOPS load on your rust. > > 5. Describe in some detail the workload(s) causing the heavy IO, and > thus these timeouts. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
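The geometry question in Stan's point 3 can be sanity-checked arithmetically against the xfs_info and mdadm output quoted in the message above. A short sketch using only those posted values (bsize=4096, sunit=128, swidth=384 blks, 512K chunk, 5-disk RAID6):

```python
# Cross-check the XFS stripe alignment against the md geometry, using
# the values from the xfs_info and mdadm output quoted above.
bsize = 4096          # XFS block size (bytes)
sunit = 128           # stripe unit, in filesystem blocks
swidth = 384          # stripe width, in filesystem blocks
chunk = 512 * 1024    # md chunk size (bytes)
ndisks = 5            # RAID6 members; data spindles = ndisks - 2

assert sunit * bsize == chunk            # 128 * 4096 B = 512K chunk
assert swidth == sunit * (ndisks - 2)    # 384 = 128 * 3 data spindles
print("aligned: sunit=%dK swidth=%dK" % (sunit * bsize // 1024,
                                         swidth * bsize // 1024))
# -> aligned: sunit=512K swidth=1536K
```

So the filesystem alignment matches the array: a 512K stripe unit and a 1.5MB full stripe across the three data spindles.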
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-18 10:27 ` Kevin Richter @ 2013-12-19 14:11 ` Stan Hoeppner 2013-12-20 10:26 ` Kevin Richter 0 siblings, 1 reply; 22+ messages in thread From: Stan Hoeppner @ 2013-12-19 14:11 UTC (permalink / raw) To: xfs, xfs On 12/18/2013 4:27 AM, Kevin Richter wrote: > Thanks for your mails! > >> This is unusual. How long have you waited? > > For the reboot? One night. > After the copy process hangs: several hours. But mostly it recovers > after several minutes. > >> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not >> suitable for XFS. This may not be a silver bullet but it will help. > > Can I switch it while my copy process (from a separate hd to this raid) > is running... without data loss? Otherwise I would wait a bit, because > now it is actually running for 8 hours without kernel panics. Yes, you can safely change the elevator on the fly. Repeat the following once for each disk, replacing X appropriately. $ echo deadline > /sys/block/sdX/queue/scheduler This on the fly change is not permanent. Add "elevator=deadline" to your kernel command line in your bootloader config file. Or you can add the commands to a startup script. Or if you build your own kernels simply make it the default, and/or remove all the other elevators from your kernel config, which is what I do. >> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be >> sub optimal for your workload. > > $ cat /sys/block/md2/md/stripe_cache_size > 256 256 is the default and it is way too low. This is limiting your write througput. Increase this to a minimum of 1024 which will give you a 20MB stripe cache buffer. This should become active immediately. Add it to a startup script to make it permanent. 
$ echo 1024 > /sys/block/md2/md/stripe_cache_size ((4096*stripe-cache_size)*num_drives)= RAM consumed for stripe cache At some future point when the system is idle, play with FIO write testing and multiple jobs (threads) with stripe_cache_size values up to 2048 to see which value gives the best throughput. Test 1024, 2048, 4096. Going higher probably won't gain anything. And at some point as you go higher your throughput will decrease. When you see that, select the previous value and make it permanent. > $ mdadm --detail /dev/md2 | grep Chunk > Chunk Size : 512K > >> 3. Post 'xfs_info /dev/mdX' > > There is a LUKS volume around /dev/md2, named '6tb'. Ok, this adds a huge fly to the ointment. LUKS is single threaded per device. You've encrypted the md device instead of the individual disks. This limits your encryption throughput to a single CPU core. You've not provided any data on your LUKS thread. Is it pegging a core during large file IO? I'd bet it is. That would explain all of this. >> $ xfs_info /dev/md2 >> xfs_info: /dev/md2 is not a mounted XFS filesystem >> $ xfs_info /dev/mapper/6tb >> meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks >> = sectsz=512 attr=2 >> data = bsize=4096 blocks=1460203520, imaxpct=5 >> = sunit=128 swidth=384 blks >> naming =version 2 bsize=4096 ascii-ci=0 >> log =internal bsize=4096 blocks=521728, version=2 >> = sectsz=512 sunit=8 blks, lazy-count=1 >> realtime =none extsz=4096 blocks=0, rtextents=0 md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The XFS alignment matches the md device. This may be hurting you with LUKS between XFS and md. I don't know how LUKS handles 1.5MB inbound writes and if it passes an aligned write to md after encrypting. I also don't know how it handles writes barriers passed down the stack by XFS, or if it does at all. >> 4. You're getting a lot of kswapd timeouts because you have swap and >> the md/RAID6 array on the same disks. 
Relocate swap to disks that are >> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put >> swap on it. Or install more RAM in the machine. Going the SSD route is >> better as it gives flexibility. For instance, you can also relocate >> your syslog files to it and anything else that does IO without eating >> lots of space. This decreases the IOPS load on your rust. > > No no, swap is not on any of the raid disks. > >> # cat /proc/swaps >> Filename Type Size Used Priority >> /dev/sda3 partition 7812496 0 -1 > sda is not in the raid. In the raid there are sd[cdefg]. I was wrong here. After reading a little about kswapd I now know it doesn't simply move pages to/from disk. It frees memory in other ways as well. In this case it's apparently waiting to free an inode being reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So kswapd times out as a result. A slow LUKS thread would explain this as well. >> 5. Describe in some detail the workload(s) causing the heavy IO, and >> thus these timeouts. > > cd /olddharddisk > cp -av . /raid/ > > oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from > above. So you are simply doing a huge file copy from an apparently slower single disk to an apparently faster striped RAID. But you get horrible throughput and kernel timeout errors, and the RAID disks are apparently fairly idle. > Heavy workload while this copy process (2 CPUs, each 4 cores): Model # of the CPUs so I can look up the specs? 
>> top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32 >> Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie >> Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st >> Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers >> Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6 > > Dont know what consumes all of this 32GB RAM... 'top' sorted by memory > consumption does not tell me. All entries are only 0.0% and 0.1% ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB) and page cache (20GB). Linux uses nearly all free memory for cache and buffers to prevent disk accesses, which speeds up access to frequently used data considerably. This is normal. A huge copy like this is going to cause a huge amount caching. I assume this 1TB file copy has finally completed. Start another large test copy and paste the CPU burn for the LUKS thread so we can confirm whether this is the root cause of your problem. -- Stan > > > > Thanks, > Kevin > > > > > Am 18.12.2013 04:38, schrieb Stan Hoeppner: >> On 12/17/2013 8:05 PM, Kevin Richter wrote: >>> Hi, >>> >>> around April 2012 there was a similar thread on this list which I have >>> found via Google, so my mail topic is the same. >>> >>> I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying >>> under heavy load there are always these blocks. At the bottom of this >>> message I have included some line from the syslog. >>> >>> Even a reboot is now not possible anymore, because the whole system >>> hangs while executing the "sync" command in one of the shutdown scripts. >>> >>> So... first I have thought that my disks are faulty. >>> But with smartmontools I have started a short and a long test on all of >>> the 5 disks: no errors >>> >>> Then I have even recreated the whole array, but no improvement. 
>>> >>> Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS >>> Details about the array: soft array with mdadm v3.2.5, no hardware raid >>> controller in the server >>> >>> The scheduler of the raid disks: >>>> $ cat /sys/block/sd[cdefg]/queue/scheduler >>>> noop deadline [cfq] >>>> noop deadline [cfq] >>>> noop deadline [cfq] >>>> noop deadline [cfq] >>>> noop deadline [cfq] >>> >>> >>> Any ideas what I can do? >> >> Your workload is seeking the disks to death, which is why you're getting >> these timeouts. The actuators simply can't keep up. >> >> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not >> suitable for XFS. This may not be a silver bullet but it will help. >> >> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be >> sub optimal for your workload. For the latter >> >> $ cat /sys/block/mdX/md/stripe_cache_size >> >> 3. Post 'xfs_info /dev/mdX' >> >> 4. You're getting a lot of kswapd timeouts because you have swap and >> the md/RAID6 array on the same disks. Relocate swap to disks that are >> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put >> swap on it. Or install more RAM in the machine. Going the SSD route is >> better as it gives flexibility. For instance, you can also relocate >> your syslog files to it and anything else that does IO without eating >> lots of space. This decreases the IOPS load on your rust. >> >> 5. Describe in some detail the workload(s) causing the heavy IO, and >> thus these timeouts. >> > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
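The RAM cost of raising stripe_cache_size follows directly from the formula Stan gives above ((4096 * stripe_cache_size) * num_drives); a quick sketch for this 5-disk array, assuming the 4096-byte page size used in the formula:

```python
# RAM consumed by the md stripe cache, per the formula above:
#   (4096 * stripe_cache_size) * num_drives
PAGE = 4096

def stripe_cache_ram(stripe_cache_size, num_drives):
    return PAGE * stripe_cache_size * num_drives

for scs in (256, 1024, 2048, 4096):
    mib = stripe_cache_ram(scs, 5) // (1024 * 1024)
    print("stripe_cache_size=%4d -> %3d MiB" % (scs, mib))
# 256 (the default) costs only 5 MiB; 1024 gives the ~20MB buffer
# mentioned above; even 4096 consumes just 80 MiB on this 32GB box.
```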
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-19 14:11 ` Stan Hoeppner @ 2013-12-20 10:26 ` Kevin Richter 2013-12-20 12:36 ` Stan Hoeppner 2013-12-20 22:43 ` Arkadiusz Miśkiewicz 0 siblings, 2 replies; 22+ messages in thread From: Kevin Richter @ 2013-12-20 10:26 UTC (permalink / raw) To: stan, xfs

Thanks a lot for your huge replies!

> Oh, that's quite old. I'd suggest upgrading to a much more recent
> kernel as we've fixed lots of issues in this area since then.

First I have switched to the newer kernel from Ubuntu Saucy:

$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP

Thus, it seems the default scheduler has been changed to deadline. I did
not change anything, yet after a reboot the schedulers of all disks are
now deadline.

> Model # of the CPUs so I can look up the specs?

Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

I have prepared a folder with 60GB for the tests. This is nearly twice
the available memory, so the process should be forced to actually write
the data to disk rather than only holding it in memory.

> $ echo 256 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /6tb/foo1/
> real 25m38.925s
> user 0m0.595s
> sys 1m23.182s
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo2/
> real 7m32.824s
> user 0m0.438s
> sys 1m6.759s
>
> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo3/
> real 5m32.847s
> user 0m0.418s
> sys 1m5.671s
>
> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo4/
> real 5m54.554s
> user 0m0.437s
> sys 1m6.268s

The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is about 180 MB/s, which sounds a bit high, doesn't
it? The RAID only consists of 5 SATA disks at 7200 rpm.
'top' while copying with stripe size of 2048 (the source disk is ntfs): > top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 > Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie > Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st > Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers > Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 > 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 > 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 > 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 > 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs > 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 > 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp > 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 > 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 > 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 > 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1 And the best thing at all: While all of these tests there are no warnings/panics in the syslog. With best regards, Kevin Am 19.12.2013 15:11, schrieb Stan Hoeppner: > On 12/18/2013 4:27 AM, Kevin Richter wrote: >> Thanks for your mails! >> >>> This is unusual. How long have you waited? >> >> For the reboot? One night. >> After the copy process hangs: several hours. But mostly it recovers >> after several minutes. >> >>> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not >>> suitable for XFS. This may not be a silver bullet but it will help. >> >> Can I switch it while my copy process (from a separate hd to this raid) >> is running... without data loss? Otherwise I would wait a bit, because >> now it is actually running for 8 hours without kernel panics. > > Yes, you can safely change the elevator on the fly. Repeat the > following once for each disk, replacing X appropriately. 
> > $ echo deadline > /sys/block/sdX/queue/scheduler > > This on the fly change is not permanent. Add "elevator=deadline" to > your kernel command line in your bootloader config file. Or you can add > the commands to a startup script. Or if you build your own kernels > simply make it the default, and/or remove all the other elevators from > your kernel config, which is what I do. > >>> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be >>> sub optimal for your workload. >> >> $ cat /sys/block/md2/md/stripe_cache_size >> 256 > > 256 is the default and it is way too low. This is limiting your write > througput. Increase this to a minimum of 1024 which will give you a > 20MB stripe cache buffer. This should become active immediately. Add > it to a startup script to make it permanent. > > $ echo 1024 > /sys/block/md2/md/stripe_cache_size > > ((4096*stripe-cache_size)*num_drives)= RAM consumed for stripe cache > > At some future point when the system is idle, play with FIO write > testing and multiple jobs (threads) with stripe_cache_size values up to > 2048 to see which value gives the best throughput. Test 1024, 2048, > 4096. Going higher probably won't gain anything. And at some point as > you go higher your throughput will decrease. When you see that, select > the previous value and make it permanent. > >> $ mdadm --detail /dev/md2 | grep Chunk >> Chunk Size : 512K >> >>> 3. Post 'xfs_info /dev/mdX' >> >> There is a LUKS volume around /dev/md2, named '6tb'. > > Ok, this adds a huge fly to the ointment. LUKS is single threaded per > device. You've encrypted the md device instead of the individual disks. > This limits your encryption throughput to a single CPU core. You've > not provided any data on your LUKS thread. Is it pegging a core during > large file IO? I'd bet it is. That would explain all of this. 
> >>> $ xfs_info /dev/md2 >>> xfs_info: /dev/md2 is not a mounted XFS filesystem >>> $ xfs_info /dev/mapper/6tb >>> meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks >>> = sectsz=512 attr=2 >>> data = bsize=4096 blocks=1460203520, imaxpct=5 >>> = sunit=128 swidth=384 blks >>> naming =version 2 bsize=4096 ascii-ci=0 >>> log =internal bsize=4096 blocks=521728, version=2 >>> = sectsz=512 sunit=8 blks, lazy-count=1 >>> realtime =none extsz=4096 blocks=0, rtextents=0 > > md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The > XFS alignment matches the md device. This may be hurting you with LUKS > between XFS and md. I don't know how LUKS handles 1.5MB inbound writes > and if it passes an aligned write to md after encrypting. I also don't > know how it handles writes barriers passed down the stack by XFS, or if > it does at all. > >>> 4. You're getting a lot of kswapd timeouts because you have swap and >>> the md/RAID6 array on the same disks. Relocate swap to disks that are >>> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put >>> swap on it. Or install more RAM in the machine. Going the SSD route is >>> better as it gives flexibility. For instance, you can also relocate >>> your syslog files to it and anything else that does IO without eating >>> lots of space. This decreases the IOPS load on your rust. >> >> No no, swap is not on any of the raid disks. >> >>> # cat /proc/swaps >>> Filename Type Size Used Priority >>> /dev/sda3 partition 7812496 0 -1 >> sda is not in the raid. In the raid there are sd[cdefg]. > > I was wrong here. After reading a little about kswapd I now know it > doesn't simply move pages to/from disk. It frees memory in other ways > as well. In this case it's apparently waiting to free an inode being > reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So > kswapd times out as a result. A slow LUKS thread would explain this as > well. > >>> 5. 
Describe in some detail the workload(s) causing the heavy IO, and >>> thus these timeouts. >> >> cd /olddharddisk >> cp -av . /raid/ >> >> oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from >> above. > > So you are simply doing a huge file copy from an apparently slower > single disk to an apparently faster striped RAID. But you get horrible > throughput and kernel timeout errors, and the RAID disks are apparently > fairly idle. > >> Heavy workload while this copy process (2 CPUs, each 4 cores): > > Model # of the CPUs so I can look up the specs? > >>> top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32 >>> Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie >>> Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st >>> Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers >>> Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached >>> >>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>> 699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6 >> >> Dont know what consumes all of this 32GB RAM... 'top' sorted by memory >> consumption does not tell me. All entries are only 0.0% and 0.1% > > ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB) > and page cache (20GB). Linux uses nearly all free memory for cache and > buffers to prevent disk accesses, which speeds up access to frequently > used data considerably. This is normal. A huge copy like this is going > to cause a huge amount caching. > > I assume this 1TB file copy has finally completed. Start another large > test copy and paste the CPU burn for the LUKS thread so we can confirm > whether this is the root cause of your problem. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
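For reference, the effective copy rates implied by the timings in Kevin's mail above can be computed directly; the figures are optimistic, since part of the 60GB source is served from the page cache between runs:

```python
# Throughput implied by the `time cp` results above: 60 GiB divided by
# the elapsed (real) time for each stripe_cache_size setting.
times = {256: 25 * 60 + 38.925,
         1024: 7 * 60 + 32.824,
         2048: 5 * 60 + 32.847,
         4096: 5 * 60 + 54.554}
size_mib = 60 * 1024

for scs in sorted(times):
    print("stripe_cache_size=%4d: %6.1f MiB/s" % (scs, size_mib / times[scs]))
# -> roughly 40, 136, 185 and 173 MiB/s respectively
```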
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-20 10:26 ` Kevin Richter @ 2013-12-20 12:36 ` Stan Hoeppner 2013-12-21 5:30 ` Dave Chinner 2013-12-22 2:35 ` Kevin Richter 2013-12-20 22:43 ` Arkadiusz Miśkiewicz 1 sibling, 2 replies; 22+ messages in thread From: Stan Hoeppner @ 2013-12-20 12:36 UTC (permalink / raw) To: xfs, xfs On 12/20/2013 4:26 AM, Kevin Richter wrote: > Thanks a lot for your huge replies! You're welcome. >> Oh, that's quite old. I'd suggest upgrading to a much more recent >> kernel as we've fixed lots of issues in this area since then. > > First I have switched to the newer kernel from Ubuntu Saucy: > $ uname -a > Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP > > Thus, it seems that the default scheduler has been changed to deadline. > I did not change anything. After a reboot the schedulers of all disks > are now deadline. Good move Ubuntu. >> Model # of the CPUs so I can look up the specs? > Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz Strong CPUs. > I have prepared a folder with 60GB for the tests. This is nearly twice > of the available memory, so the process should be forced to actually > write the stuff to the disk - and not only hold in the memory. > >> $ echo 256 > /sys/block/md2/md/stripe_cache_size >> $ time cp -a /olddisk/testfolder /6tb/foo1/ >> real 25m38.925s >> user 0m0.595s >> sys 1m23.182s >> >> $ echo 1024 > /sys/block/md2/md/stripe_cache_size >> $ time cp -a /olddisk/testfolder /raid/foo2/ >> real 7m32.824s >> user 0m0.438s >> sys 1m6.759s >> >> $ echo 2048 > /sys/block/md2/md/stripe_cache_size >> $ time cp -a /olddisk/testfolder /raid/foo3/ >> real 5m32.847s >> user 0m0.418s >> sys 1m5.671s >> >> $ echo 4096 > /sys/block/md2/md/stripe_cache_size >> $ time cp -a /olddisk/testfolder /raid/foo4/ >> real 5m54.554s >> user 0m0.437s >> sys 1m6.268s > > The difference is really amazing! So 2048 seems to be the best choice. > 60GB in 5,5minutes are 180MB/sek. That sounds a bit high, doesnt it? 
> The RAID only consist of 5 SATA disks with 7200rpm. A lot of the source data is being cached between runs so these numbers aren't accurate. The throughput of this copy operation will be limited by the speed of the single source disk, not the array. To make the elapsed times of this copy test accurate you need to execute something like these commands after each run: # sync # echo 3 > /proc/sys/vm/drop_caches But this copy test will not inform you about the potential peak performance of your array. That's why I suggested you test with FIO, the flexible IO tester. # aptitude install fio # man fio Sample job file suitable for your system: [global] directory=/your/XFS/test/directory zero_buffers numjobs=8 group_reporting blocksize=512k ioengine=libaio iodepth=16 direct=1 size=10g [read] rw=read stonewall [write] rw=write stonewall This should give you a relatively accurate picture of the actual potential throughput of your array and filesystem. > 'top' while copying with stripe size of 2048 (the source disk is ntfs): >> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 >> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie >> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st >> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers >> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 >> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 >> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 >> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 >> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs >> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 >> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp >> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 >> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 >> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 >> 21598 root 20 0 0 0 
0 D 1 0.0 0:11.33 kworker/u17:1 Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a SandyBridge core at only ~180 MB/s throughput is it? > And the best thing at all: > While all of these tests there are no warnings/panics in the syslog. > > With best regards, > Kevin Even though XFS wasn't the cause of the problem I'm glad we were able to help you fix it nonetheless. I'm really curious to see what kind of throughput you can achieve with FIO, and whether crypto is a bottleneck at the 250-350 MB/s your array should be capable of. It would be great if you would play around a bit with FIO and post some numbers. -- Stan > Am 19.12.2013 15:11, schrieb Stan Hoeppner: >> On 12/18/2013 4:27 AM, Kevin Richter wrote: >>> Thanks for your mails! >>> >>>> This is unusual. How long have you waited? >>> >>> For the reboot? One night. >>> After the copy process hangs: several hours. But mostly it recovers >>> after several minutes. >>> >>>> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not >>>> suitable for XFS. This may not be a silver bullet but it will help. >>> >>> Can I switch it while my copy process (from a separate hd to this raid) >>> is running... without data loss? Otherwise I would wait a bit, because >>> now it is actually running for 8 hours without kernel panics. >> >> Yes, you can safely change the elevator on the fly. Repeat the >> following once for each disk, replacing X appropriately. >> >> $ echo deadline > /sys/block/sdX/queue/scheduler >> >> This on the fly change is not permanent. Add "elevator=deadline" to >> your kernel command line in your bootloader config file. Or you can add >> the commands to a startup script. Or if you build your own kernels >> simply make it the default, and/or remove all the other elevators from >> your kernel config, which is what I do. >> >>>> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be >>>> sub optimal for your workload. 
>>> >>> $ cat /sys/block/md2/md/stripe_cache_size >>> 256 >> >> 256 is the default and it is way too low. This is limiting your write >> througput. Increase this to a minimum of 1024 which will give you a >> 20MB stripe cache buffer. This should become active immediately. Add >> it to a startup script to make it permanent. >> >> $ echo 1024 > /sys/block/md2/md/stripe_cache_size >> >> ((4096*stripe-cache_size)*num_drives)= RAM consumed for stripe cache >> >> At some future point when the system is idle, play with FIO write >> testing and multiple jobs (threads) with stripe_cache_size values up to >> 2048 to see which value gives the best throughput. Test 1024, 2048, >> 4096. Going higher probably won't gain anything. And at some point as >> you go higher your throughput will decrease. When you see that, select >> the previous value and make it permanent. >> >>> $ mdadm --detail /dev/md2 | grep Chunk >>> Chunk Size : 512K >>> >>>> 3. Post 'xfs_info /dev/mdX' >>> >>> There is a LUKS volume around /dev/md2, named '6tb'. >> >> Ok, this adds a huge fly to the ointment. LUKS is single threaded per >> device. You've encrypted the md device instead of the individual disks. >> This limits your encryption throughput to a single CPU core. You've >> not provided any data on your LUKS thread. Is it pegging a core during >> large file IO? I'd bet it is. That would explain all of this. >> >>>> $ xfs_info /dev/md2 >>>> xfs_info: /dev/md2 is not a mounted XFS filesystem >>>> $ xfs_info /dev/mapper/6tb >>>> meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks >>>> = sectsz=512 attr=2 >>>> data = bsize=4096 blocks=1460203520, imaxpct=5 >>>> = sunit=128 swidth=384 blks >>>> naming =version 2 bsize=4096 ascii-ci=0 >>>> log =internal bsize=4096 blocks=521728, version=2 >>>> = sectsz=512 sunit=8 blks, lazy-count=1 >>>> realtime =none extsz=4096 blocks=0, rtextents=0 >> >> md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. 
The >> XFS alignment matches the md device. This may be hurting you with LUKS >> between XFS and md. I don't know how LUKS handles 1.5MB inbound writes >> and if it passes an aligned write to md after encrypting. I also don't >> know how it handles writes barriers passed down the stack by XFS, or if >> it does at all. >> >>>> 4. You're getting a lot of kswapd timeouts because you have swap and >>>> the md/RAID6 array on the same disks. Relocate swap to disks that are >>>> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put >>>> swap on it. Or install more RAM in the machine. Going the SSD route is >>>> better as it gives flexibility. For instance, you can also relocate >>>> your syslog files to it and anything else that does IO without eating >>>> lots of space. This decreases the IOPS load on your rust. >>> >>> No no, swap is not on any of the raid disks. >>> >>>> # cat /proc/swaps >>>> Filename Type Size Used Priority >>>> /dev/sda3 partition 7812496 0 -1 >>> sda is not in the raid. In the raid there are sd[cdefg]. >> >> I was wrong here. After reading a little about kswapd I now know it >> doesn't simply move pages to/from disk. It frees memory in other ways >> as well. In this case it's apparently waiting to free an inode being >> reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So >> kswapd times out as a result. A slow LUKS thread would explain this as >> well. >> >>>> 5. Describe in some detail the workload(s) causing the heavy IO, and >>>> thus these timeouts. >>> >>> cd /olddharddisk >>> cp -av . /raid/ >>> >>> oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from >>> above. >> >> So you are simply doing a huge file copy from an apparently slower >> single disk to an apparently faster striped RAID. But you get horrible >> throughput and kernel timeout errors, and the RAID disks are apparently >> fairly idle. 
>> >>> Heavy workload while this copy process (2 CPUs, each 4 cores): >> >> Model # of the CPUs so I can look up the specs? >> >>>> top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32 >>>> Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie >>>> Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st >>>> Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers >>>> Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached >>>> >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>> 699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6 >>> >>> Dont know what consumes all of this 32GB RAM... 'top' sorted by memory >>> consumption does not tell me. All entries are only 0.0% and 0.1% >> >> ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB) >> and page cache (20GB). Linux uses nearly all free memory for cache and >> buffers to prevent disk accesses, which speeds up access to frequently >> used data considerably. This is normal. A huge copy like this is going >> to cause a huge amount caching. >> >> I assume this 1TB file copy has finally completed. Start another large >> test copy and paste the CPU burn for the LUKS thread so we can confirm >> whether this is the root cause of your problem. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
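The 250-350 MB/s figure Stan mentions for the array can be reconstructed with a back-of-envelope estimate. The per-spindle rates here are assumptions (plausible sequential rates for 7200 rpm SATA disks of that era), not measurements from this system:

```python
# Rough streaming estimate for a 5-disk RAID6: full-stripe writes and
# sequential reads spread across (ndisks - 2) data spindles.
ndisks = 5
per_disk_lo, per_disk_hi = 85, 115   # MB/s per spindle, assumed figures

data_spindles = ndisks - 2
lo, hi = data_spindles * per_disk_lo, data_spindles * per_disk_hi
print("expected streaming throughput: ~%d-%d MB/s" % (lo, hi))
# -> ~255-345 MB/s, the same ballpark as the figure quoted above
```

Measured FIO numbers well below this range would point at a bottleneck above the disks, such as the single LUKS thread discussed earlier.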
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-20 12:36 ` Stan Hoeppner @ 2013-12-21 5:30 ` Dave Chinner 2013-12-22 9:18 ` Stan Hoeppner 2013-12-22 2:35 ` Kevin Richter 1 sibling, 1 reply; 22+ messages in thread From: Dave Chinner @ 2013-12-21 5:30 UTC (permalink / raw) To: Stan Hoeppner; +Cc: xfs, xfs On Fri, Dec 20, 2013 at 06:36:33AM -0600, Stan Hoeppner wrote: > On 12/20/2013 4:26 AM, Kevin Richter wrote: > > 'top' while copying with stripe size of 2048 (the source disk is ntfs): > >> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 > >> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie > >> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st > >> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers > >> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached > >> > >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 > >> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 > >> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 > >> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 > >> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs > >> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 > >> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp > >> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 > >> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 > >> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 > >> 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1 > > Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a > SandyBridge core at only ~180 MB/s throughput is it? Kworkers are an anonymous kernel worker threads that do work that has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread on CPU 3 (3:0 is the first). The kworker is a thread pool that grows and shrinks according to demand. 
As the naming suggests, kworker threads are per-CPU, and there can be hundreds of them per CPU if there is enough workqueue work that blocks during execution (e.g. on locks, waiting for IO, etc). If there is little blocking occurring, there might only be a couple of kworker threads that do all the work, and hence you see them consuming huge amounts of CPU on behalf of other subsystems... XFS uses workqueues for lots of things, so it's not unusual to see an IO or metadata heavy workload end up with huge numbers of kworker threads doing work: <run parallel fsmark test> .... $ ps -ef |grep kworker |wc -l 91 $ Cheers, Dave. -- Dave Chinner david@fromorbit.com
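The naming scheme Dave describes can be decoded mechanically. A small sketch (the `parse_kworker` helper is hypothetical, written for this note; the thread names come from the top output quoted above):

```shell
# Decode kworker/<cpu>:<n> and kworker/u<pool>:<n> thread names.
# kworker/3:1 is the second pool worker bound to CPU 3 (3:0 is the first);
# a "u" prefix marks an unbound worker not tied to any one CPU.
parse_kworker() {
  case $1 in
    kworker/u*) echo "$1: unbound worker" ;;
    kworker/*)  cpu=${1#kworker/}; cpu=${cpu%%:*}
                echo "$1: worker ${1##*:} on CPU $cpu" ;;
    *)          echo "$1: not a kworker" ;;
  esac
}
parse_kworker kworker/3:1    # the thread eating 93% CPU in the top output
parse_kworker kworker/u17:1
```

Note that the name only tells you which CPU a bound worker serves, not which workqueue it is running; for that, see the inspection techniques discussed later in the thread.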
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-21 5:30 ` Dave Chinner @ 2013-12-22 9:18 ` Stan Hoeppner 2013-12-22 20:14 ` Dave Chinner 2013-12-22 21:01 ` Michael L. Semon 0 siblings, 2 replies; 22+ messages in thread From: Stan Hoeppner @ 2013-12-22 9:18 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, xfs, xfs On 12/20/2013 11:30 PM, Dave Chinner wrote: > On Fri, Dec 20, 2013 at 06:36:33AM -0600, Stan Hoeppner wrote: >> On 12/20/2013 4:26 AM, Kevin Richter wrote: >>> 'top' while copying with stripe size of 2048 (the source disk is ntfs): >>>> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 >>>> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie >>>> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st >>>> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers >>>> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached >>>> >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 >>>> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 >>>> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 >>>> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 >>>> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs >>>> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 >>>> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp >>>> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 >>>> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 >>>> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 >>>> 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1 >> >> Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a >> SandyBridge core at only ~180 MB/s throughput is it? > > Kworkers are an anonymous kernel worker threads that do work that > has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread > on CPU 3 (3:0 is the first). The kworker is a thread pool that grows > and shrinks according to demand. 
> > As the naming suggests, kworker threads are per-CPU, and there can > be hundreds of them per CPU if there is enough workqueue work that > blocks during execution (e.g. on locks, waiting for IO, etc). > If there is little blocking occurring, there might only be a couple > of kworker threads that do all the work, and hence you see them > consuming huge amounts of CPU on behalf of other subsystems... > > XFS uses workqueues for lots of things, so it's not unusual to see > an IO or metadata heavy workload end up with huge numbers of > kworker threads doing work: > > <run parallel fsmark test> > .... > $ ps -ef |grep kworker |wc -l > 91 For the record, again, I've never used dm-crypt. I'm just trying to work through the layers to identify the source of Kevin's problems. So please don't clobber me too hard for going OT, or trying to speak (somewhat) intelligently about something I'm just learning about... So, how does one identify who is doing work inside a worker thread? In this case we're trying to figure out if dm-crypt might be eating a CPU core thus limiting its throughput. However... I just read a thread on dm-devel about dm-crypt performance dated late March 2013. It wasn't clear to me if dm-crypt yet uses unbound workqueues, if it dispatches work to multiple queues, or if it still uses the hand-built dispatcher, as of kernel 3.11, which Kevin just updated to. What was clear is that as of March dm-crypt was not preserving IO ordering. If this is still the case, that would imply RMW with md parity arrays, which would tend to explain Kevin's write throughput being ~1/3 of his read, and less than a single disk. Mikulas Patocka posted 400-620 MB/s dm-crypt throughput with a RAMdisk. It was not stated which CPU/clock he was using. Even if it were a 4GHz core, Kevin's 2.4GHz SandyBridge should be capable of far more than ~80 MB/s of encryption. 
So if dm-crypt is contributing to Kevin's low throughput, I'd think it would be due to the IO ordering causing RMW, not due to CPU starvation. Is there an easy way for a mere mortal to peek under the hood and see if dm-crypt is causing RMW? Christoph, you were involved in that thread WRT IO ordering. Can you shed more light on this, and on whether it may be a factor here? -- Stan
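Stan's ~1/3 figure is roughly what back-of-envelope RMW accounting predicts. A sketch under assumed geometry (5-disk RAID6, so 3 data chunks plus P and Q per stripe; these op counts are illustrative, not measured on Kevin's array):

```shell
# Aligned full-stripe write: 5 disk writes deliver 3 chunks of user data.
# Misaligned RMW update of one chunk: read old chunk + old P + old Q,
# then write new chunk + new P + new Q = 6 disk ops for 1 user chunk.
awk 'BEGIN {
  aligned = 5.0 / 3   # disk ops per user chunk, full-stripe path
  rmw     = 6         # disk ops per user chunk, RMW path
  printf "aligned %.1f ops/chunk, RMW %d ops/chunk (~%.1fx penalty)\n",
         aligned, rmw, rmw / aligned
}'
```

A ~3.6x penalty would put write throughput near a third of read, which is in the neighborhood of what Kevin measured.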
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-22 9:18 ` Stan Hoeppner @ 2013-12-22 20:14 ` Dave Chinner 2013-12-22 21:01 ` Michael L. Semon 1 sibling, 0 replies; 22+ messages in thread From: Dave Chinner @ 2013-12-22 20:14 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Christoph Hellwig, xfs, xfs On Sun, Dec 22, 2013 at 03:18:05AM -0600, Stan Hoeppner wrote: > On 12/20/2013 11:30 PM, Dave Chinner wrote: > > On Fri, Dec 20, 2013 at 06:36:33AM -0600, Stan Hoeppner wrote: > >> On 12/20/2013 4:26 AM, Kevin Richter wrote: > >>> 'top' while copying with stripe size of 2048 (the source disk is ntfs): > >>>> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 > >>>> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie > >>>> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st > >>>> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers > >>>> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached > >>>> > >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >>>> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 > >>>> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 > >>>> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 > >>>> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 > >>>> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs > >>>> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 > >>>> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp > >>>> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 > >>>> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 > >>>> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 > >>>> 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1 > >> > >> Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a > >> SandyBridge core at only ~180 MB/s throughput is it? > > > > Kworkers are an anonymous kernel worker threads that do work that > > has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread > > on CPU 3 (3:0 is the first). 
The kworker is a thread pool that grows > > and shrinks according to demand. > > > > As the naming suggests, kworker threads are per-CPU, and there can > > be hundreds of them per CPU if there is enough workqueue work that > > blocks during execution (e.g. on locks, waiting for IO, etc). > > If there is little blocking occurring, there might only be a couple > > of kworker threads that do all the work, and hence you see them > > consuming huge amounts of CPU on behalf of other subsystems... > > > > XFS uses workqueues for lots of things, so it's not unusual to see > > an IO or metadata heavy workload end up with huge numbers of > > kworker threads doing work: > > > > <run parallel fsmark test> > > .... > > $ ps -ef |grep kworker |wc -l > > 91 > > For the record, again, I've never used dm-crypt. I'm just trying to > work through the layers to identify the source of Kevin's problems. So > please don't clobber me too hard for going OT, or trying to speak > (somewhat) intelligently about something I'm just learning about... > > So, how does one identify who is doing work inside a worker thread? In > this case we're trying to figure out if dm-crypt might be eating a CPU > core thus limiting its throughput. However... perf top is your friend for finding out where the kernel is spending its CPU time. It will tell you which functions (and call chains) are consuming CPU time rather than what individual threads are doing. If you really want to know what worker threads are doing, /proc/sysrq-trigger dumps the kworker threads with the name of the workqueue they are doing work on behalf of, as well as the stack trace. > > I just read a thread on dm-devel about dm-crypt performance dated late > March 2013. It wasn't clear to me if dm-crypt yet uses unbound > workqueues, if it dispatches work to multiple queues, or if it still > uses the hand-built dispatcher, as of kernel 3.11, which Kevin just > updated to. 
dm-crypt uses workqueues with a concurrency depth of 1, and kcryptd is defined as CPU intensive and so is scheduled like a long-running kernel thread rather than a workqueue. Decryption of reads is done on IO completion, so if you've only got one CPU processing IO completions, then read decryption will effectively be single-threaded due to the nature of per-CPU workqueues and a concurrency depth of 1. > What was clear is that as of March dm-crypt was not preserving IO > ordering. If this is still the case, that would imply RMW with md > parity arrays, which would tend to explain Kevin's write throughput > being ~1/3 of his read, and less than a single disk. Encryption adds IO latency, so unless you have lots of concurrency to hide the latency, dm-crypt will almost always be slower than the same IO to an equivalent unencrypted disk. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-22 9:18 ` Stan Hoeppner 2013-12-22 20:14 ` Dave Chinner @ 2013-12-22 21:01 ` Michael L. Semon 1 sibling, 0 replies; 22+ messages in thread From: Michael L. Semon @ 2013-12-22 21:01 UTC (permalink / raw) To: xfs On 12/22/2013 04:18 AM, Stan Hoeppner wrote: > For the record, again, I've never used dm-crypt. I'm just trying to > work through the layers to identify the source of Kevin's problems. So > please don't clobber me too hard for going OT, or trying to speak > (somewhat) intelligently about something I'm just learning about... This stuff isn't easy. There are a lot of security-vs-speed tradeoffs, and kernel support varies by processor and architecture. Take this example from x86: # info from fdisk: /dev/sda5 8404992 10502143 1048576 83 Linux # cryptsetup status /dev/mapper/wCrypt /dev/mapper/wCrypt is active. type: LUKS1 cipher: aes-xts-essiv:sha256 keysize: 512 bits device: /dev/sda5 offset: 4096 sectors size: 2093056 sectors mode: read/write What this gibberish means from a speed standpoint is this: aes: had assembly speedups in the kernel; NIST approved; xts: almost as fast as cbc while being more secure; essiv: "plain" is faster, but essiv is more secure; sha256: it seemed like an ok hash; 512 bits: much slower than 256 or 128 bits, but hopefully more secure; fine for a small partition like this. A similar "cryptsetup status <cryptname>" from Kevin might help you benchmark things, and Kevin might explain his rationale without blowing his cover ;-) A `zcat /proc/config.gz | grep CRYPTO` might help you see what kernel support he has as well. Dave's recent post about I/O latency with dm-crypt reads/writes might be the answer that you're seeking. I just wanted to put in my two cents. Good luck! Michael
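The cipher string Michael breaks down has a fixed shape, cipher-chainmode-ivmode[:ivhash]. A toy decoder over the exact spec quoted above (illustrative only; the variable names are mine, not cryptsetup's):

```shell
# Split "aes-xts-essiv:sha256" into the fields discussed above.
spec='aes-xts-essiv:sha256'
cipher=${spec%%-*}        # aes          - block cipher
rest=${spec#*-}
mode=${rest%%-*}          # xts          - chaining mode
iv=${rest#*-}             # essiv:sha256 - IV generator and its hash
echo "cipher=$cipher mode=$mode iv=$iv"
```

Each field is a separate speed knob: the cipher and mode pick the kernel crypto routine, while the IV generator adds a per-sector hash cost.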
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-20 12:36 ` Stan Hoeppner 2013-12-21 5:30 ` Dave Chinner @ 2013-12-22 2:35 ` Kevin Richter 2013-12-22 8:12 ` Stan Hoeppner 1 sibling, 1 reply; 22+ messages in thread From: Kevin Richter @ 2013-12-22 2:35 UTC (permalink / raw) To: stan, xfs Okay, well, I've run your fio config... but with so many results and abbrevations I currently feel a bit overchallenged :) So, please dont take badly, if I just paste the whole results. I have tried with a stripe cache size of 1024, 2048 and 4096. Btw, I also have /dev/md1, this is a 15GB unencrypted array using the same 5 disks like my LUKS-md2 array, so nearly the same just without this LUKS layer. If helpful, I can execute some fio tests on this filesystem, too. So long :) Kevin $ echo 1024 > /sys/block/md2/md/stripe_cache_size > Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s] > read: (groupid=0, jobs=8): err= 0: pid=12987 > read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec > slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66 > clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18 > lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18 > bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29 > cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=163840/0/0, short=0/0/0 > > lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59% > lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25% > write: (groupid=1, jobs=8): err= 0: pid=13202 > write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec > slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68 > clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82 > lat (msec): min=14 , 
max=11691 , avg=1101.39, stdev=1116.82 > bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35 > cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/163840/0, short=0/0/0 > > lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20% > lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64% > > Run status group 0 (all jobs): > READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec > > Run status group 1 (all jobs): > WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec > > Disk stats (read/write): > dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% > md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78% > sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78% > sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42% > sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90% > sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36% > sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26% $ echo 2048 > /sys/block/md2/md/stripe_cache_size > Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s] > read: (groupid=0, jobs=8): err= 0: pid=6392 > read : io=81920MB, 
bw=185893KB/s, iops=363 , runt=451259msec > slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00 > clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76 > lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76 > bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64 > cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=163840/0/0, short=0/0/0 > > lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36% > lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35% > write: (groupid=1, jobs=8): err= 0: pid=6663 > write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec > slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22 > clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37 > lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82 > bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80 > cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/163840/0, short=0/0/0 > > lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64% > lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23% > > Run status group 0 (all jobs): > READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec > > Run status group 1 (all jobs): > WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec > > Disk stats (read/write): > dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, 
aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% > md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20% > sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77% > sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44% > sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20% > sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89% > sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96% $ echo 4096 > /sys/block/md2/md/stripe_cache_size > Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s] > read: (groupid=0, jobs=8): err= 0: pid=11787 > read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec > slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74 > clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14 > lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14 > bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71 > cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=163840/0/0, short=0/0/0 > > lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56% > lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33% > write: (groupid=1, jobs=8): err= 0: pid=12060 > write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec > slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87 > clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56 > lat 
(msec): min=14 , max=2821 , avg=981.12, stdev=366.56 > bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28 > cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/163840/0, short=0/0/0 > > lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37% > lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16% > > Run status group 0 (all jobs): > READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec > > Run status group 1 (all jobs): > WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec > > Disk stats (read/write): > dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% > md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60% > sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57% > sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08% > sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86% > sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60% > sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84% Am 20.12.2013 13:36, schrieb Stan Hoeppner: > On 12/20/2013 4:26 AM, Kevin Richter wrote: >> Thanks a lot for your huge replies! > > You're welcome. > >>> Oh, that's quite old. 
I'd suggest upgrading to a much more recent >>> kernel as we've fixed lots of issues in this area since then. >> >> First I have switched to the newer kernel from Ubuntu Saucy: >> $ uname -a >> Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP >> >> Thus, it seems that the default scheduler has been changed to deadline. >> I did not change anything. After a reboot the schedulers of all disks >> are now deadline. > > Good move Ubuntu. > >>> Model # of the CPUs so I can look up the specs? >> Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz > > Strong CPUs. > >> I have prepared a folder with 60GB for the tests. This is nearly twice >> the available memory, so the process should be forced to actually >> write the stuff to the disk - and not only hold it in memory. >> >>> $ echo 256 > /sys/block/md2/md/stripe_cache_size >>> $ time cp -a /olddisk/testfolder /6tb/foo1/ >>> real 25m38.925s >>> user 0m0.595s >>> sys 1m23.182s >>> >>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size >>> $ time cp -a /olddisk/testfolder /raid/foo2/ >>> real 7m32.824s >>> user 0m0.438s >>> sys 1m6.759s >>> >>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size >>> $ time cp -a /olddisk/testfolder /raid/foo3/ >>> real 5m32.847s >>> user 0m0.418s >>> sys 1m5.671s >>> >>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size >>> $ time cp -a /olddisk/testfolder /raid/foo4/ >>> real 5m54.554s >>> user 0m0.437s >>> sys 1m6.268s >> >> The difference is really amazing! So 2048 seems to be the best choice. >> 60GB in 5.5 minutes is 180MB/s. That sounds a bit high, doesn't it? >> The RAID only consists of 5 SATA disks with 7200rpm. > > A lot of the source data is being cached between runs so these numbers > aren't accurate. The throughput of this copy operation will be limited > by the speed of the single source disk, not the array. 
To make the > elapsed times of this copy test accurate you need to execute something > like these commands after each run: > > # sync > # echo 3 > /proc/sys/vm/drop_caches > > But this copy test will not inform you about the potential peak > performance of your array. That's why I suggested you test with FIO, > the flexible IO tester. > > # aptitude install fio > # man fio > > Sample job file suitable for your system: > > [global] > directory=/your/XFS/test/directory > zero_buffers > numjobs=8 > group_reporting > blocksize=512k > ioengine=libaio > iodepth=16 > direct=1 > size=10g > > [read] > rw=read > stonewall > > [write] > rw=write > stonewall > > This should give you a relatively accurate picture of the actual > potential throughput of your array and filesystem. > >> 'top' while copying with stripe size of 2048 (the source disk is ntfs): >>> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 >>> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie >>> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st >>> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers >>> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached >>> >>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 >>> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 >>> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 >>> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 >>> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs >>> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 >>> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp >>> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 >>> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 >>> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 >>> 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1 > > Hmm, what's kworker/3:1? 
That's not a crypto thread eating 93% of a > SandyBridge core at only ~180 MB/s throughput, is it? > >> And the best thing of all: >> During all of these tests there were no warnings/panics in the syslog. >> >> With best regards, >> Kevin > > Even though XFS wasn't the cause of the problem I'm glad we were able to > help you fix it nonetheless. I'm really curious to see what kind of > throughput you can achieve with FIO, and whether crypto is a bottleneck > at the 250-350 MB/s your array should be capable of. It would be great > if you would play around a bit with FIO and post some numbers.
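Kevin's quick estimate above is arithmetically sound, which supports Stan's point that caching, not the array, produced the number. Checking the 2048 run (5m32.847s for 60GB, taken from the timings quoted earlier):

```shell
# 60 GB moved in ~333 s (the 5m32.847s run), as integer MB/s.
gb=60
seconds=333
mbps=$(( gb * 1024 / seconds ))
echo "${mbps} MB/s"   # ~184 MB/s, matching the "180MB/s" estimate
```

The figure is real arithmetic but not real disk throughput: with 32GB of RAM, roughly half the 60GB working set can be served from page cache on repeat runs, which is why Stan asks for drop_caches between tests.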
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-22 2:35 ` Kevin Richter @ 2013-12-22 8:12 ` Stan Hoeppner 2013-12-22 14:10 ` Kevin Richter 0 siblings, 1 reply; 22+ messages in thread From: Stan Hoeppner @ 2013-12-22 8:12 UTC (permalink / raw) To: xfs, xfs On 12/21/2013 8:35 PM, Kevin Richter wrote: > Okay, well, I've run your fio config... but with so many results and I'm glad you changed to size=1G instead of size=10G lest the test would have read/written 80GB instead of 8GB, and taken forever. I often forget the size parameter is per job (thread) not global. > abbrevations I currently feel a bit overchallenged :) > So, please dont take badly, if I just paste the whole results. > I have tried with a stripe cache size of 1024, 2048 and 4096. The full output is good. Going from 1024 to 2048 gained you ~17MB/s throughput with this job file. But you didn't run with the default 256 for comparison, which is very useful to know. That's ok, as you can do so with the next set of test runs. Given the huge latencies I'm thinking this config may be seeking the disks too much. Modify your fio job file as follows: numjobs=4 blocksize=1536k iodepth=4 size=1g as this may yield a little better results. ~190MB/s read and 75MB/s write is very low for three effective 7.2K RPM SATA spindles, especially the write throughput. Compare to a similar FIO job with 2 threads, iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive, NQC disabled, w/deadline elevator: Run status group 0 (all jobs): READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s Run status group 1 (all jobs): WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s 88MB/s read, 108MB/s write. Write is faster apparently due to the small 16MB drive write cache. With your eight 2.4GHz SandyBridge cores and md-RAID6 on much newer larger 7.2K drives w/larger caches, you should have significantly greater write speed than this old single SATA drive. 
Currently you're showing 30MB/s slower write throughput, some 40% slower than this single drive. With aligned writes, thus no RMW, you should be getting at least 2 spindles worth of throughput, or ~200MB/s. So, run FIO again with the mods above, once with stripe_cache_size 256 and once with 2048. Something seems to be seriously dragging down your throughput. I'm sure you'd like to know what, and how to fix it. > Btw, I also have /dev/md1, this is a 15GB unencrypted array using the > same 5 disks like my LUKS-md2 array, so nearly the same just without > this LUKS layer. If helpful, I can execute some fio tests on this > filesystem, too. This is also a RAID6 array, without dm-crypt? If so this will afford a great comparison, and may be very informative. Run FIO against this filesystem (I assume it's also XFS) as well with stripe_cache_size 512 and 2048. Four FIO tests total including the two on the big array. It'll be interesting to see where this leads. At this point the candidates I see are a single dm-crypt thread peaking a core, dm-crypt taking aligned writes and misaligning them causing all writes to be RMWs, or AF drives with misaligned partitions causing every write to be an internal RMW. Or maybe something I've not considered. 
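One quick way to rule the AF-drive theory in or out without special tools: an Advanced Format drive does internal RMW whenever a partition starts on a sector not divisible by 8 (8 x 512-byte sectors = 4096 bytes). A sketch (the helper and the sample start sectors are hypothetical; feed it real values from `fdisk -l`):

```shell
# Flag partition start sectors that are not 4KiB-aligned.
check_alignment() {
  name=$1 start=$2
  if [ $(( start % 8 )) -eq 0 ]; then
    echo "$name: start $start is 4K-aligned"
  else
    echo "$name: start $start is MISALIGNED (internal RMW on AF drives)"
  fi
}
check_alignment sdc1 2048   # modern partitioning default: aligned
check_alignment sdc2 63     # legacy DOS default: misaligned
```

If every partition under the md array starts on a multiple of 8, internal AF RMW is off the suspect list and the dm-crypt/md layers remain.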
> > So long :) > Kevin > > > $ echo 1024 > /sys/block/md2/md/stripe_cache_size >> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s] >> read: (groupid=0, jobs=8): err= 0: pid=12987 >> read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec >> slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66 >> clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18 >> lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18 >> bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29 >> cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982 >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=163840/0/0, short=0/0/0 >> >> lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59% >> lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25% >> write: (groupid=1, jobs=8): err= 0: pid=13202 >> write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec >> slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68 >> clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82 >> lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82 >> bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35 >> cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410 >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/163840/0, short=0/0/0 >> >> lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20% >> lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64% >> >> Run status group 0 (all jobs): >> READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec >> 
>> Run status group 1 (all jobs): >> WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec >> >> Disk stats (read/write): >> dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >> md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78% >> sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78% >> sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42% >> sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90% >> sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36% >> sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26% > > $ echo 2048 > /sys/block/md2/md/stripe_cache_size >> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s] >> read: (groupid=0, jobs=8): err= 0: pid=6392 >> read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec >> slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00 >> clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76 >> lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76 >> bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64 >> cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171 >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=163840/0/0, short=0/0/0 >> >> lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36% >> lat 
(msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35% >> write: (groupid=1, jobs=8): err= 0: pid=6663 >> write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec >> slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22 >> clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37 >> lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82 >> bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80 >> cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531 >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/163840/0, short=0/0/0 >> >> lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64% >> lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23% >> >> Run status group 0 (all jobs): >> READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec >> >> Run status group 1 (all jobs): >> WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec >> >> Disk stats (read/write): >> dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >> md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20% >> sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77% >> sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44% >> sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20% >> sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, 
util=72.89% >> sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96% > > $ echo 4096 > /sys/block/md2/md/stripe_cache_size >> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s] >> read: (groupid=0, jobs=8): err= 0: pid=11787 >> read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec >> slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74 >> clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14 >> lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14 >> bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71 >> cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313 >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=163840/0/0, short=0/0/0 >> >> lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56% >> lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33% >> write: (groupid=1, jobs=8): err= 0: pid=12060 >> write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec >> slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87 >> clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56 >> lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56 >> bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28 >> cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723 >> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/163840/0, short=0/0/0 >> >> lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37% >> lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16% >> >> Run status group 0 (all jobs): >> READ: io=81920MB, 
aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec >> >> Run status group 1 (all jobs): >> WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec >> >> Disk stats (read/write): >> dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >> md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60% >> sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57% >> sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08% >> sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86% >> sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60% >> sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84% > > > > > > On 20.12.2013 13:36, Stan Hoeppner wrote: >> On 12/20/2013 4:26 AM, Kevin Richter wrote: >>> Thanks a lot for your huge replies! >> >> You're welcome. >> >>>> Oh, that's quite old. I'd suggest upgrading to a much more recent >>>> kernel as we've fixed lots of issues in this area since then. >>> >>> First I switched to the newer kernel from Ubuntu Saucy: >>> $ uname -a >>> Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP >>> >>> Thus, it seems that the default scheduler has been changed to deadline. >>> I did not change anything. After a reboot the schedulers of all disks >>> are now deadline. >> >> Good move, Ubuntu. >> >>>> Model # of the CPUs so I can look up the specs? >>> Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz >> >> Strong CPUs. >> >>> I have prepared a folder with 60GB for the tests. 
This is nearly twice >>> the available memory, so the process should be forced to actually >>> write the data to disk, and not just hold it in memory. >>> >>>> $ echo 256 > /sys/block/md2/md/stripe_cache_size >>>> $ time cp -a /olddisk/testfolder /6tb/foo1/ >>>> real 25m38.925s >>>> user 0m0.595s >>>> sys 1m23.182s >>>> >>>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size >>>> $ time cp -a /olddisk/testfolder /raid/foo2/ >>>> real 7m32.824s >>>> user 0m0.438s >>>> sys 1m6.759s >>>> >>>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size >>>> $ time cp -a /olddisk/testfolder /raid/foo3/ >>>> real 5m32.847s >>>> user 0m0.418s >>>> sys 1m5.671s >>>> >>>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size >>>> $ time cp -a /olddisk/testfolder /raid/foo4/ >>>> real 5m54.554s >>>> user 0m0.437s >>>> sys 1m6.268s >>> >>> The difference is really amazing! So 2048 seems to be the best choice. >>> 60GB in 5.5 minutes is 180MB/s. That sounds a bit high, doesn't it? >>> The RAID only consists of 5 SATA disks with 7200rpm. >> >> A lot of the source data is being cached between runs so these numbers >> aren't accurate. The throughput of this copy operation will be limited >> by the speed of the single source disk, not the array. To make the >> elapsed times of this copy test accurate you need to execute something >> like these commands after each run: >> >> # sync >> # echo 3 > /proc/sys/vm/drop_caches >> >> But this copy test will not inform you about the potential peak >> performance of your array. That's why I suggested you test with FIO, >> the flexible IO tester. 
>> >> # aptitude install fio >> # man fio >> >> Sample job file suitable for your system: >> >> [global] >> directory=/your/XFS/test/directory >> zero_buffers >> numjobs=8 >> group_reporting >> blocksize=512k >> ioengine=libaio >> iodepth=16 >> direct=1 >> size=10g >> >> [read] >> rw=read >> stonewall >> >> [write] >> rw=write >> stonewall >> >> This should give you a relatively accurate picture of the actual >> potential throughput of your array and filesystem. >> >>> 'top' while copying with stripe size of 2048 (the source disk is ntfs): >>>> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17 >>>> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie >>>> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st >>>> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers >>>> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached >>>> >>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1 >>>> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1 >>>> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0 >>>> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6 >>>> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs >>>> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2 >>>> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp >>>> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1 >>>> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0 >>>> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2 >>>> 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1 >> >> Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a >> SandyBridge core at only ~180 MB/s throughput is it? >> >>> And the best thing at all: >>> While all of these tests there are no warnings/panics in the syslog. >>> >>> With best regards, >>> Kevin >> >> Even though XFS wasn't the cause of the problem I'm glad we were able to >> help you fix it nonetheless. 
I'm really curious to see what kind of >> throughput you can achieve with FIO, and whether crypto is a bottleneck >> at the 250-350 MB/s your array should be capable of. It would be great >> if you would play around a bit with FIO and post some numbers. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
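The cache-dropping step Stan describes can be wrapped in a small helper so it isn't forgotten between timed copy runs (a sketch; `drop_caches` is only defined here, needs root to actually execute, and the copy paths are the ones from the thread):

```shell
# Helper for accurate copy timings: flush dirty pages, then evict the
# page cache, dentries and inodes, so the next run reads from disk
# instead of memory. Needs root; only defined here, not executed.
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# Usage between timed runs (as root):
#   drop_caches
#   time cp -a /olddisk/testfolder /raid/foo2/
```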
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-22 8:12 ` Stan Hoeppner @ 2013-12-22 14:10 ` Kevin Richter 2013-12-22 17:29 ` Stan Hoeppner 0 siblings, 1 reply; 22+ messages in thread From: Kevin Richter @ 2013-12-22 14:10 UTC (permalink / raw) To: stan, xfs > I'm glad you changed to size=1G instead of size=10G No, I didn't change that. 160GB were written to these disks: 80 for the read part, 80 for the write part. Each test took around 30 minutes. > This is also a RAID6 array, without dm-crypt? Yes, exactly. The same 5 disks in RAID6 mode, just a different partition on those disks in each case. And yes, of course, XFS. So, with your new config I have made 7 fio runs: 1) 15GB RAID6 array, stripe 256 2) 15GB RAID6 array, stripe 1024 3) 15GB RAID6 array, stripe 2048 4) 15GB RAID6 array, stripe 4096 5) 3TB LUKS volume (no raid, just one single disk) 6) 6TB RAID6+LUKS array, stripe 256 7) 6TB RAID6+LUKS array, stripe 2048 On the 15GB array a stripe cache size of 4096 seems to get slightly more throughput than 2048. See below for the results :-) Perhaps I should give you some more information about the server hardware, which might (but hopefully does not) explain the low performance: The server is this one: http://www.supermicro.com/products/system/2u/6027/ssg-6027r-e1r12l.cfm with a 12-port SAS expander, which is this one: http://www.thomas-krenn.com/de/wiki/BPN-SAS-826EL1_SAS_Expander_Backplane But the backplane should have a throughput of 3Gb/s. However... 
I am very curious about your comments :-) Bye, Kevin ################################# 1) 15GB RAID6 volume, stripe 256: > read: (groupid=0, jobs=4): err= 0: pid=3959 > read : io=4092.0MB, bw=217900KB/s, iops=141 , runt= 19230msec > slat (usec): min=190 , max=1325 , avg=231.32, stdev=77.13 > clat (msec): min=7 , max=326 , avg=111.45, stdev=67.48 > lat (msec): min=7 , max=326 , avg=111.69, stdev=67.48 > bw (KB/s) : min=29425, max=109280, per=24.98%, avg=54423.27, stdev=15887.11 > cpu : usr=0.03%, sys=0.90%, ctx=2614, majf=0, minf=6233 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 10=0.66%, 20=4.84%, 50=16.35%, 100=25.70%, 250=47.91% > lat (msec): 500=4.55% > write: (groupid=1, jobs=4): err= 0: pid=3964 > write: io=4092.0MB, bw=54044KB/s, iops=35 , runt= 77533msec > slat (msec): min=9 , max=339 , avg=113.48, stdev=36.86 > clat (msec): min=2 , max=744 , avg=339.51, stdev=70.02 > lat (msec): min=23 , max=927 , avg=453.00, stdev=82.81 > bw (KB/s) : min= 5475, max=20756, per=24.92%, avg=13467.27, stdev=2107.66 > cpu : usr=0.58%, sys=6.26%, ctx=343060, majf=0, minf=550 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 4=0.04%, 10=0.04%, 50=0.11%, 100=0.26%, 250=5.94% > lat (msec): 500=91.50%, 750=2.13% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=217899KB/s, minb=223129KB/s, maxb=223129KB/s, mint=19230msec, maxt=19230msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=54044KB/s, minb=55341KB/s, maxb=55341KB/s, mint=77533msec, maxt=77533msec > > Disk stats (read/write): > 
md1: ios=10912/10904, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=22836/29872, aggrmerge=364645/574313, aggrticks=216408/101340, aggrin_queue=317789, aggrutil=69.36% > sdc: ios=23542/29233, merge=377270/563331, ticks=217520/152300, in_queue=369860, util=67.99% > sdd: ios=23462/28930, merge=378915/557686, ticks=214104/142224, in_queue=356428, util=67.81% > sde: ios=22264/29920, merge=355831/576114, ticks=149184/99724, in_queue=248924, util=58.22% > sdf: ios=22240/30724, merge=352471/588412, ticks=126740/52916, in_queue=179696, util=55.44% > sdg: ios=22676/30557, merge=358741/586025, ticks=374496/59540, in_queue=434040, util=69.36% ################################# 2) 15GB RAID6 array, stripe 1024: > read: (groupid=0, jobs=4): err= 0: pid=4334 > read : io=4092.0MB, bw=237702KB/s, iops=154 , runt= 17628msec > slat (usec): min=196 , max=1469 , avg=232.84, stdev=88.56 > clat (msec): min=6 , max=391 , avg=101.62, stdev=56.61 > lat (msec): min=7 , max=391 , avg=101.85, stdev=56.61 > bw (KB/s) : min=26482, max=107305, per=24.94%, avg=59289.94, stdev=14153.87 > cpu : usr=0.04%, sys=0.98%, ctx=2581, majf=0, minf=6241 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 10=0.22%, 20=3.08%, 50=16.83%, 100=31.01%, 250=47.10% > lat (msec): 500=1.76% > write: (groupid=1, jobs=4): err= 0: pid=4340 > write: io=4092.0MB, bw=120759KB/s, iops=78 , runt= 34699msec > slat (msec): min=1 , max=156 , avg=47.70, stdev=27.94 > clat (msec): min=15 , max=520 , avg=152.37, stdev=53.34 > lat (msec): min=28 , max=564 , avg=200.07, stdev=57.63 > bw (KB/s) : min= 0, max=62270, per=25.04%, avg=30239.54, stdev=5435.51 > cpu : usr=1.34%, sys=5.68%, ctx=134184, majf=0, minf=94 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 
0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 20=0.04%, 50=1.10%, 100=12.02%, 250=82.18%, 500=4.62% > lat (msec): 750=0.04% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=237701KB/s, minb=243406KB/s, maxb=243406KB/s, mint=17628msec, maxt=17628msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=120758KB/s, minb=123656KB/s, maxb=123656KB/s, mint=34699msec, maxt=34699msec > > Disk stats (read/write): > md1: ios=10912/10901, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4062/6768, aggrmerge=109645/369712, aggrticks=157952/111924, aggrin_queue=270487, aggrutil=89.91% > sdc: ios=4070/6893, merge=109432/378682, ticks=173784/179392, in_queue=353780, util=89.88% > sdd: ios=4041/6961, merge=110017/380798, ticks=164960/160236, in_queue=325820, util=89.91% > sde: ios=4039/6727, merge=110658/366163, ticks=169804/166052, in_queue=336484, util=88.98% > sdf: ios=4080/6591, merge=108922/359343, ticks=39460/24636, in_queue=64820, util=52.33% > sdg: ios=4081/6668, merge=109197/363574, ticks=241752/29308, in_queue=271532, util=73.63% ################################# 3) 15GB RAID6 array, stripe 2048: > read: (groupid=0, jobs=4): err= 0: pid=4102 > read : io=4092.0MB, bw=207827KB/s, iops=135 , runt= 20162msec > slat (usec): min=122 , max=3904 , avg=230.75, stdev=99.04 > clat (msec): min=5 , max=363 , avg=117.03, stdev=70.26 > lat (msec): min=5 , max=363 , avg=117.27, stdev=70.25 > bw (KB/s) : min=26533, max=104239, per=25.00%, avg=51947.24, stdev=13121.27 > cpu : usr=0.05%, sys=0.83%, ctx=2640, majf=0, minf=6241 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 
10=0.55%, 20=4.99%, 50=14.96%, 100=21.33%, 250=53.01% > lat (msec): 500=5.17% > write: (groupid=1, jobs=4): err= 0: pid=4108 > write: io=4092.0MB, bw=145833KB/s, iops=94 , runt= 28733msec > slat (usec): min=768 , max=94184 , avg=1838.20, stdev=4140.24 > clat (msec): min=15 , max=1155 , avg=145.02, stdev=128.70 > lat (msec): min=16 , max=1156 , avg=146.86, stdev=128.77 > bw (KB/s) : min= 6144, max=104448, per=29.22%, avg=42617.70, stdev=17118.84 > cpu : usr=1.63%, sys=2.52%, ctx=4086, majf=0, minf=94 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 20=0.37%, 50=9.02%, 100=38.42%, 250=39.22%, 500=10.12% > lat (msec): 750=2.09%, 1000=0.59%, 2000=0.18% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s, mint=20162msec, maxt=20162msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s, mint=28733msec, maxt=28733msec > > Disk stats (read/write): > md1: ios=10912/10873, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3018/4990, aggrmerge=79606/371231, aggrticks=152741/83319, aggrin_queue=236284, aggrutil=92.45% > sdc: ios=2988/5046, merge=79033/376737, ticks=129600/103552, in_queue=233412, util=84.14% > sdd: ios=3010/5140, merge=77101/390954, ticks=119032/103236, in_queue=222536, util=85.87% > sde: ios=3039/5009, merge=79935/374563, ticks=196200/159792, in_queue=356236, util=92.45% > sdf: ios=3021/4883, merge=80483/356643, ticks=29976/24956, in_queue=55100, util=49.53% > sdg: ios=3033/4872, merge=81481/357261, ticks=288900/25060, in_queue=314136, util=76.17% ################################# 4) 15GB RAID6 array, stripe 4096: > read: (groupid=0, jobs=4): err= 0: pid=4411 > read : io=4092.0MB, bw=214938KB/s, iops=139 , runt= 
19495msec > slat (usec): min=192 , max=1468 , avg=226.97, stdev=87.18 > clat (msec): min=4 , max=339 , avg=112.51, stdev=63.05 > lat (msec): min=4 , max=339 , avg=112.73, stdev=63.05 > bw (KB/s) : min= 0, max=96000, per=24.87%, avg=53457.49, stdev=13728.88 > cpu : usr=0.02%, sys=0.89%, ctx=2591, majf=0, minf=6241 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 10=0.15%, 20=4.62%, 50=13.56%, 100=26.43%, 250=52.20% > lat (msec): 500=3.04% > write: (groupid=1, jobs=4): err= 0: pid=4419 > write: io=4092.0MB, bw=146526KB/s, iops=95 , runt= 28597msec > slat (usec): min=822 , max=36057 , avg=1585.60, stdev=1174.08 > clat (msec): min=14 , max=1132 , avg=142.49, stdev=126.05 > lat (msec): min=15 , max=1134 , avg=144.08, stdev=126.08 > bw (KB/s) : min= 0, max=83836, per=28.88%, avg=42315.69, stdev=18036.84 > cpu : usr=1.42%, sys=2.85%, ctx=2937, majf=0, minf=94 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 20=0.18%, 50=10.08%, 100=37.61%, 250=40.62%, 500=8.50% > lat (msec): 750=2.46%, 1000=0.44%, 2000=0.11% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=214937KB/s, minb=220096KB/s, maxb=220096KB/s, mint=19495msec, maxt=19495msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=146526KB/s, minb=150042KB/s, maxb=150042KB/s, mint=28597msec, maxt=28597msec > > Disk stats (read/write): > md1: ios=10912/10895, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3013/5053, aggrmerge=78083/371841, aggrticks=151707/88554, aggrin_queue=240554, aggrutil=92.72% > sdc: ios=2981/5124, 
merge=77153/379556, ticks=142800/116248, in_queue=259356, util=87.13% > sdd: ios=2990/5182, merge=75443/390196, ticks=128184/107752, in_queue=236264, util=86.87% > sde: ios=3034/5057, merge=77882/372755, ticks=185048/159352, in_queue=344636, util=92.72% > sdf: ios=3035/4941, merge=79423/356692, ticks=29196/28100, in_queue=57520, util=50.56% > sdg: ios=3026/4963, merge=80518/360009, ticks=273308/31320, in_queue=304996, util=76.24% ################################################### 5) 3TB LUKS volume (no raid, just one single disk): > read: (groupid=0, jobs=4): err= 0: pid=31359 > read : io=4092.0MB, bw=125032KB/s, iops=81 , runt= 33513msec > slat (usec): min=86 , max=1404 , avg=125.10, stdev=97.21 > clat (msec): min=26 , max=323 , avg=195.52, stdev=39.04 > lat (msec): min=26 , max=323 , avg=195.65, stdev=39.03 > bw (KB/s) : min=21982, max=41890, per=25.12%, avg=31405.62, stdev=3662.77 > cpu : usr=0.02%, sys=0.31%, ctx=2738, majf=0, minf=6233 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 50=0.51%, 100=2.38%, 250=93.07%, 500=4.03% > write: (groupid=1, jobs=4): err= 0: pid=31364 > write: io=4092.0MB, bw=158193KB/s, iops=102 , runt= 26488msec > slat (usec): min=181 , max=673 , avg=442.78, stdev=67.26 > clat (msec): min=21 , max=555 , avg=154.33, stdev=45.35 > lat (msec): min=21 , max=555 , avg=154.77, stdev=45.35 > bw (KB/s) : min=19097, max=45988, per=25.04%, avg=39617.62, stdev=5714.79 > cpu : usr=1.00%, sys=0.25%, ctx=2737, majf=0, minf=86 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 50=0.44%, 
100=1.54%, 250=94.61%, 500=3.15%, 750=0.26% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s, mint=33513msec, maxt=33513msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s, mint=26488msec, maxt=26488msec > > Disk stats (read/write): > dm-1: ios=8184/8196, merge=0/0, ticks=1521200/1235264, in_queue=2759104, util=99.58%, aggrios=8184/8225, aggrmerge=0/31, aggrticks=1421660/1137984, aggrin_queue=2559612, aggrutil=99.49% > sda: ios=8184/8225, merge=0/31, ticks=1421660/1137984, in_queue=2559612, util=99.49% ############################## 6) 6TB LUKS array, stripe 256: > read: (groupid=0, jobs=4): err= 0: pid=31121 > read : io=4092.0MB, bw=137700KB/s, iops=89 , runt= 30430msec > slat (usec): min=123 , max=1353 , avg=157.35, stdev=91.53 > clat (msec): min=15 , max=545 , avg=148.80, stdev=61.30 > lat (msec): min=15 , max=545 , avg=148.96, stdev=61.29 > bw (KB/s) : min= 0, max=129024, per=29.75%, avg=40959.72, stdev=21470.47 > cpu : usr=0.01%, sys=0.51%, ctx=2797, majf=0, minf=6241 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 20=0.29%, 50=9.90%, 100=12.76%, 250=75.07%, 500=1.58% > lat (msec): 750=0.40% > write: (groupid=1, jobs=4): err= 0: pid=31131 > write: io=4092.0MB, bw=41225KB/s, iops=26 , runt=101643msec > slat (usec): min=197 , max=1309 , avg=557.69, stdev=124.14 > clat (msec): min=42 , max=2648 , avg=586.07, stdev=393.18 > lat (msec): min=43 , max=2649 , avg=586.63, stdev=393.19 > bw (KB/s) : min= 0, max=35583, per=26.36%, avg=10867.00, stdev=5232.75 > cpu : usr=0.33%, sys=0.09%, ctx=2699, majf=0, minf=94 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 
8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 50=0.07%, 100=1.39%, 250=18.55%, 500=31.82%, 750=20.49% > lat (msec): 1000=12.57%, 2000=14.66%, >=2000=0.44% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=137699KB/s, minb=141004KB/s, maxb=141004KB/s, mint=30430msec, maxt=30430msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=41224KB/s, minb=42214KB/s, maxb=42214KB/s, mint=101643msec, maxt=101643msec > > Disk stats (read/write): > dm-0: ios=10912/10906, merge=0/0, ticks=1385524/5596828, in_queue=6983364, util=99.89%, aggrios=10912/10940, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% > md2: ios=10912/10940, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20710/27494, aggrmerge=358521/573067, aggrticks=208334/119186, aggrin_queue=327551, aggrutil=61.95% > sdc: ios=21005/27276, merge=366645/567533, ticks=224772/179916, in_queue=404728, util=61.95% > sdd: ios=21154/26926, merge=369215/564213, ticks=212876/160624, in_queue=373552, util=60.42% > sde: ios=20460/27449, merge=354440/571659, ticks=127008/102188, in_queue=229220, util=48.28% > sdf: ios=20464/27908, merge=350591/581422, ticks=241608/78180, in_queue=319784, util=55.24% > sdg: ios=20470/27915, merge=351718/580509, ticks=235408/75024, in_queue=310472, util=55.91% ############################### 7) 6TB LUKS array, stripe 2048: > read: (groupid=0, jobs=4): err= 0: pid=30910 > read : io=4092.0MB, bw=237500KB/s, iops=154 , runt= 17643msec > slat (usec): min=134 , max=1511 , avg=165.25, stdev=96.45 > clat (msec): min=16 , max=859 , avg=101.07, stdev=61.40 > lat (msec): min=17 , max=859 , avg=101.23, stdev=61.40 > bw (KB/s) : min=32807, max=97523, per=25.40%, avg=60325.02, stdev=15396.63 > cpu : usr=0.02%, sys=0.76%, ctx=2709, majf=0, minf=6233 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 
4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=2728/0/0, short=0/0/0 > > lat (msec): 20=0.07%, 50=13.38%, 100=48.90%, 250=34.38%, 500=3.08% > lat (msec): 750=0.15%, 1000=0.04% > write: (groupid=1, jobs=4): err= 0: pid=30915 > write: io=4092.0MB, bw=90652KB/s, iops=59 , runt= 46223msec > slat (usec): min=173 , max=983 , avg=599.60, stdev=141.06 > clat (msec): min=34 , max=1185 , avg=258.48, stdev=135.97 > lat (msec): min=34 , max=1185 , avg=259.08, stdev=135.98 > bw (KB/s) : min= 0, max=88722, per=25.95%, avg=23524.06, stdev=9036.31 > cpu : usr=0.77%, sys=0.22%, ctx=2576, majf=0, minf=86 > IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w/d: total=0/2728/0, short=0/0/0 > > lat (msec): 50=0.48%, 100=7.00%, 250=49.01%, 500=37.72%, 750=5.10% > lat (msec): 1000=0.59%, 2000=0.11% > > Run status group 0 (all jobs): > READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s, mint=17643msec, maxt=17643msec > > Run status group 1 (all jobs): > WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s, mint=46223msec, maxt=46223msec > > Disk stats (read/write): > dm-0: ios=10912/10875, merge=0/0, ticks=544880/2481824, in_queue=3026996, util=99.57%, aggrios=10912/10930, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% > md2: ios=10912/10930, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5058/6798, aggrmerge=153871/404343, aggrticks=195870/113487, aggrin_queue=309447, aggrutil=92.15% > sdc: ios=5024/7100, merge=152571/421782, ticks=332612/204936, in_queue=537660, util=92.15% > sdd: ios=4961/7109, merge=150591/422166, ticks=257308/161200, in_queue=418608, util=88.11% > sde: ios=5019/6697, merge=154781/395583, ticks=226120/152068, in_queue=378296, 
util=85.49% >> sdf: ios=5118/6463, merge=156068/385905, ticks=65716/23932, in_queue=89724, util=55.85% >> sdg: ios=5172/6624, merge=155347/396280, ticks=97596/25300, in_queue=122948, util=60.00% On 22.12.2013 09:12, Stan Hoeppner wrote: > On 12/21/2013 8:35 PM, Kevin Richter wrote: >> Okay, well, I've run your fio config... but with so many results and > > I'm glad you changed to size=1G instead of size=10G, otherwise the test would > have read/written 80GB instead of 8GB, and taken forever. I often > forget the size parameter is per job (thread), not global. > >> abbreviations I currently feel a bit overwhelmed :) >> So, please don't take it badly if I just paste the whole results. >> I have tried with a stripe cache size of 1024, 2048 and 4096. > > The full output is good. Going from 1024 to 2048 gained you ~17MB/s > throughput with this job file. But you didn't run with the default 256 > for comparison, which is very useful to know. That's ok, as you can do > so with the next set of test runs. Given the huge latencies I'm > thinking this config may be seeking the disks too much. Modify your fio > job file as follows: > > numjobs=4 > blocksize=1536k > iodepth=4 > size=1g > > as this may yield slightly better results. ~190MB/s read and 75MB/s > write is very low for three effective 7.2K RPM SATA spindles, especially > the write throughput. Compare to a similar FIO job with 2 threads, > iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive, NCQ > disabled, w/deadline elevator: > > Run status group 0 (all jobs): > READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s > Run status group 1 (all jobs): > WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s > > 88MB/s read, 108MB/s write. Write is faster apparently due to the small > 16MB drive write cache. 
With your eight 2.4GHz SandyBridge cores and > md-RAID6 on much newer larger 7.2K drives w/larger caches, you should > have significantly greater write speed than this old single SATA drive. > Currently you're showing 30MB/s slower write throughput, some 40% > slower than this single drive. With aligned writes, thus no RMW, you > should be getting at least 2 spindles' worth of throughput, or ~200MB/s. > > So, run FIO again with the mods above, once with stripe_cache_size 256 > and once with 2048. Something seems to be seriously dragging down your > throughput. I'm sure you'd like to know what, and how to fix it. > >> Btw, I also have /dev/md1, this is a 15GB unencrypted array using the >> same 5 disks as my LUKS-md2 array, so nearly the same, just without >> this LUKS layer. If helpful, I can execute some fio tests on this >> filesystem, too. > > This is also a RAID6 array, without dm-crypt? If so this will afford a > great comparison, and may be very informative. Run FIO against this > filesystem (I assume it's also XFS) as well with stripe_cache_size 512 > and 2048. Four FIO tests total including the two on the big array. > > It'll be interesting to see where this leads. At this point the > candidates I see are a single dm-crypt thread peaking a core, dm-crypt > taking aligned writes and misaligning them causing all writes to be > RMWs, or AF drives with misaligned partitions causing every write to be > an internal RMW. Or maybe something I've not considered. 
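For reference, applying those four parameter changes to the job file Stan posted earlier in the thread would give roughly the following (the directory path is a placeholder, and the stripe-alignment remark is an assumption, not something confirmed in the thread):

```ini
; Sketch of the modified fio job: numjobs=4, blocksize=1536k,
; iodepth=4, size=1g, per Stan's suggestion. 1536k = 3 x 512k,
; which would match one full stripe of the 5-disk RAID6 if the
; md chunk size is 512k (assumed, not stated above).
[global]
directory=/your/XFS/test/directory   ; placeholder path
zero_buffers
numjobs=4
group_reporting
blocksize=1536k
ioengine=libaio
iodepth=4
direct=1
size=1g

[read]
rw=read
stonewall

[write]
rw=write
stonewall
```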
> > >> >> So long :) >> Kevin >> >> >> $ echo 1024 > /sys/block/md2/md/stripe_cache_size >>> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s] >>> read: (groupid=0, jobs=8): err= 0: pid=12987 >>> read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec >>> slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66 >>> clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18 >>> lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18 >>> bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29 >>> cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982 >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>> issued r/w/d: total=163840/0/0, short=0/0/0 >>> >>> lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59% >>> lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25% >>> write: (groupid=1, jobs=8): err= 0: pid=13202 >>> write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec >>> slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68 >>> clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82 >>> lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82 >>> bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35 >>> cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410 >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>> issued r/w/d: total=0/163840/0, short=0/0/0 >>> >>> lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20% >>> lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64% >>> >>> Run status group 0 (all jobs): >>> READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, 
maxb=194390KB/s, mint=441890msec, maxt=441890msec >>> >>> Run status group 1 (all jobs): >>> WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec >>> >>> Disk stats (read/write): >>> dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >>> md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78% >>> sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78% >>> sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42% >>> sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90% >>> sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36% >>> sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26% >> >> $ echo 2048 > /sys/block/md2/md/stripe_cache_size >>> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s] >>> read: (groupid=0, jobs=8): err= 0: pid=6392 >>> read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec >>> slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00 >>> clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76 >>> lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76 >>> bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64 >>> cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171 >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>> issued r/w/d: total=163840/0/0, short=0/0/0 >>> 
>>> lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36% >>> lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35% >>> write: (groupid=1, jobs=8): err= 0: pid=6663 >>> write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec >>> slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22 >>> clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37 >>> lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82 >>> bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80 >>> cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531 >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>> issued r/w/d: total=0/163840/0, short=0/0/0 >>> >>> lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64% >>> lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23% >>> >>> Run status group 0 (all jobs): >>> READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec >>> >>> Run status group 1 (all jobs): >>> WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec >>> >>> Disk stats (read/write): >>> dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >>> md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20% >>> sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77% >>> sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44% >>> sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, 
util=94.20% >>> sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89% >>> sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96% >> >> $ echo 4096 > /sys/block/md2/md/stripe_cache_size >>> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s] >>> read: (groupid=0, jobs=8): err= 0: pid=11787 >>> read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec >>> slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74 >>> clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14 >>> lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14 >>> bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71 >>> cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313 >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>> issued r/w/d: total=163840/0/0, short=0/0/0 >>> >>> lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56% >>> lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33% >>> write: (groupid=1, jobs=8): err= 0: pid=12060 >>> write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec >>> slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87 >>> clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56 >>> lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56 >>> bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28 >>> cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723 >>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>> issued r/w/d: total=0/163840/0, short=0/0/0 >>> >>> lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 
500=5.37% >>> lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16% >>> >>> Run status group 0 (all jobs): >>> READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec >>> >>> Run status group 1 (all jobs): >>> WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec >>> >>> Disk stats (read/write): >>> dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >>> md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60% >>> sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57% >>> sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08% >>> sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86% >>> sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60% >>> sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84% _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-22 14:10 ` Kevin Richter @ 2013-12-22 17:29 ` Stan Hoeppner 0 siblings, 0 replies; 22+ messages in thread From: Stan Hoeppner @ 2013-12-22 17:29 UTC (permalink / raw) To: xfs, xfs On 12/22/2013 8:10 AM, Kevin Richter wrote: >> I'm glad you changed to size=1G instead of size=10G > No no, I didn't change that. 160GB has been written to these disks: 80 > for the read part, 80 for the write part. Each test took around 30 minutes. I read this "read : io=81920MB" without the zero, hence my incorrect assumption. Again, my apologies for the lengthy test runs. Use "1g" from now on as that should be plenty to get good data, and run quickly. >> This is also a RAID6 array, without dm-crypt? > Yeah, right. The same 5 disks in RAID6 mode, just in each case another > partition on these disks. And yes, of course, XFS. > > So, with your new config I have made 7 fio runs: > 1) 15GB RAID6 array, stripe 256 > 2) 15GB RAID6 array, stripe 1024 > 3) 15GB RAID6 array, stripe 2048 > 4) 15GB RAID6 array, stripe 4096 > 5) 3TB LUKS volume (no raid, just one single disk) > 6) 6TB RAID6+LUKS array, stripe 256 > 7) 6TB RAID6+LUKS array, stripe 2048 > > In the 15GB array a stripe_cache_size of 4096 seems to get slightly more > throughput than 2048. > > See below for the results :-) > > Perhaps I should give you some more information about the server > hardware, which might (but hopefully does not) explain the low performance: > The server is this one: > http://www.supermicro.com/products/system/2u/6027/ssg-6027r-e1r12l.cfm > with a 12-port SAS expander, which is this one: > http://www.thomas-krenn.com/de/wiki/BPN-SAS-826EL1_SAS_Expander_Backplane > But the backplane should have a throughput of 3Gb/s The hardware is fine. You have 1.2GB/s <-> the backplane (a 4-lane 3Gb/s SAS uplink carries ~1.2GB/s of data after 8b/10b encoding). > However...
I am very curious about your comments :-) 3) 15GB RAID6 array, stripe 2048: READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s ~50MB/s per drive * 5 drives = ~250 MB/s This may simply be the md-RAID6 max write you'll get from these drives. But 145MB/s aggregate write throughput is still decidedly better than 90MB/s, some 62% faster than the encrypted array. 5) 3TB LUKS volume (no raid, just one single disk): READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s ~160MB/s per drive, ~160MB/s dm-crypt throughput 7) 6TB LUKS array, stripe 2048: READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s ~30MB/s per drive * 5 drives = ~150MB/s dm-crypt throughput. Write throughput is ~100MB/s lower on your RAID6 w/dm-crypt than without. Read throughput however is ~30MB/s higher, though these differences may not be as great with multiple runs and normalized data. Even so, since the read throughput is roughly the same at 71-81MB/s per drive, both with/out encryption, I don't believe the difference in write speed is due to platter placement of the filesystems and test files. So at this point I'd guess we need to look at dm-crypt, which may be peaking a core or unaligning your IOs causing RMW cycles. And look at your partitions under the encrypted RAID6 to see if they're aligned. We can identify the latter easily if you post 'fdisk -l' output for each of the 5 drives. Whether partitions are misaligned or dm-crypt is the problem, the fix in either case is very destructive and time consuming, requiring a full backup/restore after redoing the stack. 
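The alignment check Stan describes boils down to looking at the partition start sectors that 'fdisk -l' reports. A quick way to evaluate them (the helper function and the example sector values are illustrative, not from the thread):

```shell
#!/bin/sh
# Check whether a partition's start LBA (in 512-byte sectors, as
# printed by 'fdisk -lu') sits on a 4KiB boundary of an Advanced
# Format drive: the sector number must be divisible by 8.
aligned() {
    if [ $(( $1 % 8 )) -eq 0 ]; then
        echo "start sector $1: 4KiB-aligned"
    else
        echo "start sector $1: MISALIGNED (writes risk an internal RMW)"
    fi
}

aligned 2048   # modern 1MiB default boundary: aligned
aligned 63     # old DOS/CHS default: misaligned

# On the live system (device names as used in this thread):
#   for d in sdc sdd sde sdf sdg; do fdisk -lu /dev/$d; done
```

If all five start sectors turn out aligned, the remaining suspect is dm-crypt itself; watching whether a single kcryptd kernel thread pins one core during the write phase (e.g. with top -H) would help distinguish the two causes.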
-- Stan > Bye, > Kevin > > > ################################# > 1) 15GB RAID6 volume, stripe 256: >> read: (groupid=0, jobs=4): err= 0: pid=3959 >> read : io=4092.0MB, bw=217900KB/s, iops=141 , runt= 19230msec >> slat (usec): min=190 , max=1325 , avg=231.32, stdev=77.13 >> clat (msec): min=7 , max=326 , avg=111.45, stdev=67.48 >> lat (msec): min=7 , max=326 , avg=111.69, stdev=67.48 >> bw (KB/s) : min=29425, max=109280, per=24.98%, avg=54423.27, stdev=15887.11 >> cpu : usr=0.03%, sys=0.90%, ctx=2614, majf=0, minf=6233 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 10=0.66%, 20=4.84%, 50=16.35%, 100=25.70%, 250=47.91% >> lat (msec): 500=4.55% >> write: (groupid=1, jobs=4): err= 0: pid=3964 >> write: io=4092.0MB, bw=54044KB/s, iops=35 , runt= 77533msec >> slat (msec): min=9 , max=339 , avg=113.48, stdev=36.86 >> clat (msec): min=2 , max=744 , avg=339.51, stdev=70.02 >> lat (msec): min=23 , max=927 , avg=453.00, stdev=82.81 >> bw (KB/s) : min= 5475, max=20756, per=24.92%, avg=13467.27, stdev=2107.66 >> cpu : usr=0.58%, sys=6.26%, ctx=343060, majf=0, minf=550 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 4=0.04%, 10=0.04%, 50=0.11%, 100=0.26%, 250=5.94% >> lat (msec): 500=91.50%, 750=2.13% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=217899KB/s, minb=223129KB/s, maxb=223129KB/s, mint=19230msec, maxt=19230msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=54044KB/s, minb=55341KB/s, maxb=55341KB/s, mint=77533msec, maxt=77533msec >> >> Disk stats 
(read/write): >> md1: ios=10912/10904, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=22836/29872, aggrmerge=364645/574313, aggrticks=216408/101340, aggrin_queue=317789, aggrutil=69.36% >> sdc: ios=23542/29233, merge=377270/563331, ticks=217520/152300, in_queue=369860, util=67.99% >> sdd: ios=23462/28930, merge=378915/557686, ticks=214104/142224, in_queue=356428, util=67.81% >> sde: ios=22264/29920, merge=355831/576114, ticks=149184/99724, in_queue=248924, util=58.22% >> sdf: ios=22240/30724, merge=352471/588412, ticks=126740/52916, in_queue=179696, util=55.44% >> sdg: ios=22676/30557, merge=358741/586025, ticks=374496/59540, in_queue=434040, util=69.36% > > ################################# > 2) 15GB RAID6 array, stripe 1024: >> read: (groupid=0, jobs=4): err= 0: pid=4334 >> read : io=4092.0MB, bw=237702KB/s, iops=154 , runt= 17628msec >> slat (usec): min=196 , max=1469 , avg=232.84, stdev=88.56 >> clat (msec): min=6 , max=391 , avg=101.62, stdev=56.61 >> lat (msec): min=7 , max=391 , avg=101.85, stdev=56.61 >> bw (KB/s) : min=26482, max=107305, per=24.94%, avg=59289.94, stdev=14153.87 >> cpu : usr=0.04%, sys=0.98%, ctx=2581, majf=0, minf=6241 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 10=0.22%, 20=3.08%, 50=16.83%, 100=31.01%, 250=47.10% >> lat (msec): 500=1.76% >> write: (groupid=1, jobs=4): err= 0: pid=4340 >> write: io=4092.0MB, bw=120759KB/s, iops=78 , runt= 34699msec >> slat (msec): min=1 , max=156 , avg=47.70, stdev=27.94 >> clat (msec): min=15 , max=520 , avg=152.37, stdev=53.34 >> lat (msec): min=28 , max=564 , avg=200.07, stdev=57.63 >> bw (KB/s) : min= 0, max=62270, per=25.04%, avg=30239.54, stdev=5435.51 >> cpu : usr=1.34%, sys=5.68%, ctx=134184, majf=0, minf=94 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 
8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 20=0.04%, 50=1.10%, 100=12.02%, 250=82.18%, 500=4.62% >> lat (msec): 750=0.04% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=237701KB/s, minb=243406KB/s, maxb=243406KB/s, mint=17628msec, maxt=17628msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=120758KB/s, minb=123656KB/s, maxb=123656KB/s, mint=34699msec, maxt=34699msec >> >> Disk stats (read/write): >> md1: ios=10912/10901, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4062/6768, aggrmerge=109645/369712, aggrticks=157952/111924, aggrin_queue=270487, aggrutil=89.91% >> sdc: ios=4070/6893, merge=109432/378682, ticks=173784/179392, in_queue=353780, util=89.88% >> sdd: ios=4041/6961, merge=110017/380798, ticks=164960/160236, in_queue=325820, util=89.91% >> sde: ios=4039/6727, merge=110658/366163, ticks=169804/166052, in_queue=336484, util=88.98% >> sdf: ios=4080/6591, merge=108922/359343, ticks=39460/24636, in_queue=64820, util=52.33% >> sdg: ios=4081/6668, merge=109197/363574, ticks=241752/29308, in_queue=271532, util=73.63% > > ################################# > 3) 15GB RAID6 array, stripe 2048: >> read: (groupid=0, jobs=4): err= 0: pid=4102 >> read : io=4092.0MB, bw=207827KB/s, iops=135 , runt= 20162msec >> slat (usec): min=122 , max=3904 , avg=230.75, stdev=99.04 >> clat (msec): min=5 , max=363 , avg=117.03, stdev=70.26 >> lat (msec): min=5 , max=363 , avg=117.27, stdev=70.25 >> bw (KB/s) : min=26533, max=104239, per=25.00%, avg=51947.24, stdev=13121.27 >> cpu : usr=0.05%, sys=0.83%, ctx=2640, majf=0, minf=6241 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 10=0.55%, 20=4.99%, 50=14.96%, 100=21.33%, 250=53.01% >> lat (msec): 500=5.17% >> write: (groupid=1, jobs=4): err= 0: pid=4108 >> write: io=4092.0MB, bw=145833KB/s, iops=94 , runt= 28733msec >> slat (usec): min=768 , max=94184 , avg=1838.20, stdev=4140.24 >> clat (msec): min=15 , max=1155 , avg=145.02, stdev=128.70 >> lat (msec): min=16 , max=1156 , avg=146.86, stdev=128.77 >> bw (KB/s) : min= 6144, max=104448, per=29.22%, avg=42617.70, stdev=17118.84 >> cpu : usr=1.63%, sys=2.52%, ctx=4086, majf=0, minf=94 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 20=0.37%, 50=9.02%, 100=38.42%, 250=39.22%, 500=10.12% >> lat (msec): 750=2.09%, 1000=0.59%, 2000=0.18% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s, mint=20162msec, maxt=20162msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s, mint=28733msec, maxt=28733msec >> >> Disk stats (read/write): >> md1: ios=10912/10873, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3018/4990, aggrmerge=79606/371231, aggrticks=152741/83319, aggrin_queue=236284, aggrutil=92.45% >> sdc: ios=2988/5046, merge=79033/376737, ticks=129600/103552, in_queue=233412, util=84.14% >> sdd: ios=3010/5140, merge=77101/390954, ticks=119032/103236, in_queue=222536, util=85.87% >> sde: ios=3039/5009, merge=79935/374563, ticks=196200/159792, in_queue=356236, util=92.45% >> sdf: ios=3021/4883, merge=80483/356643, ticks=29976/24956, in_queue=55100, util=49.53% >> sdg: ios=3033/4872, merge=81481/357261, ticks=288900/25060, in_queue=314136, util=76.17% > > ################################# > 4) 15GB RAID6 
array, stripe 4096: >> read: (groupid=0, jobs=4): err= 0: pid=4411 >> read : io=4092.0MB, bw=214938KB/s, iops=139 , runt= 19495msec >> slat (usec): min=192 , max=1468 , avg=226.97, stdev=87.18 >> clat (msec): min=4 , max=339 , avg=112.51, stdev=63.05 >> lat (msec): min=4 , max=339 , avg=112.73, stdev=63.05 >> bw (KB/s) : min= 0, max=96000, per=24.87%, avg=53457.49, stdev=13728.88 >> cpu : usr=0.02%, sys=0.89%, ctx=2591, majf=0, minf=6241 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 10=0.15%, 20=4.62%, 50=13.56%, 100=26.43%, 250=52.20% >> lat (msec): 500=3.04% >> write: (groupid=1, jobs=4): err= 0: pid=4419 >> write: io=4092.0MB, bw=146526KB/s, iops=95 , runt= 28597msec >> slat (usec): min=822 , max=36057 , avg=1585.60, stdev=1174.08 >> clat (msec): min=14 , max=1132 , avg=142.49, stdev=126.05 >> lat (msec): min=15 , max=1134 , avg=144.08, stdev=126.08 >> bw (KB/s) : min= 0, max=83836, per=28.88%, avg=42315.69, stdev=18036.84 >> cpu : usr=1.42%, sys=2.85%, ctx=2937, majf=0, minf=94 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 20=0.18%, 50=10.08%, 100=37.61%, 250=40.62%, 500=8.50% >> lat (msec): 750=2.46%, 1000=0.44%, 2000=0.11% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=214937KB/s, minb=220096KB/s, maxb=220096KB/s, mint=19495msec, maxt=19495msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=146526KB/s, minb=150042KB/s, maxb=150042KB/s, mint=28597msec, maxt=28597msec >> >> Disk stats (read/write): >> md1: ios=10912/10895, merge=0/0, ticks=0/0, 
in_queue=0, util=0.00%, aggrios=3013/5053, aggrmerge=78083/371841, aggrticks=151707/88554, aggrin_queue=240554, aggrutil=92.72% >> sdc: ios=2981/5124, merge=77153/379556, ticks=142800/116248, in_queue=259356, util=87.13% >> sdd: ios=2990/5182, merge=75443/390196, ticks=128184/107752, in_queue=236264, util=86.87% >> sde: ios=3034/5057, merge=77882/372755, ticks=185048/159352, in_queue=344636, util=92.72% >> sdf: ios=3035/4941, merge=79423/356692, ticks=29196/28100, in_queue=57520, util=50.56% >> sdg: ios=3026/4963, merge=80518/360009, ticks=273308/31320, in_queue=304996, util=76.24% > > ################################################### > 5) 3TB LUKS volume (no raid, just one single disk): >> read: (groupid=0, jobs=4): err= 0: pid=31359 >> read : io=4092.0MB, bw=125032KB/s, iops=81 , runt= 33513msec >> slat (usec): min=86 , max=1404 , avg=125.10, stdev=97.21 >> clat (msec): min=26 , max=323 , avg=195.52, stdev=39.04 >> lat (msec): min=26 , max=323 , avg=195.65, stdev=39.03 >> bw (KB/s) : min=21982, max=41890, per=25.12%, avg=31405.62, stdev=3662.77 >> cpu : usr=0.02%, sys=0.31%, ctx=2738, majf=0, minf=6233 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 50=0.51%, 100=2.38%, 250=93.07%, 500=4.03% >> write: (groupid=1, jobs=4): err= 0: pid=31364 >> write: io=4092.0MB, bw=158193KB/s, iops=102 , runt= 26488msec >> slat (usec): min=181 , max=673 , avg=442.78, stdev=67.26 >> clat (msec): min=21 , max=555 , avg=154.33, stdev=45.35 >> lat (msec): min=21 , max=555 , avg=154.77, stdev=45.35 >> bw (KB/s) : min=19097, max=45988, per=25.04%, avg=39617.62, stdev=5714.79 >> cpu : usr=1.00%, sys=0.25%, ctx=2737, majf=0, minf=86 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 
16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 50=0.44%, 100=1.54%, 250=94.61%, 500=3.15%, 750=0.26% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s, mint=33513msec, maxt=33513msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s, mint=26488msec, maxt=26488msec >> >> Disk stats (read/write): >> dm-1: ios=8184/8196, merge=0/0, ticks=1521200/1235264, in_queue=2759104, util=99.58%, aggrios=8184/8225, aggrmerge=0/31, aggrticks=1421660/1137984, aggrin_queue=2559612, aggrutil=99.49% >> sda: ios=8184/8225, merge=0/31, ticks=1421660/1137984, in_queue=2559612, util=99.49% > > ############################## > 6) 6TB LUKS array, stripe 256: >> read: (groupid=0, jobs=4): err= 0: pid=31121 >> read : io=4092.0MB, bw=137700KB/s, iops=89 , runt= 30430msec >> slat (usec): min=123 , max=1353 , avg=157.35, stdev=91.53 >> clat (msec): min=15 , max=545 , avg=148.80, stdev=61.30 >> lat (msec): min=15 , max=545 , avg=148.96, stdev=61.29 >> bw (KB/s) : min= 0, max=129024, per=29.75%, avg=40959.72, stdev=21470.47 >> cpu : usr=0.01%, sys=0.51%, ctx=2797, majf=0, minf=6241 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 20=0.29%, 50=9.90%, 100=12.76%, 250=75.07%, 500=1.58% >> lat (msec): 750=0.40% >> write: (groupid=1, jobs=4): err= 0: pid=31131 >> write: io=4092.0MB, bw=41225KB/s, iops=26 , runt=101643msec >> slat (usec): min=197 , max=1309 , avg=557.69, stdev=124.14 >> clat (msec): min=42 , max=2648 , avg=586.07, stdev=393.18 >> lat (msec): min=43 , max=2649 , avg=586.63, stdev=393.19 >> bw (KB/s) : 
min= 0, max=35583, per=26.36%, avg=10867.00, stdev=5232.75 >> cpu : usr=0.33%, sys=0.09%, ctx=2699, majf=0, minf=94 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 50=0.07%, 100=1.39%, 250=18.55%, 500=31.82%, 750=20.49% >> lat (msec): 1000=12.57%, 2000=14.66%, >=2000=0.44% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=137699KB/s, minb=141004KB/s, maxb=141004KB/s, mint=30430msec, maxt=30430msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=41224KB/s, minb=42214KB/s, maxb=42214KB/s, mint=101643msec, maxt=101643msec >> >> Disk stats (read/write): >> dm-0: ios=10912/10906, merge=0/0, ticks=1385524/5596828, in_queue=6983364, util=99.89%, aggrios=10912/10940, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >> md2: ios=10912/10940, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20710/27494, aggrmerge=358521/573067, aggrticks=208334/119186, aggrin_queue=327551, aggrutil=61.95% >> sdc: ios=21005/27276, merge=366645/567533, ticks=224772/179916, in_queue=404728, util=61.95% >> sdd: ios=21154/26926, merge=369215/564213, ticks=212876/160624, in_queue=373552, util=60.42% >> sde: ios=20460/27449, merge=354440/571659, ticks=127008/102188, in_queue=229220, util=48.28% >> sdf: ios=20464/27908, merge=350591/581422, ticks=241608/78180, in_queue=319784, util=55.24% >> sdg: ios=20470/27915, merge=351718/580509, ticks=235408/75024, in_queue=310472, util=55.91% > > ############################### > 7) 6TB LUKS array, stripe 2048: >> read: (groupid=0, jobs=4): err= 0: pid=30910 >> read : io=4092.0MB, bw=237500KB/s, iops=154 , runt= 17643msec >> slat (usec): min=134 , max=1511 , avg=165.25, stdev=96.45 >> clat (msec): min=16 , max=859 , avg=101.07, stdev=61.40 >> lat (msec): min=17 , max=859 , 
avg=101.23, stdev=61.40 >> bw (KB/s) : min=32807, max=97523, per=25.40%, avg=60325.02, stdev=15396.63 >> cpu : usr=0.02%, sys=0.76%, ctx=2709, majf=0, minf=6233 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=2728/0/0, short=0/0/0 >> >> lat (msec): 20=0.07%, 50=13.38%, 100=48.90%, 250=34.38%, 500=3.08% >> lat (msec): 750=0.15%, 1000=0.04% >> write: (groupid=1, jobs=4): err= 0: pid=30915 >> write: io=4092.0MB, bw=90652KB/s, iops=59 , runt= 46223msec >> slat (usec): min=173 , max=983 , avg=599.60, stdev=141.06 >> clat (msec): min=34 , max=1185 , avg=258.48, stdev=135.97 >> lat (msec): min=34 , max=1185 , avg=259.08, stdev=135.98 >> bw (KB/s) : min= 0, max=88722, per=25.95%, avg=23524.06, stdev=9036.31 >> cpu : usr=0.77%, sys=0.22%, ctx=2576, majf=0, minf=86 >> IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w/d: total=0/2728/0, short=0/0/0 >> >> lat (msec): 50=0.48%, 100=7.00%, 250=49.01%, 500=37.72%, 750=5.10% >> lat (msec): 1000=0.59%, 2000=0.11% >> >> Run status group 0 (all jobs): >> READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s, mint=17643msec, maxt=17643msec >> >> Run status group 1 (all jobs): >> WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s, mint=46223msec, maxt=46223msec >> >> Disk stats (read/write): >> dm-0: ios=10912/10875, merge=0/0, ticks=544880/2481824, in_queue=3026996, util=99.57%, aggrios=10912/10930, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >> md2: ios=10912/10930, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5058/6798, aggrmerge=153871/404343, aggrticks=195870/113487, aggrin_queue=309447, 
aggrutil=92.15% >> sdc: ios=5024/7100, merge=152571/421782, ticks=332612/204936, in_queue=537660, util=92.15% >> sdd: ios=4961/7109, merge=150591/422166, ticks=257308/161200, in_queue=418608, util=88.11% >> sde: ios=5019/6697, merge=154781/395583, ticks=226120/152068, in_queue=378296, util=85.49% >> sdf: ios=5118/6463, merge=156068/385905, ticks=65716/23932, in_queue=89724, util=55.85% >> sdg: ios=5172/6624, merge=155347/396280, ticks=97596/25300, in_queue=122948, util=60.00% > > > > > > Am 22.12.2013 09:12, schrieb Stan Hoeppner: >> On 12/21/2013 8:35 PM, Kevin Richter wrote: >>> Okay, well, I've run your fio config... but with so many results and >> >> I'm glad you changed to size=1G instead of size=10G lest the test would >> have read/written 80GB instead of 8GB, and taken forever. I often >> forget the size parameter is per job (thread) not global. >> >>> abbrevations I currently feel a bit overchallenged :) >>> So, please dont take badly, if I just paste the whole results. >>> I have tried with a stripe cache size of 1024, 2048 and 4096. >> >> The full output is good. Going from 1024 to 2048 gained you ~17MB/s >> throughput with this job file. But you didn't run with the default 256 >> for comparison, which is very useful to know. That's ok, as you can do >> so with the next set of test runs. Given the huge latencies I'm >> thinking this config may be seeking the disks too much. Modify your fio >> job file as follows: >> >> numjobs=4 >> blocksize=1536k >> iodepth=4 >> size=1g >> >> as this may yield a little better results. ~190MB/s read and 75MB/s >> write is very low for three effective 7.2K RPM SATA spindles, especially >> the write throughput. 
Compare to a similar FIO job with 2 threads, >> iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive, NCQ >> disabled, w/deadline elevator: >> >> Run status group 0 (all jobs): >> READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s >> Run status group 1 (all jobs): >> WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s >> >> 88MB/s read, 108MB/s write. Write is faster, apparently due to the small >> 16MB drive write cache. With your eight 2.4GHz SandyBridge cores and >> md-RAID6 on much newer, larger 7.2K drives w/larger caches, you should >> have significantly greater write speed than this old single SATA drive. >> Currently you're showing 30MB/s lower write throughput, some 40% >> slower than this single drive. With aligned writes, thus no RMW, you >> should be getting at least 2 spindles' worth of throughput, or ~200MB/s. >> >> So, run FIO again with the mods above, once with stripe_cache_size 256 >> and once with 2048. Something seems to be seriously dragging down your >> throughput. I'm sure you'd like to know what, and how to fix it. >> >>> Btw, I also have /dev/md1, a 15GB unencrypted array using the >>> same 5 disks as my LUKS-md2 array, so nearly the same, just without >>> the LUKS layer. If helpful, I can run some fio tests on this >>> filesystem, too. >> >> This is also a RAID6 array, without dm-crypt? If so this will afford a >> great comparison, and may be very informative. Run FIO against this >> filesystem (I assume it's also XFS) as well, with stripe_cache_size 512 >> and 2048. Four FIO tests total, including the two on the big array. >> >> It'll be interesting to see where this leads. At this point the >> candidates I see are a single dm-crypt thread saturating a core, dm-crypt >> taking aligned writes and misaligning them so that every write becomes an >> RMW, or AF drives with misaligned partitions causing every write to be >> an internal RMW. Or maybe something I've not considered.
>> >> >>> >>> So long :) >>> Kevin >>> >>> >>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size >>>> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s] >>>> read: (groupid=0, jobs=8): err= 0: pid=12987 >>>> read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec >>>> slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66 >>>> clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18 >>>> lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18 >>>> bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29 >>>> cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982 >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> issued r/w/d: total=163840/0/0, short=0/0/0 >>>> >>>> lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59% >>>> lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25% >>>> write: (groupid=1, jobs=8): err= 0: pid=13202 >>>> write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec >>>> slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68 >>>> clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82 >>>> lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82 >>>> bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35 >>>> cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410 >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> issued r/w/d: total=0/163840/0, short=0/0/0 >>>> >>>> lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20% >>>> lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64% >>>> >>>> Run status group 0 (all jobs): >>>> READ: io=81920MB, 
aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec >>>> >>>> Run status group 1 (all jobs): >>>> WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec >>>> >>>> Disk stats (read/write): >>>> dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >>>> md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78% >>>> sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78% >>>> sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42% >>>> sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90% >>>> sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36% >>>> sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26% >>> >>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size >>>> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s] >>>> read: (groupid=0, jobs=8): err= 0: pid=6392 >>>> read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec >>>> slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00 >>>> clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76 >>>> lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76 >>>> bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64 >>>> cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171 >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
>=64=0.0% >>>> issued r/w/d: total=163840/0/0, short=0/0/0 >>>> >>>> lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36% >>>> lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35% >>>> write: (groupid=1, jobs=8): err= 0: pid=6663 >>>> write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec >>>> slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22 >>>> clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37 >>>> lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82 >>>> bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80 >>>> cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531 >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> issued r/w/d: total=0/163840/0, short=0/0/0 >>>> >>>> lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64% >>>> lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23% >>>> >>>> Run status group 0 (all jobs): >>>> READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec >>>> >>>> Run status group 1 (all jobs): >>>> WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec >>>> >>>> Disk stats (read/write): >>>> dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >>>> md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20% >>>> sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77% >>>> sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44% >>>> 
sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20% >>>> sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89% >>>> sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96% >>> >>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size >>>> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s] >>>> read: (groupid=0, jobs=8): err= 0: pid=11787 >>>> read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec >>>> slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74 >>>> clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14 >>>> lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14 >>>> bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71 >>>> cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313 >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> issued r/w/d: total=163840/0/0, short=0/0/0 >>>> >>>> lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56% >>>> lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33% >>>> write: (groupid=1, jobs=8): err= 0: pid=12060 >>>> write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec >>>> slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87 >>>> clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56 >>>> lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56 >>>> bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28 >>>> cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723 >>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% >>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
>=64=0.0% >>>> issued r/w/d: total=0/163840/0, short=0/0/0 >>>> >>>> lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37% >>>> lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16% >>>> >>>> Run status group 0 (all jobs): >>>> READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec >>>> >>>> Run status group 1 (all jobs): >>>> WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec >>>> >>>> Disk stats (read/write): >>>> dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% >>>> md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60% >>>> sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57% >>>> sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08% >>>> sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86% >>>> sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60% >>>> sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84% > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
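[Of the candidate causes Stan lists (a saturated dm-crypt thread, dm-crypt misaligning writes, misaligned partitions on Advanced Format drives), the last one is quick to rule out from sysfs. A minimal sketch, not from the thread itself; it assumes 4 KiB physical sectors and that the partition `start` attribute is reported in 512-byte units, with device names matching those used above:]

```shell
# A partition start (given in 512-byte sectors) is 4 KiB aligned when it
# is divisible by 8; a misaligned start turns every write into an
# internal read-modify-write on Advanced Format drives.
sector_aligned() { [ $(( $1 % 8 )) -eq 0 ]; }

for part in /sys/block/sd[c-g]/sd[c-g]1/start; do
    [ -e "$part" ] || continue
    if sector_aligned "$(cat "$part")"; then
        echo "$part: aligned"
    else
        echo "$part: MISALIGNED"
    fi
done
```

[A modern partitioner starting the first partition at sector 2048 passes this check; the legacy DOS offset of sector 63 does not.]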
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-20 10:26 ` Kevin Richter 2013-12-20 12:36 ` Stan Hoeppner @ 2013-12-20 22:43 ` Arkadiusz Miśkiewicz 1 sibling, 0 replies; 22+ messages in thread From: Arkadiusz Miśkiewicz @ 2013-12-20 22:43 UTC (permalink / raw) To: xfs, xfs; +Cc: linux-raid, stan On Friday 20 of December 2013, Kevin Richter wrote: > >> $ cat /sys/block/md2/md/stripe_cache_size > >> 256 > > > > 256 is the default and it is way too low. This is limiting your write > > throughput. Increase this to a minimum of 1024, which will give you a > > 20MB stripe cache buffer. This should become active immediately. Add > > it to a startup script to make it permanent. > > $ echo 256 > /sys/block/md2/md/stripe_cache_size > > $ time cp -a /olddisk/testfolder /6tb/foo1/ > > real 25m38.925s > > user 0m0.595s > > sys 1m23.182s > > > > $ echo 1024 > /sys/block/md2/md/stripe_cache_size > > $ time cp -a /olddisk/testfolder /raid/foo2/ > > real 7m32.824s > > user 0m0.438s > > sys 1m6.759s > > > > $ echo 2048 > /sys/block/md2/md/stripe_cache_size > > $ time cp -a /olddisk/testfolder /raid/foo3/ > > real 5m32.847s > > user 0m0.418s > > sys 1m5.671s > > > > $ echo 4096 > /sys/block/md2/md/stripe_cache_size > > $ time cp -a /olddisk/testfolder /raid/foo4/ > > real 5m54.554s > > user 0m0.437s > > sys 1m6.268s > > The difference is really amazing! So 2048 seems to be the best choice. > 60GB in 5.5 minutes is 180MB/s. That sounds a bit high, doesn't it? > The RAID only consists of 5 SATA disks at 7200rpm. I wonder why the kernel ships defaults that everyone repeatedly recommends changing or increasing. Has anyone tried to file a bug report for the stripe_cache_size case? -- Arkadiusz Miśkiewicz, arekm / maven.pl _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
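[Stan's "20MB stripe cache buffer" figure above follows from how md accounts for this cache: stripe_cache_size is counted in pages (4 KiB) per member device. A quick sketch of that arithmetic; the helper name is mine, not from the thread:]

```shell
# RAM consumed by the md stripe cache, in KiB:
#   stripe_cache_size entries x 4 KiB page x number of member disks
stripe_cache_kib() { echo $(( $1 * 4 * $2 )); }

stripe_cache_kib 1024 5   # 20480 KiB: the quoted "20MB"
stripe_cache_kib 2048 5   # 40960 KiB (~40MB) for the value Kevin settled on
```

[This makes the trade-off behind the low default explicit: each doubling of stripe_cache_size doubles the cache's fixed RAM footprint on this 5-disk array, which is likely why the kernel leaves the aggressive setting to the administrator.]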
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-18 2:05 Kevin Richter 2013-12-18 3:38 ` Stan Hoeppner @ 2013-12-18 8:33 ` Stefan Ring 2013-12-18 22:21 ` Dave Chinner 2 siblings, 0 replies; 22+ messages in thread From: Stefan Ring @ 2013-12-18 8:33 UTC (permalink / raw) To: xfs; +Cc: Linux fs XFS > Even a reboot is now not possible anymore, because the whole system > hangs while executing the "sync" command in one of the shutdown scripts. This is unusual. How long have you waited? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: XFS blocked task in xlog_cil_force_lsn 2013-12-18 2:05 Kevin Richter 2013-12-18 3:38 ` Stan Hoeppner 2013-12-18 8:33 ` Stefan Ring @ 2013-12-18 22:21 ` Dave Chinner 2 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2013-12-18 22:21 UTC (permalink / raw) To: Kevin Richter; +Cc: xfs On Wed, Dec 18, 2013 at 03:05:51AM +0100, Kevin Richter wrote: > Hi, > > around April 2012 there was a similar thread on this list which I have > found via Google, so my mail topic is the same. > > I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying > under heavy load there are always these blocks. At the bottom of this > message I have included some line from the syslog. > > Even a reboot is now not possible anymore, because the whole system > hangs while executing the "sync" command in one of the shutdown scripts. Of course it will hang - it's waiting for the filesystem to make progress, and clearly it isn't.... > > So... first I have thought that my disks are faulty. > But with smartmontools I have started a short and a long test on all of > the 5 disks: no errors > > Then I have even recreated the whole array, but no improvement. > > Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS Oh, that's quite old. I'd suggest upgrading to a much more recent kernel as we've fixed lots of issues in this area since then. > > kswapd0 D ffffffff81806240 0 51 2 0x00000000 > > ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78 > > ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0 > > ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28 > > Call Trace: > > [<ffffffff8165f62f>] schedule+0x3f/0x60 > > [<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150 > > [<ffffffff8166004a>] mutex_lock+0x2a/0x50 > > [<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs] So, reclaim is stalled, meaning there there's other threads blocked reclaiming inodes.... 
> > INFO: task kswapd1:52 blocked for more than 120 seconds. > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > kswapd1 D 0000000000000007 0 52 2 0x00000000 > > ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff > > ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0 > > ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004 > > Call Trace: > > [<ffffffff8165f62f>] schedule+0x3f/0x60 > > [<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs] > > [<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs] > > [<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs] > > [<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs] > > [<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs] > > [<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs] Yup, there's the stalled reclaim, waiting on a log force in progress to complete. > > Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds. > > Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0 D 0000000000000001 0 19192 2 0x00000000 > > Dec 18 02:19:19 kernel: [390470.200268] ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000 > > Dec 18 02:19:19 kernel: [390470.200274] ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0 > > Dec 18 02:19:19 kernel: [390470.200281] ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080 > > Dec 18 02:19:19 kernel: [390470.200288] Call Trace: > > Dec 18 02:19:19 kernel: [390470.200293] [<ffffffff8165f62f>] schedule+0x3f/0x60 > > Dec 18 02:19:19 kernel: [390470.200323] [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs] > > Dec 18 02:19:19 kernel: [390470.200330] [<ffffffff810608e0>] ? 
try_to_wake_up+0x200/0x200 > > Dec 18 02:19:19 kernel: [390470.200358] [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs] > > Dec 18 02:19:19 kernel: [390470.200382] [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs] > > Dec 18 02:19:19 kernel: [390470.200410] [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs] > > Dec 18 02:19:19 kernel: [390470.200438] [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs] > > Dec 18 02:19:19 kernel: [390470.200446] [<ffffffff810136e5>] ? __switch_to+0xf5/0x360 > > Dec 18 02:19:19 kernel: [390470.200473] [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs] > > Dec 18 02:19:19 kernel: [390470.200481] [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130 > > Dec 18 02:19:19 kernel: [390470.200507] [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs] > > Dec 18 02:19:19 kernel: [390470.200535] [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs] > > Dec 18 02:19:19 kernel: [390470.200540] [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320 > > Dec 18 02:19:19 kernel: [390470.200568] [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs] And there's the log force that is in progress, and that is waiting on log buffer space to become available to finish the log force. That's waiting on IO completion to occur. IOWs, There's nothing that points directly at an XFS issue here. There might be a problem, but it's just as likely a hardware or IO path bug that is preventing an IO completion from occurring and that is what XFS is getting hung up on.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
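[The hung-task watchdog reports quoted throughout this thread share a fixed format, which makes logs like these easy to triage in bulk before diving into individual stacks. A small sketch (the helper names are mine, not from the thread) that counts the reports and lists the distinct tasks they name, e.g. kswapd1 and xfsaild/dm-0 in the excerpts above:]

```shell
# Count hung-task watchdog reports in a log file.
count_hung_tasks() {
    grep -c 'blocked for more than' "$1"
}

# Extract the distinct task names from lines like:
#   INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds.
hung_task_names() {
    sed -n 's/.*INFO: task \(.*\):[0-9]* blocked for more than.*/\1/p' "$1" | sort -u
}
```

[Seeing the same few tasks recur, all stuck under xlog_cil_force_lsn, is what points at a single choke point (here, the XFS log waiting on IO completion) rather than many independent hangs.]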
end of thread, other threads:[~2013-12-22 21:01 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-04-16 6:48 XFS blocked task in xlog_cil_force_lsn Stefan Priebe - Profihost AG 2012-04-16 23:56 ` Dave Chinner 2012-04-17 8:19 ` Stefan Priebe - Profihost AG 2012-04-17 8:32 ` Dave Chinner 2012-04-17 9:19 ` Stefan Priebe - Profihost AG 2013-12-18 2:05 Kevin Richter 2013-12-18 3:38 ` Stan Hoeppner 2013-12-18 10:27 ` Kevin Richter 2013-12-19 14:11 ` Stan Hoeppner 2013-12-20 10:26 ` Kevin Richter 2013-12-20 12:36 ` Stan Hoeppner 2013-12-21 5:30 ` Dave Chinner 2013-12-22 9:18 ` Stan Hoeppner 2013-12-22 20:14 ` Dave Chinner 2013-12-22 21:01 ` Michael L. Semon 2013-12-22 2:35 ` Kevin Richter 2013-12-22 8:12 ` Stan Hoeppner 2013-12-22 14:10 ` Kevin Richter 2013-12-22 17:29 ` Stan Hoeppner 2013-12-20 22:43 ` Arkadiusz Miśkiewicz 2013-12-18 8:33 ` Stefan Ring 2013-12-18 22:21 ` Dave Chinner