* XFS blocked task in xlog_cil_force_lsn
@ 2013-12-18  2:05 Kevin Richter
  2013-12-18  3:38 ` Stan Hoeppner
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Kevin Richter @ 2013-12-18  2:05 UTC (permalink / raw)
  To: xfs

Hi,

Around April 2012 there was a similar thread on this list, which I found
via Google, so my subject is the same.

I have a RAID6 array with 5 disks (2TB each, 6TB net). While copying
under heavy load I always get these blocked tasks. At the bottom of this
message I have included some lines from the syslog.

Even a reboot is not possible anymore, because the whole system hangs
while executing the "sync" command in one of the shutdown scripts.

So... first I thought that my disks were faulty, but short and long
smartmontools self-tests on all 5 disks reported no errors.

Then I even recreated the whole array, but there was no improvement.

Details about my server: kernel 3.2.0-57-generic, Ubuntu 12.04.3 LTS
Details about the array: software RAID with mdadm v3.2.5, no hardware
RAID controller in the server

The scheduler of the raid disks:
> $ cat /sys/block/sd[cdefg]/queue/scheduler
> noop deadline [cfq]
> noop deadline [cfq]
> noop deadline [cfq]
> noop deadline [cfq]
> noop deadline [cfq]


Any ideas what I can do?


Thanks,
Kevin



> Dec 18 02:19:19 kernel: [390470.193560] INFO: task kswapd0:51 blocked for more than 120 seconds.
> Dec 18 02:19:19 kernel: [390470.194541] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 02:19:19 kernel: [390470.195795] kswapd0         D ffffffff81806240     0    51      2 0x00000000
> Dec 18 02:19:19 kernel: [390470.195804]  ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78
> Dec 18 02:19:19 kernel: [390470.195813]  ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0
> Dec 18 02:19:19 kernel: [390470.195820]  ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28
> Dec 18 02:19:19 kernel: [390470.195827] Call Trace:
> Dec 18 02:19:19 kernel: [390470.195844]  [<ffffffff8165f62f>] schedule+0x3f/0x60
> Dec 18 02:19:19 kernel: [390470.195850]  [<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150
> Dec 18 02:19:19 kernel: [390470.195856]  [<ffffffff8166004a>] mutex_lock+0x2a/0x50
> Dec 18 02:19:19 kernel: [390470.195904]  [<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs]
> Dec 18 02:19:19 kernel: [390470.195928]  [<ffffffffa029d83a>] ? __xfs_inode_set_reclaim_tag+0x4a/0x140 [xfs]
> Dec 18 02:19:19 kernel: [390470.195951]  [<ffffffffa029d96d>] ? xfs_inode_set_reclaim_tag+0x3d/0xa0 [xfs]
> Dec 18 02:19:19 kernel: [390470.195973]  [<ffffffffa029d9b4>] ? xfs_inode_set_reclaim_tag+0x84/0xa0 [xfs]
> Dec 18 02:19:19 kernel: [390470.195995]  [<ffffffffa029c6a5>] ? xfs_fs_destroy_inode+0x65/0x150 [xfs]
> Dec 18 02:19:19 kernel: [390470.196003]  [<ffffffff81194dfc>] ? destroy_inode+0x3c/0x70
> Dec 18 02:19:19 kernel: [390470.196008]  [<ffffffff81194f5a>] ? evict+0x12a/0x1c0
> Dec 18 02:19:19 kernel: [390470.196016]  [<ffffffff8166154e>] ? _raw_spin_lock+0xe/0x20
> Dec 18 02:19:19 kernel: [390470.196037]  [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> Dec 18 02:19:19 kernel: [390470.196059]  [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
> Dec 18 02:19:19 kernel: [390470.196067]  [<ffffffff8117d71e>] prune_super+0x10e/0x1a0
> Dec 18 02:19:19 kernel: [390470.196073]  [<ffffffff8112b924>] shrink_slab+0x154/0x300
> Dec 18 02:19:19 kernel: [390470.196079]  [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0
> Dec 18 02:19:19 kernel: [390470.196085]  [<ffffffff8112ef01>] kswapd+0x121/0x220
> Dec 18 02:19:19 kernel: [390470.196090]  [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0
> Dec 18 02:19:19 kernel: [390470.196097]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
> Dec 18 02:19:19 kernel: [390470.196105]  [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
> Dec 18 02:19:19 kernel: [390470.196111]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
> Dec 18 02:19:19 kernel: [390470.196117]  [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
> Dec 18 02:19:19 kernel: [390470.196121] INFO: task kswapd1:52 blocked for more than 120 seconds.
> Dec 18 02:19:19 kernel: [390470.197086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 02:19:19 kernel: [390470.198427] kswapd1         D 0000000000000007     0    52      2 0x00000000
> Dec 18 02:19:19 kernel: [390470.198434]  ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff
> Dec 18 02:19:19 kernel: [390470.198441]  ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0
> Dec 18 02:19:19 kernel: [390470.198447]  ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004
> Dec 18 02:19:19 kernel: [390470.198454] Call Trace:
> Dec 18 02:19:19 kernel: [390470.198462]  [<ffffffff8165f62f>] schedule+0x3f/0x60
> Dec 18 02:19:19 kernel: [390470.198498]  [<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs]
> Dec 18 02:19:19 kernel: [390470.198510]  [<ffffffff811643af>] ? unfreeze_partials+0x24f/0x2b0
> Dec 18 02:19:19 kernel: [390470.198521]  [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
> Dec 18 02:19:19 kernel: [390470.198554]  [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198587]  [<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198617]  [<ffffffffa02ba398>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
> Dec 18 02:19:19 kernel: [390470.198646]  [<ffffffffa02ae972>] ? xfs_bmap_search_multi_extents+0x92/0x110 [xfs]
> Dec 18 02:19:19 kernel: [390470.198677]  [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198710]  [<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs]
> Dec 18 02:19:19 kernel: [390470.198740]  [<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198769]  [<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198780]  [<ffffffff8104ef9b>] ? try_wait_for_completion+0x4b/0x60
> Dec 18 02:19:19 kernel: [390470.198806]  [<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198818]  [<ffffffff81314a4c>] ? radix_tree_gang_lookup_tag.part.7+0x6c/0xf0
> Dec 18 02:19:19 kernel: [390470.198844]  [<ffffffffa029dc5f>] xfs_reclaim_inodes_ag+0x23f/0x3a0 [xfs]
> Dec 18 02:19:19 kernel: [390470.198871]  [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> Dec 18 02:19:19 kernel: [390470.198895]  [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
> Dec 18 02:19:19 kernel: [390470.198905]  [<ffffffff8117d71e>] prune_super+0x10e/0x1a0
> Dec 18 02:19:19 kernel: [390470.198913]  [<ffffffff8112b924>] shrink_slab+0x154/0x300
> Dec 18 02:19:19 kernel: [390470.198922]  [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0
> Dec 18 02:19:19 kernel: [390470.198931]  [<ffffffff8112ef01>] kswapd+0x121/0x220
> Dec 18 02:19:19 kernel: [390470.198938]  [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0
> Dec 18 02:19:19 kernel: [390470.198947]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
> Dec 18 02:19:19 kernel: [390470.198956]  [<ffffffff8105f750>] ? __migrate_task+0x120/0x120
> Dec 18 02:19:19 kernel: [390470.198966]  [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
> Dec 18 02:19:19 kernel: [390470.198974]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
> Dec 18 02:19:19 kernel: [390470.198984]  [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
> Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds.
> Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0    D 0000000000000001     0 19192      2 0x00000000
> Dec 18 02:19:19 kernel: [390470.200268]  ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000
> Dec 18 02:19:19 kernel: [390470.200274]  ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0
> Dec 18 02:19:19 kernel: [390470.200281]  ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080
> Dec 18 02:19:19 kernel: [390470.200288] Call Trace:
> Dec 18 02:19:19 kernel: [390470.200293]  [<ffffffff8165f62f>] schedule+0x3f/0x60
> Dec 18 02:19:19 kernel: [390470.200323]  [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs]
> Dec 18 02:19:19 kernel: [390470.200330]  [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
> Dec 18 02:19:19 kernel: [390470.200358]  [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs]
> Dec 18 02:19:19 kernel: [390470.200382]  [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
> Dec 18 02:19:19 kernel: [390470.200410]  [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs]
> Dec 18 02:19:19 kernel: [390470.200438]  [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs]
> Dec 18 02:19:19 kernel: [390470.200446]  [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
> Dec 18 02:19:19 kernel: [390470.200473]  [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs]
> Dec 18 02:19:19 kernel: [390470.200481]  [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130
> Dec 18 02:19:19 kernel: [390470.200507]  [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs]
> Dec 18 02:19:19 kernel: [390470.200535]  [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs]
> Dec 18 02:19:19 kernel: [390470.200540]  [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320
> Dec 18 02:19:19 kernel: [390470.200568]  [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs]
> Dec 18 02:19:19 kernel: [390470.200593]  [<ffffffffa02e7bb0>] ? xfsaild_push+0x680/0x680 [xfs]
> Dec 18 02:19:19 kernel: [390470.200599]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
> Dec 18 02:19:19 kernel: [390470.200605]  [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
> Dec 18 02:19:19 kernel: [390470.200611]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
> Dec 18 02:19:19 kernel: [390470.200617]  [<ffffffff8166bcf0>] ? gs_change+0x13/0x13


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-18  2:05 XFS blocked task in xlog_cil_force_lsn Kevin Richter
@ 2013-12-18  3:38 ` Stan Hoeppner
  2013-12-18 10:27   ` Kevin Richter
  2013-12-18  8:33 ` XFS blocked task in xlog_cil_force_lsn Stefan Ring
  2013-12-18 22:21 ` Dave Chinner
  2 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-18  3:38 UTC (permalink / raw)
  To: xfs, xfs

On 12/17/2013 8:05 PM, Kevin Richter wrote:
> Hi,
> 
> Around April 2012 there was a similar thread on this list, which I found
> via Google, so my subject is the same.
> 
> I have a RAID6 array with 5 disks (2TB each, 6TB net). While copying
> under heavy load I always get these blocked tasks. At the bottom of this
> message I have included some lines from the syslog.
> 
> Even a reboot is not possible anymore, because the whole system hangs
> while executing the "sync" command in one of the shutdown scripts.
> 
> So... first I thought that my disks were faulty, but short and long
> smartmontools self-tests on all 5 disks reported no errors.
> 
> Then I even recreated the whole array, but there was no improvement.
> 
> Details about my server: kernel 3.2.0-57-generic, Ubuntu 12.04.3 LTS
> Details about the array: software RAID with mdadm v3.2.5, no hardware
> RAID controller in the server
> 
> The scheduler of the raid disks:
>> $ cat /sys/block/sd[cdefg]/queue/scheduler
>> noop deadline [cfq]
>> noop deadline [cfq]
>> noop deadline [cfq]
>> noop deadline [cfq]
>> noop deadline [cfq]
> 
> 
> Any ideas what I can do?

Your workload is seeking the disks to death, which is why you're getting
these timeouts.  The actuators simply can't keep up.

1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
suitable for XFS.  This may not be a silver bullet but it will help.

2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
suboptimal for your workload (a combined command sketch for items 1-3
follows this list).  For the latter:

$ cat /sys/block/mdX/md/stripe_cache_size

3.  Post 'xfs_info /dev/mdX'

4.  You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks.  Relocate swap to disks that are
not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
swap on it.  Or install more RAM in the machine.  Going the SSD route is
better as it gives flexibility.  For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space.  This decreases the IOPS load on your rust.

5.  Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
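
For reference, a combined sketch of the commands for items 1-3, assuming
the array is /dev/mdX and its member disks are /dev/sdX (adjust the
device names to your layout):

$ echo deadline > /sys/block/sdX/queue/scheduler   (once per member disk)
$ mdadm --detail /dev/mdX | grep Chunk
$ cat /sys/block/mdX/md/stripe_cache_size
$ xfs_info /dev/mdX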


> 
> Thanks,
> Kevin
> 
> 
> 
>> Dec 18 02:19:19 kernel: [390470.193560] INFO: task kswapd0:51 blocked for more than 120 seconds.
>> Dec 18 02:19:19 kernel: [390470.194541] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 18 02:19:19 kernel: [390470.195795] kswapd0         D ffffffff81806240     0    51      2 0x00000000
>> Dec 18 02:19:19 kernel: [390470.195804]  ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78
>> Dec 18 02:19:19 kernel: [390470.195813]  ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0
>> Dec 18 02:19:19 kernel: [390470.195820]  ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28
>> Dec 18 02:19:19 kernel: [390470.195827] Call Trace:
>> Dec 18 02:19:19 kernel: [390470.195844]  [<ffffffff8165f62f>] schedule+0x3f/0x60
>> Dec 18 02:19:19 kernel: [390470.195850]  [<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150
>> Dec 18 02:19:19 kernel: [390470.195856]  [<ffffffff8166004a>] mutex_lock+0x2a/0x50
>> Dec 18 02:19:19 kernel: [390470.195904]  [<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.195928]  [<ffffffffa029d83a>] ? __xfs_inode_set_reclaim_tag+0x4a/0x140 [xfs]
>> Dec 18 02:19:19 kernel: [390470.195951]  [<ffffffffa029d96d>] ? xfs_inode_set_reclaim_tag+0x3d/0xa0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.195973]  [<ffffffffa029d9b4>] ? xfs_inode_set_reclaim_tag+0x84/0xa0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.195995]  [<ffffffffa029c6a5>] ? xfs_fs_destroy_inode+0x65/0x150 [xfs]
>> Dec 18 02:19:19 kernel: [390470.196003]  [<ffffffff81194dfc>] ? destroy_inode+0x3c/0x70
>> Dec 18 02:19:19 kernel: [390470.196008]  [<ffffffff81194f5a>] ? evict+0x12a/0x1c0
>> Dec 18 02:19:19 kernel: [390470.196016]  [<ffffffff8166154e>] ? _raw_spin_lock+0xe/0x20
>> Dec 18 02:19:19 kernel: [390470.196037]  [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
>> Dec 18 02:19:19 kernel: [390470.196059]  [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
>> Dec 18 02:19:19 kernel: [390470.196067]  [<ffffffff8117d71e>] prune_super+0x10e/0x1a0
>> Dec 18 02:19:19 kernel: [390470.196073]  [<ffffffff8112b924>] shrink_slab+0x154/0x300
>> Dec 18 02:19:19 kernel: [390470.196079]  [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0
>> Dec 18 02:19:19 kernel: [390470.196085]  [<ffffffff8112ef01>] kswapd+0x121/0x220
>> Dec 18 02:19:19 kernel: [390470.196090]  [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0
>> Dec 18 02:19:19 kernel: [390470.196097]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
>> Dec 18 02:19:19 kernel: [390470.196105]  [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
>> Dec 18 02:19:19 kernel: [390470.196111]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
>> Dec 18 02:19:19 kernel: [390470.196117]  [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
>> Dec 18 02:19:19 kernel: [390470.196121] INFO: task kswapd1:52 blocked for more than 120 seconds.
>> Dec 18 02:19:19 kernel: [390470.197086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 18 02:19:19 kernel: [390470.198427] kswapd1         D 0000000000000007     0    52      2 0x00000000
>> Dec 18 02:19:19 kernel: [390470.198434]  ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff
>> Dec 18 02:19:19 kernel: [390470.198441]  ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0
>> Dec 18 02:19:19 kernel: [390470.198447]  ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004
>> Dec 18 02:19:19 kernel: [390470.198454] Call Trace:
>> Dec 18 02:19:19 kernel: [390470.198462]  [<ffffffff8165f62f>] schedule+0x3f/0x60
>> Dec 18 02:19:19 kernel: [390470.198498]  [<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198510]  [<ffffffff811643af>] ? unfreeze_partials+0x24f/0x2b0
>> Dec 18 02:19:19 kernel: [390470.198521]  [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
>> Dec 18 02:19:19 kernel: [390470.198554]  [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198587]  [<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198617]  [<ffffffffa02ba398>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198646]  [<ffffffffa02ae972>] ? xfs_bmap_search_multi_extents+0x92/0x110 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198677]  [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198710]  [<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198740]  [<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198769]  [<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198780]  [<ffffffff8104ef9b>] ? try_wait_for_completion+0x4b/0x60
>> Dec 18 02:19:19 kernel: [390470.198806]  [<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198818]  [<ffffffff81314a4c>] ? radix_tree_gang_lookup_tag.part.7+0x6c/0xf0
>> Dec 18 02:19:19 kernel: [390470.198844]  [<ffffffffa029dc5f>] xfs_reclaim_inodes_ag+0x23f/0x3a0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198871]  [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198895]  [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
>> Dec 18 02:19:19 kernel: [390470.198905]  [<ffffffff8117d71e>] prune_super+0x10e/0x1a0
>> Dec 18 02:19:19 kernel: [390470.198913]  [<ffffffff8112b924>] shrink_slab+0x154/0x300
>> Dec 18 02:19:19 kernel: [390470.198922]  [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0
>> Dec 18 02:19:19 kernel: [390470.198931]  [<ffffffff8112ef01>] kswapd+0x121/0x220
>> Dec 18 02:19:19 kernel: [390470.198938]  [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0
>> Dec 18 02:19:19 kernel: [390470.198947]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
>> Dec 18 02:19:19 kernel: [390470.198956]  [<ffffffff8105f750>] ? __migrate_task+0x120/0x120
>> Dec 18 02:19:19 kernel: [390470.198966]  [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
>> Dec 18 02:19:19 kernel: [390470.198974]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
>> Dec 18 02:19:19 kernel: [390470.198984]  [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
>> Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds.
>> Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0    D 0000000000000001     0 19192      2 0x00000000
>> Dec 18 02:19:19 kernel: [390470.200268]  ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000
>> Dec 18 02:19:19 kernel: [390470.200274]  ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0
>> Dec 18 02:19:19 kernel: [390470.200281]  ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080
>> Dec 18 02:19:19 kernel: [390470.200288] Call Trace:
>> Dec 18 02:19:19 kernel: [390470.200293]  [<ffffffff8165f62f>] schedule+0x3f/0x60
>> Dec 18 02:19:19 kernel: [390470.200323]  [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200330]  [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
>> Dec 18 02:19:19 kernel: [390470.200358]  [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200382]  [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200410]  [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200438]  [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200446]  [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
>> Dec 18 02:19:19 kernel: [390470.200473]  [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200481]  [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130
>> Dec 18 02:19:19 kernel: [390470.200507]  [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200535]  [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200540]  [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320
>> Dec 18 02:19:19 kernel: [390470.200568]  [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200593]  [<ffffffffa02e7bb0>] ? xfsaild_push+0x680/0x680 [xfs]
>> Dec 18 02:19:19 kernel: [390470.200599]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
>> Dec 18 02:19:19 kernel: [390470.200605]  [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
>> Dec 18 02:19:19 kernel: [390470.200611]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
>> Dec 18 02:19:19 kernel: [390470.200617]  [<ffffffff8166bcf0>] ? gs_change+0x13/0x13

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-18  2:05 XFS blocked task in xlog_cil_force_lsn Kevin Richter
  2013-12-18  3:38 ` Stan Hoeppner
@ 2013-12-18  8:33 ` Stefan Ring
  2013-12-18 22:21 ` Dave Chinner
  2 siblings, 0 replies; 30+ messages in thread
From: Stefan Ring @ 2013-12-18  8:33 UTC (permalink / raw)
  To: xfs; +Cc: Linux fs XFS

> Even a reboot is not possible anymore, because the whole system hangs
> while executing the "sync" command in one of the shutdown scripts.

This is unusual. How long have you waited?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-18  3:38 ` Stan Hoeppner
@ 2013-12-18 10:27   ` Kevin Richter
  2013-12-19 14:11     ` Stan Hoeppner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Richter @ 2013-12-18 10:27 UTC (permalink / raw)
  To: xfs

Thanks for your mails!

> This is unusual. How long have you waited?

For the reboot? One night.
After the copy process hangs: several hours. But mostly it recovers
after several minutes.

> 1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
> suitable for XFS.  This may not be a silver bullet but it will help.

Can I switch it while my copy process (from a separate hd to this raid)
is running... without data loss? Otherwise I would wait a bit, because
it has now actually been running for 8 hours without kernel panics.

> 2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
> sub optimal for your workload.

$ cat /sys/block/md2/md/stripe_cache_size
256
$ mdadm --detail /dev/md2 | grep Chunk
Chunk Size : 512K

> 3.  Post 'xfs_info /dev/mdX'

There is a LUKS volume around /dev/md2, named '6tb'.
> $ xfs_info /dev/md2
> xfs_info: /dev/md2 is not a mounted XFS filesystem
> $ xfs_info /dev/mapper/6tb
> meta-data=/dev/mapper/6tb        isize=256    agcount=32, agsize=45631360 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=1460203520, imaxpct=5
>          =                       sunit=128    swidth=384 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0


> 4.  You're getting a lot of kswapd timeouts because you have swap and
> the md/RAID6 array on the same disks.  Relocate swap to disks that are
> not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
> swap on it.  Or install more RAM in the machine.  Going the SSD route is
> better as it gives flexibility.  For instance, you can also relocate
> your syslog files to it and anything else that does IO without eating
> lots of space.  This decreases the IOPS load on your rust.

No no, swap is not on any of the raid disks.

> # cat /proc/swaps
> Filename                                Type            Size    Used    Priority
> /dev/sda3                               partition       7812496 0       -1
sda is not in the RAID; the RAID consists of sd[cdefg].


> 5.  Describe in some detail the workload(s) causing the heavy IO, and
> thus these timeouts.

cd /olddharddisk
cp -av . /raid/

/olddharddisk is a mounted 1TB old hard disk; /raid is the 6TB RAID from
above.

Heavy workload during this copy process (2 CPUs, 4 cores each):
> top - 11:13:37 up 4 days, 21:32,  2 users,  load average: 12.95, 11.33, 10.32
> Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  5.7%sy,  0.0%ni, 82.1%id, 11.8%wa,  0.0%hi,  0.3%si,  0.0%st
> Mem:  32916276k total, 32750240k used,   166036k free, 10076760k buffers
> Swap:  7812496k total,        0k used,  7812496k free, 21221136k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   699 root      20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6

Don't know what consumes all of this 32GB of RAM... 'top' sorted by memory
consumption does not tell me; all entries are only 0.0% or 0.1%.




Thanks,
Kevin




Am 18.12.2013 04:38, schrieb Stan Hoeppner:
> On 12/17/2013 8:05 PM, Kevin Richter wrote:
>> Hi,
>>
>> Around April 2012 there was a similar thread on this list, which I found
>> via Google, so my subject is the same.
>>
>> I have a RAID6 array with 5 disks (2TB each, 6TB net). While copying
>> under heavy load I always get these blocked tasks. At the bottom of this
>> message I have included some lines from the syslog.
>>
>> Even a reboot is not possible anymore, because the whole system hangs
>> while executing the "sync" command in one of the shutdown scripts.
>>
>> So... first I thought that my disks were faulty, but short and long
>> smartmontools self-tests on all 5 disks reported no errors.
>>
>> Then I even recreated the whole array, but there was no improvement.
>>
>> Details about my server: kernel 3.2.0-57-generic, Ubuntu 12.04.3 LTS
>> Details about the array: software RAID with mdadm v3.2.5, no hardware
>> RAID controller in the server
>>
>> The scheduler of the raid disks:
>>> $ cat /sys/block/sd[cdefg]/queue/scheduler
>>> noop deadline [cfq]
>>> noop deadline [cfq]
>>> noop deadline [cfq]
>>> noop deadline [cfq]
>>> noop deadline [cfq]
>>
>>
>> Any ideas what I can do?
> 
> Your workload is seeking the disks to death, which is why you're getting
> these timeouts.  The actuators simply can't keep up.
> 
> 1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
> suitable for XFS.  This may not be a silver bullet but it will help.
> 
> 2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
>> suboptimal for your workload.  For the latter
> 
> $ cat /sys/block/mdX/md/stripe_cache_size
> 
> 3.  Post 'xfs_info /dev/mdX'
> 
> 4.  You're getting a lot of kswapd timeouts because you have swap and
> the md/RAID6 array on the same disks.  Relocate swap to disks that are
> not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
> swap on it.  Or install more RAM in the machine.  Going the SSD route is
> better as it gives flexibility.  For instance, you can also relocate
> your syslog files to it and anything else that does IO without eating
> lots of space.  This decreases the IOPS load on your rust.
> 
> 5.  Describe in some detail the workload(s) causing the heavy IO, and
> thus these timeouts.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-18  2:05 XFS blocked task in xlog_cil_force_lsn Kevin Richter
  2013-12-18  3:38 ` Stan Hoeppner
  2013-12-18  8:33 ` XFS blocked task in xlog_cil_force_lsn Stefan Ring
@ 2013-12-18 22:21 ` Dave Chinner
  2 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2013-12-18 22:21 UTC (permalink / raw)
  To: Kevin Richter; +Cc: xfs

On Wed, Dec 18, 2013 at 03:05:51AM +0100, Kevin Richter wrote:
> Hi,
> 
> Around April 2012 there was a similar thread on this list, which I found
> via Google, so my subject is the same.
> 
> I have a RAID6 array with 5 disks (2TB each, 6TB net). While copying
> under heavy load I always get these blocked tasks. At the bottom of this
> message I have included some lines from the syslog.
> 
> Even a reboot is not possible anymore, because the whole system hangs
> while executing the "sync" command in one of the shutdown scripts.

Of course it will hang - it's waiting for the filesystem to make
progress, and clearly it isn't....

> 
> So... first I thought that my disks were faulty, but short and long
> smartmontools self-tests on all 5 disks reported no errors.
> 
> Then I even recreated the whole array, but there was no improvement.
> 
> Details about my server: kernel 3.2.0-57-generic, Ubuntu 12.04.3 LTS

Oh, that's quite old. I'd suggest upgrading to a much more recent
kernel as we've fixed lots of issues in this area since then.

> >  kswapd0         D ffffffff81806240     0    51      2 0x00000000
> >   ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78
> >   ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0
> >   ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28
> >  Call Trace:
> >   [<ffffffff8165f62f>] schedule+0x3f/0x60
> >   [<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150
> >   [<ffffffff8166004a>] mutex_lock+0x2a/0x50
> >   [<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs]

So, reclaim is stalled, meaning there are other threads blocked
reclaiming inodes....

> >  INFO: task kswapd1:52 blocked for more than 120 seconds.
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  kswapd1         D 0000000000000007     0    52      2 0x00000000
> >   ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff
> >   ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0
> >   ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004
> >  Call Trace:
> >   [<ffffffff8165f62f>] schedule+0x3f/0x60
> >   [<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs]
> >   [<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs]
> >   [<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs]
> >   [<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs]
> >   [<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs]
> >   [<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs]

Yup, there's the stalled reclaim, waiting on a log force in progress
to complete.

> > Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds.
> > Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0    D 0000000000000001     0 19192      2 0x00000000
> > Dec 18 02:19:19 kernel: [390470.200268]  ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000
> > Dec 18 02:19:19 kernel: [390470.200274]  ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0
> > Dec 18 02:19:19 kernel: [390470.200281]  ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080
> > Dec 18 02:19:19 kernel: [390470.200288] Call Trace:
> > Dec 18 02:19:19 kernel: [390470.200293]  [<ffffffff8165f62f>] schedule+0x3f/0x60
> > Dec 18 02:19:19 kernel: [390470.200323]  [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200330]  [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
> > Dec 18 02:19:19 kernel: [390470.200358]  [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200382]  [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200410]  [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200438]  [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200446]  [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
> > Dec 18 02:19:19 kernel: [390470.200473]  [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200481]  [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130
> > Dec 18 02:19:19 kernel: [390470.200507]  [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200535]  [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs]
> > Dec 18 02:19:19 kernel: [390470.200540]  [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320
> > Dec 18 02:19:19 kernel: [390470.200568]  [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs]

And there's the log force that is in progress, and that is waiting
on log buffer space to become available to finish the log force.
That's waiting on IO completion to occur.

IOWs, there's nothing that points directly at an XFS issue here.
There might be a problem, but it's just as likely a hardware or IO
path bug that is preventing an IO completion from occurring and that
is what XFS is getting hung up on....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-18 10:27   ` Kevin Richter
@ 2013-12-19 14:11     ` Stan Hoeppner
  2013-12-20 10:26       ` Kevin Richter
  0 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-19 14:11 UTC (permalink / raw)
  To: xfs, xfs

On 12/18/2013 4:27 AM, Kevin Richter wrote:
> Thanks for your mails!
> 
>> This is unusual. How long have you waited?
> 
> For the reboot? One night.
> After the copy process hangs: several hours. But mostly it recovers
> after several minutes.
> 
>> 1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
>> suitable for XFS.  This may not be a silver bullet but it will help.
> 
> Can I switch it while my copy process (from a separate hd to this raid)
> is running... without data loss? Otherwise I would wait a bit, because
> it has now actually been running for 8 hours without kernel panics.

Yes, you can safely change the elevator on the fly.  Repeat the
following once for each disk, replacing X appropriately.

$ echo deadline > /sys/block/sdX/queue/scheduler

This on the fly change is not permanent.  Add "elevator=deadline" to
your kernel command line in your bootloader config file.  Or you can add
the commands to a startup script.  Or if you build your own kernels
simply make it the default, and/or remove all the other elevators from
your kernel config, which is what I do.
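
For example, on a stock Ubuntu install the bootloader route looks roughly
like this (a sketch; the exact file and the existing options depend on
your setup): edit /etc/default/grub so the kernel command line includes
the new option, e.g.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"

and then regenerate the boot config:

$ sudo update-grub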

>> 2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
>> sub optimal for your workload.
> 
> $ cat /sys/block/md2/md/stripe_cache_size
> 256

256 is the default and it is way too low.  This is limiting your write
throughput.  Increase this to a minimum of 1024, which will give you a
20MB stripe cache buffer.  This should become active immediately.  Add
it to a startup script to make it permanent.

$ echo 1024 > /sys/block/md2/md/stripe_cache_size

(4096 bytes * stripe_cache_size) * num_drives = RAM consumed for stripe cache
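
With your 5 drives that works out to roughly:

  1024 * 4096 bytes * 5 = ~20MB
  2048 * 4096 bytes * 5 = ~40MB
  4096 * 4096 bytes * 5 = ~80MB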

At some future point when the system is idle, play with FIO write
testing and multiple jobs (threads) with stripe_cache_size values up to
2048 to see which value gives the best throughput.  Test 1024, 2048,
4096.  Going higher probably won't gain anything.  And at some point as
you go higher your throughput will decrease.  When you see that, select
the previous value and make it permanent.

> $ mdadm --detail /dev/md2 | grep Chunk
> Chunk Size : 512K
> 
>> 3.  Post 'xfs_info /dev/mdX'
> 
> There is a LUKS volume around /dev/md2, named '6tb'.

Ok, this adds a huge fly to the ointment.  LUKS is single threaded per
device.  You've encrypted the md device instead of the individual disks.
This limits your encryption throughput to a single CPU core.  You've
not provided any data on your LUKS thread.  Is it pegging a core during
large file IO?  I'd bet it is.  That would explain all of this.
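
One rough way to check while a big copy is running (a sketch; the kernel
crypto threads may show up as kcryptd or as generic kworker threads,
depending on kernel version):

$ top -H          (look for a single kernel thread pinned near 100%)
$ perf top        (if installed; crypto routines dominating the profile
                   would confirm it)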

>> $ xfs_info /dev/md2
>> xfs_info: /dev/md2 is not a mounted XFS filesystem
>> $ xfs_info /dev/mapper/6tb
>> meta-data=/dev/mapper/6tb        isize=256    agcount=32, agsize=45631360 blks
>>          =                       sectsz=512   attr=2
>> data     =                       bsize=4096   blocks=1460203520, imaxpct=5
>>          =                       sunit=128    swidth=384 blks
>> naming   =version 2              bsize=4096   ascii-ci=0
>> log      =internal               bsize=4096   blocks=521728, version=2
>>          =                       sectsz=512   sunit=8 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0

md2 is a 5-spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width.  The
XFS alignment matches the md device.  This may be hurting you with LUKS
between XFS and md.  I don't know how LUKS handles 1.5MB inbound writes
and if it passes an aligned write to md after encrypting.  I also don't
know how it handles write barriers passed down the stack by XFS, or if
it does at all.
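
For reference, the alignment math from the xfs_info output above
(bsize=4096):

  sunit  = 128 blocks * 4096 = 512KB  = the md chunk size
  swidth = 384 blocks * 4096 = 1.5MB  = 3 data spindles * 512KB
                                        (5 disks minus 2 parity)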

>> 4.  You're getting a lot of kswapd timeouts because you have swap and
>> the md/RAID6 array on the same disks.  Relocate swap to disks that are
>> not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
>> swap on it.  Or install more RAM in the machine.  Going the SSD route is
>> better as it gives flexibility.  For instance, you can also relocate
>> your syslog files to it and anything else that does IO without eating
>> lots of space.  This decreases the IOPS load on your rust.
> 
> No no, swap is not on any of the raid disks.
> 
>> # cat /proc/swaps
>> Filename                                Type            Size    Used    Priority
>> /dev/sda3                               partition       7812496 0       -1
> sda is not in the RAID; the RAID consists of sd[cdefg].

I was wrong here.  After reading a little about kswapd I now know it
doesn't simply move pages to/from disk.  It frees memory in other ways
as well.  In this case it's apparently waiting to free an inode being
reclaimed by xfs_reclaim_inode, which is stalled on lower level IO.  So
kswapd times out as a result.  A slow LUKS thread would explain this as
well.

>> 5.  Describe in some detail the workload(s) causing the heavy IO, and
>> thus these timeouts.
> 
> cd /olddharddisk
> cp -av . /raid/
> 
>> /olddharddisk is a mounted 1TB old hard disk; /raid is the 6TB RAID from
>> above.

So you are simply doing a huge file copy from an apparently slower
single disk to an apparently faster striped RAID.  But you get horrible
throughput and kernel timeout errors, and the RAID disks are apparently
fairly idle.

> Heavy workload during this copy process (2 CPUs, 4 cores each):

Model # of the CPUs so I can look up the specs?

>> top - 11:13:37 up 4 days, 21:32,  2 users,  load average: 12.95, 11.33, 10.32
>> Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.0%us,  5.7%sy,  0.0%ni, 82.1%id, 11.8%wa,  0.0%hi,  0.3%si,  0.0%st
>> Mem:  32916276k total, 32750240k used,   166036k free, 10076760k buffers
>> Swap:  7812496k total,        0k used,  7812496k free, 21221136k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>   699 root      20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6
> 
>> Don't know what consumes all of this 32GB of RAM... 'top' sorted by memory
>> consumption does not tell me; all entries are only 0.0% or 0.1%.

~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
and page cache (20GB).  Linux uses nearly all free memory for cache and
buffers to prevent disk accesses, which speeds up access to frequently
used data considerably.  This is normal.  A huge copy like this is going
to cause a huge amount of caching.
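
If you want to watch that breakdown yourself, something like this shows
it (field names as in /proc/meminfo):

$ free -m
$ grep -E 'Buffers|^Cached|Dirty' /proc/meminfo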

I assume this 1TB file copy has finally completed.  Start another large
test copy and paste the CPU burn for the LUKS thread so we can confirm
whether this is the root cause of your problem.

-- 
Stan



> 
> 
> 
> Thanks,
> Kevin
> 
> 
> 
> 
> Am 18.12.2013 04:38, schrieb Stan Hoeppner:
>> On 12/17/2013 8:05 PM, Kevin Richter wrote:
>>> Hi,
>>>
>>> Around April 2012 there was a similar thread on this list, which I found
>>> via Google, so my subject is the same.
>>>
>>> I have a RAID6 array with 5 disks (2TB each, 6TB net). While copying
>>> under heavy load I always get these blocked tasks. At the bottom of this
>>> message I have included some lines from the syslog.
>>>
>>> Even a reboot is not possible anymore, because the whole system hangs
>>> while executing the "sync" command in one of the shutdown scripts.
>>>
>>> So... first I thought that my disks were faulty, but short and long
>>> smartmontools self-tests on all 5 disks reported no errors.
>>>
>>> Then I even recreated the whole array, but there was no improvement.
>>>
>>> Details about my server: kernel 3.2.0-57-generic, Ubuntu 12.04.3 LTS
>>> Details about the array: software RAID with mdadm v3.2.5, no hardware
>>> RAID controller in the server
>>>
>>> The scheduler of the raid disks:
>>>> $ cat /sys/block/sd[cdefg]/queue/scheduler
>>>> noop deadline [cfq]
>>>> noop deadline [cfq]
>>>> noop deadline [cfq]
>>>> noop deadline [cfq]
>>>> noop deadline [cfq]
>>>
>>>
>>> Any ideas what I can do?
>>
>> Your workload is seeking the disks to death, which is why you're getting
>> these timeouts.  The actuators simply can't keep up.
>>
>> 1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
>> suitable for XFS.  This may not be a silver bullet but it will help.
>>
>> 2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
>> suboptimal for your workload.  For the latter
>>
>> $ cat /sys/block/mdX/md/stripe_cache_size
>>
>> 3.  Post 'xfs_info /dev/mdX'
>>
>> 4.  You're getting a lot of kswapd timeouts because you have swap and
>> the md/RAID6 array on the same disks.  Relocate swap to disks that are
>> not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
>> swap on it.  Or install more RAM in the machine.  Going the SSD route is
>> better as it gives flexibility.  For instance, you can also relocate
>> your syslog files to it and anything else that does IO without eating
>> lots of space.  This decreases the IOPS load on your rust.
>>
>> 5.  Describe in some detail the workload(s) causing the heavy IO, and
>> thus these timeouts.
>>
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-19 14:11     ` Stan Hoeppner
@ 2013-12-20 10:26       ` Kevin Richter
  2013-12-20 12:36         ` Stan Hoeppner
  2013-12-20 22:43         ` Arkadiusz Miśkiewicz
  0 siblings, 2 replies; 30+ messages in thread
From: Kevin Richter @ 2013-12-20 10:26 UTC (permalink / raw)
  To: stan, xfs

Thanks a lot for your huge replies!

> Oh, that's quite old. I'd suggest upgrading to a much more recent
> kernel as we've fixed lots of issues in this area since then.

First I have switched to the newer kernel from Ubuntu Saucy:
$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP

Thus, it seems that the default scheduler has been changed to deadline.
I did not change anything. After a reboot the schedulers of all disks
are now deadline.

> Model # of the CPUs so I can look up the specs?
Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

I have prepared a folder with 60GB of data for the tests. This is nearly
twice the available memory, so the process should be forced to actually
write the data to disk and not just hold it in memory.

> $ echo 256 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /6tb/foo1/
> real    25m38.925s
> user    0m0.595s
> sys     1m23.182s
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo2/
> real    7m32.824s
> user    0m0.438s
> sys     1m6.759s
> 
> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo3/
> real    5m32.847s
> user    0m0.418s
> sys     1m5.671s
> 
> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo4/
> real    5m54.554s
> user    0m0.437s
> sys     1m6.268s

The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is about 180MB/s. That sounds a bit high, doesn't it?
The RAID consists of only 5 SATA disks at 7200rpm.

'top' while copying with a stripe_cache_size of 2048 (the source disk is NTFS):
> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1


And the best thing of all:
during all of these tests there were no warnings/panics in the syslog.


With best regards,
Kevin



Am 19.12.2013 15:11, schrieb Stan Hoeppner:
> On 12/18/2013 4:27 AM, Kevin Richter wrote:
>> Thanks for your mails!
>>
>>> This is unusual. How long have you waited?
>>
>> For the reboot? One night.
>> After the copy process hangs: several hours. But mostly it recovers
>> after several minutes.
>>
>>> 1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
>>> suitable for XFS.  This may not be a silver bullet but it will help.
>>
>> Can I switch it while my copy process (from a separate hd to this raid)
>> is running... without data loss? Otherwise I would wait a bit, because
>> it has now actually been running for 8 hours without kernel panics.
> 
> Yes, you can safely change the elevator on the fly.  Repeat the
> following once for each disk, replacing X appropriately.
> 
> $ echo deadline > /sys/block/sdX/queue/scheduler
> 
> This on the fly change is not permanent.  Add "elevator=deadline" to
> your kernel command line in your bootloader config file.  Or you can add
> the commands to a startup script.  Or if you build your own kernels
> simply make it the default, and/or remove all the other elevators from
> your kernel config, which is what I do.
> 
>>> 2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
>>> suboptimal for your workload.
>>
>> $ cat /sys/block/md2/md/stripe_cache_size
>> 256
> 
> 256 is the default and it is way too low.  This is limiting your write
> throughput.  Increase this to a minimum of 1024, which will give you a
> 20MB stripe cache buffer.  This should become active immediately.  Add
> it to a startup script to make it permanent.
> 
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
> 
> (4096 bytes * stripe_cache_size) * num_drives = RAM consumed for stripe cache
> 
> At some future point when the system is idle, play with FIO write
> testing and multiple jobs (threads) with stripe_cache_size values up to
> 2048 to see which value gives the best throughput.  Test 1024, 2048,
> 4096.  Going higher probably won't gain anything.  And at some point as
> you go higher your throughput will decrease.  When you see that, select
> the previous value and make it permanent.
> 
>> $ mdadm --detail /dev/md2 | grep Chunk
>> Chunk Size : 512K
>>
>>> 3.  Post 'xfs_info /dev/mdX'
>>
>> There is a LUKS volume around /dev/md2, named '6tb'.
> 
> Ok, this adds a huge fly to the ointment.  LUKS is single threaded per
> device.  You've encrypted the md device instead of the individual disks.
>  This limits your encryption throughput to a single CPU core.  You've
> not provided any data on your LUKS thread.  Is it pegging a core during
> large file IO?  I'd bet it is.  That would explain all of this.
> 
>>> $ xfs_info /dev/md2
>>> xfs_info: /dev/md2 is not a mounted XFS filesystem
>>> $ xfs_info /dev/mapper/6tb
>>> meta-data=/dev/mapper/6tb        isize=256    agcount=32, agsize=45631360 blks
>>>          =                       sectsz=512   attr=2
>>> data     =                       bsize=4096   blocks=1460203520, imaxpct=5
>>>          =                       sunit=128    swidth=384 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0
>>> log      =internal               bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=512   sunit=8 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> md2 is a 5-spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width.  The
> XFS alignment matches the md device.  This may be hurting you with LUKS
> between XFS and md.  I don't know how LUKS handles 1.5MB inbound writes
> and if it passes an aligned write to md after encrypting.  I also don't
> know how it handles write barriers passed down the stack by XFS, or if
> it does at all.
> 
>>> 4.  You're getting a lot of kswapd timeouts because you have swap and
>>> the md/RAID6 array on the same disks.  Relocate swap to disks that are
>>> not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
>>> swap on it.  Or install more RAM in the machine.  Going the SSD route is
>>> better as it gives flexibility.  For instance, you can also relocate
>>> your syslog files to it and anything else that does IO without eating
>>> lots of space.  This decreases the IOPS load on your rust.
>>
>> No no, swap is not on any of the raid disks.
>>
>>> # cat /proc/swaps
>>> Filename                                Type            Size    Used    Priority
>>> /dev/sda3                               partition       7812496 0       -1
>> sda is not in the RAID; the RAID consists of sd[cdefg].
> 
> I was wrong here.  After reading a little about kswapd I now know it
> doesn't simply move pages to/from disk.  It frees memory in other ways
> as well.  In this case it's apparently waiting to free an inode being
> reclaimed by xfs_reclaim_inode, which is stalled on lower level IO.  So
> kswapd times out as a result.  A slow LUKS thread would explain this as
> well.
> 
>>> 5.  Describe in some detail the workload(s) causing the heavy IO, and
>>> thus these timeouts.
>>
>> cd /olddharddisk
>> cp -av . /raid/
>>
>> /olddharddisk is a mounted 1TB old hard disk; /raid is the 6TB RAID from
>> above.
> 
> So you are simply doing a huge file copy from an apparently slower
> single disk to an apparently faster striped RAID.  But you get horrible
> throughput and kernel timeout errors, and the RAID disks are apparently
> fairly idle.
> 
>> Heavy workload during this copy process (2 CPUs, 4 cores each):
> 
> Model # of the CPUs so I can look up the specs?
> 
>>> top - 11:13:37 up 4 days, 21:32,  2 users,  load average: 12.95, 11.33, 10.32
>>> Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us,  5.7%sy,  0.0%ni, 82.1%id, 11.8%wa,  0.0%hi,  0.3%si,  0.0%st
>>> Mem:  32916276k total, 32750240k used,   166036k free, 10076760k buffers
>>> Swap:  7812496k total,        0k used,  7812496k free, 21221136k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>   699 root      20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6
>>
>> Don't know what consumes all of this 32GB of RAM... 'top' sorted by memory
>> consumption does not tell me; all entries are only 0.0% or 0.1%.
> 
> ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
> and page cache (20GB).  Linux uses nearly all free memory for cache and
> buffers to prevent disk accesses, which speeds up access to frequently
> used data considerably.  This is normal.  A huge copy like this is going
> to cause a huge amount of caching.
> 
> I assume this 1TB file copy has finally completed.  Start another large
> test copy and paste the CPU burn for the LUKS thread so we can confirm
> whether this is the root cause of your problem.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-20 10:26       ` Kevin Richter
@ 2013-12-20 12:36         ` Stan Hoeppner
  2013-12-21  5:30           ` Dave Chinner
  2013-12-22  2:35           ` Kevin Richter
  2013-12-20 22:43         ` Arkadiusz Miśkiewicz
  1 sibling, 2 replies; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-20 12:36 UTC (permalink / raw)
  To: xfs, xfs

On 12/20/2013 4:26 AM, Kevin Richter wrote:
> Thanks a lot for your huge replies!

You're welcome.

>> Oh, that's quite old. I'd suggest upgrading to a much more recent
>> kernel as we've fixed lots of issues in this area since then.
> 
> First I have switched to the newer kernel from Ubuntu Saucy:
> $ uname -a
> Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP
> 
> Thus, it seems that the default scheduler has been changed to deadline.
> I did not change anything. After a reboot the schedulers of all disks
> are now deadline.

Good move Ubuntu.

>> Model # of the CPUs so I can look up the specs?
> Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

Strong CPUs.

> I have prepared a folder with 60GB of data for the tests. This is nearly
> twice the available memory, so the process should be forced to actually
> write the data to disk and not just hold it in memory.
> 
>> $ echo 256 > /sys/block/md2/md/stripe_cache_size
>> $ time cp -a /olddisk/testfolder /6tb/foo1/
>> real    25m38.925s
>> user    0m0.595s
>> sys     1m23.182s
>>
>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>> $ time cp -a /olddisk/testfolder /raid/foo2/
>> real    7m32.824s
>> user    0m0.438s
>> sys     1m6.759s
>>
>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>> $ time cp -a /olddisk/testfolder /raid/foo3/
>> real    5m32.847s
>> user    0m0.418s
>> sys     1m5.671s
>>
>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>> $ time cp -a /olddisk/testfolder /raid/foo4/
>> real    5m54.554s
>> user    0m0.437s
>> sys     1m6.268s
>
> The difference is really amazing! So 2048 seems to be the best choice.
> 60GB in 5.5 minutes is about 180MB/s. That sounds a bit high, doesn't it?
> The RAID consists of only 5 SATA disks at 7200rpm.

A lot of the source data is being cached between runs so these numbers
aren't accurate.  The throughput of this copy operation will be limited
by the speed of the single source disk, not the array.  To make the
elapsed times of this copy test accurate you need to execute something
like these commands after each run:

# sync
# echo 3 > /proc/sys/vm/drop_caches

But this copy test will not inform you about the potential peak
performance of your array.  That's why I suggested you test with FIO,
the flexible IO tester.

# aptitude install fio
# man fio

Sample job file suitable for your system:

[global]
directory=/your/XFS/test/directory
zero_buffers
numjobs=8
group_reporting
blocksize=512k
ioengine=libaio
iodepth=16
direct=1
size=10g

[read]
rw=read
stonewall

[write]
rw=write
stonewall

This should give you a relatively accurate picture of the actual
potential throughput of your array and filesystem.
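
To run it, a minimal sketch (the job file name raid-test.fio is just an
example; use any path you like):

# sync; echo 3 > /proc/sys/vm/drop_caches
# fio raid-test.fio

The 'Run status group' aggregate lines at the end are the numbers worth
comparing between stripe_cache_size settings.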

> 'top' while copying with stripe size of 2048 (the source disk is ntfs):
>> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
>> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
>> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
>> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
>> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
>> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
>> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
>> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
>> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
>> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1

Hmm, what's kworker/3:1?  That's not a crypto thread eating 93% of a
SandyBridge core at only ~180 MB/s throughput is it?

> And the best thing at all:
> While all of these tests there are no warnings/panics in the syslog.
> 
> With best regards,
> Kevin

Even though XFS wasn't the cause of the problem I'm glad we were able to
help you fix it nonetheless.  I'm really curious to see what kind of
throughput you can achieve with FIO, and whether crypto is a bottleneck
at the 250-350 MB/s your array should be capable of.  It would be great
if you would play around a bit with FIO and post some numbers.

--
Stan


> Am 19.12.2013 15:11, schrieb Stan Hoeppner:
>> On 12/18/2013 4:27 AM, Kevin Richter wrote:
>>> Thanks for your mails!
>>>
>>>> This is unusual. How long have you waited?
>>>
>>> For the reboot? One night.
>>> After the copy process hangs: several hours. But mostly it recovers
>>> after several minutes.
>>>
>>>> 1.  Switch to deadline.  CFQ is not suitable for RAID storage, and not
>>>> suitable for XFS.  This may not be a silver bullet but it will help.
>>>
>>> Can I switch it while my copy process (from a separate hd to this raid)
>>> is running... without data loss? Otherwise I would wait a bit, because
>>> now it is actually running for 8 hours without kernel panics.
>>
>> Yes, you can safely change the elevator on the fly.  Repeat the
>> following once for each disk, replacing X appropriately.
>>
>> $ echo deadline > /sys/block/sdX/queue/scheduler
>>
>> This on the fly change is not permanent.  Add "elevator=deadline" to
>> your kernel command line in your bootloader config file.  Or you can add
>> the commands to a startup script.  Or if you build your own kernels
>> simply make it the default, and/or remove all the other elevators from
>> your kernel config, which is what I do.
>>
>>>> 2.  Post your chunk size and RAID6 stripe_cache_size value.  They may be
>>>> sub optimal for your workload.
>>>
>>> $ cat /sys/block/md2/md/stripe_cache_size
>>> 256
>>
>> 256 is the default and it is way too low.  This is limiting your write
>> throughput.  Increase this to a minimum of 1024 which will give you a
>> 20MB stripe cache buffer.  This should become active immediately.  Add
>> it to a startup script to make it permanent.
>>
>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>>
>> ((4096*stripe_cache_size)*num_drives)= RAM consumed for stripe cache
>>
>> At some future point when the system is idle, play with FIO write
>> testing and multiple jobs (threads) with stripe_cache_size values up to
>> 2048 to see which value gives the best throughput.  Test 1024, 2048,
>> 4096.  Going higher probably won't gain anything.  And at some point as
>> you go higher your throughput will decrease.  When you see that, select
>> the previous value and make it permanent.
>>
>>> $ mdadm --detail /dev/md2 | grep Chunk
>>> Chunk Size : 512K
>>>
>>>> 3.  Post 'xfs_info /dev/mdX'
>>>
>>> There is a LUKS volume around /dev/md2, named '6tb'.
>>
>> Ok, this adds a huge fly to the ointment.  LUKS is single threaded per
>> device.  You've encrypted the md device instead of the individual disks.
>>  This limits your encryption throughput to a single CPU core.  You've
>> not provided any data on your LUKS thread.  Is it pegging a core during
>> large file IO?  I'd bet it is.  That would explain all of this.
>>
>>>> $ xfs_info /dev/md2
>>>> xfs_info: /dev/md2 is not a mounted XFS filesystem
>>>> $ xfs_info /dev/mapper/6tb
>>>> meta-data=/dev/mapper/6tb        isize=256    agcount=32, agsize=45631360 blks
>>>>          =                       sectsz=512   attr=2
>>>> data     =                       bsize=4096   blocks=1460203520, imaxpct=5
>>>>          =                       sunit=128    swidth=384 blks
>>>> naming   =version 2              bsize=4096   ascii-ci=0
>>>> log      =internal               bsize=4096   blocks=521728, version=2
>>>>          =                       sectsz=512   sunit=8 blks, lazy-count=1
>>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>> md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width.  The
>> XFS alignment matches the md device.  This may be hurting you with LUKS
>> between XFS and md.  I don't know how LUKS handles 1.5MB inbound writes
>> and if it passes an aligned write to md after encrypting.  I also don't
>> know how it handles write barriers passed down the stack by XFS, or if
>> it does at all.
>>
>>>> 4.  You're getting a lot of kswapd timeouts because you have swap and
>>>> the md/RAID6 array on the same disks.  Relocate swap to disks that are
>>>> not part of this RAID6.  Small SSDs are cheap and fast.  Buy one and put
>>>> swap on it.  Or install more RAM in the machine.  Going the SSD route is
>>>> better as it gives flexibility.  For instance, you can also relocate
>>>> your syslog files to it and anything else that does IO without eating
>>>> lots of space.  This decreases the IOPS load on your rust.
>>>
>>> No no, swap is not on any of the raid disks.
>>>
>>>> # cat /proc/swaps
>>>> Filename                                Type            Size    Used    Priority
>>>> /dev/sda3                               partition       7812496 0       -1
>>> sda is not in the raid. In the raid there are sd[cdefg].
>>
>> I was wrong here.  After reading a little about kswapd I now know it
>> doesn't simply move pages to/from disk.  It frees memory in other ways
>> as well.  In this case it's apparently waiting to free an inode being
>> reclaimed by xfs_reclaim_inode, which is stalled on lower level IO.  So
>> kswapd times out as a result.  A slow LUKS thread would explain this as
>> well.
>>
>>>> 5.  Describe in some detail the workload(s) causing the heavy IO, and
>>>> thus these timeouts.
>>>
>>> cd /olddharddisk
>>> cp -av . /raid/
>>>
>>> oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from
>>> above.
>>
>> So you are simply doing a huge file copy from an apparently slower
>> single disk to an apparently faster striped RAID.  But you get horrible
>> throughput and kernel timeout errors, and the RAID disks are apparently
>> fairly idle.
>>
>>> Heavy workload while this copy process (2 CPUs, each 4 cores):
>>
>> Model # of the CPUs so I can look up the specs?
>>
>>>> top - 11:13:37 up 4 days, 21:32,  2 users,  load average: 12.95, 11.33, 10.32
>>>> Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s):  0.0%us,  5.7%sy,  0.0%ni, 82.1%id, 11.8%wa,  0.0%hi,  0.3%si,  0.0%st
>>>> Mem:  32916276k total, 32750240k used,   166036k free, 10076760k buffers
>>>> Swap:  7812496k total,        0k used,  7812496k free, 21221136k cached
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>   699 root      20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6
>>>
>>> Dont know what consumes all of this 32GB RAM... 'top' sorted by memory
>>> consumption does not tell me. All entries are only 0.0% and 0.1%
>>
>> ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
>> and page cache (20GB).  Linux uses nearly all free memory for cache and
>> buffers to prevent disk accesses, which speeds up access to frequently
>> used data considerably.  This is normal.  A huge copy like this is going
>> to cause a huge amount of caching.
>>
>> I assume this 1TB file copy has finally completed.  Start another large
>> test copy and paste the CPU burn for the LUKS thread so we can confirm
>> whether this is the root cause of your problem.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-20 10:26       ` Kevin Richter
  2013-12-20 12:36         ` Stan Hoeppner
@ 2013-12-20 22:43         ` Arkadiusz Miśkiewicz
  2013-12-21 11:18             ` Stan Hoeppner
  1 sibling, 1 reply; 30+ messages in thread
From: Arkadiusz Miśkiewicz @ 2013-12-20 22:43 UTC (permalink / raw)
  To: xfs, xfs; +Cc: linux-raid, stan

On Friday 20 of December 2013, Kevin Richter wrote:

> >> $ cat /sys/block/md2/md/stripe_cache_size
> >> 256
> > 
> > 256 is the default and it is way too low.  This is limiting your write
> > throughput.  Increase this to a minimum of 1024 which will give you a
> > 20MB stripe cache buffer.  This should become active immediately.  Add
> > it to a startup script to make it permanent.

> > $ echo 256 > /sys/block/md2/md/stripe_cache_size
> > $ time cp -a /olddisk/testfolder /6tb/foo1/
> > real    25m38.925s
> > user    0m0.595s
> > sys     1m23.182s
> > 
> > $ echo 1024 > /sys/block/md2/md/stripe_cache_size
> > $ time cp -a /olddisk/testfolder /raid/foo2/
> > real    7m32.824s
> > user    0m0.438s
> > sys     1m6.759s
> > 
> > $ echo 2048 > /sys/block/md2/md/stripe_cache_size
> > $ time cp -a /olddisk/testfolder /raid/foo3/
> > real    5m32.847s
> > user    0m0.418s
> > sys     1m5.671s
> > 
> > $ echo 4096 > /sys/block/md2/md/stripe_cache_size
> > $ time cp -a /olddisk/testfolder /raid/foo4/
> > real    5m54.554s
> > user    0m0.437s
> > sys     1m6.268s
> 
> The difference is really amazing! So 2048 seems to be the best choice.
> 60GB in 5.5 minutes is about 180MB/s. That sounds a bit high, doesn't it?
> The RAID only consists of 5 SATA disks with 7200rpm.

I wonder why the kernel is giving defaults that everyone repeatedly
recommends changing/increasing? Has anyone tried to report a bug for the
stripe_cache_size case?

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-20 12:36         ` Stan Hoeppner
@ 2013-12-21  5:30           ` Dave Chinner
  2013-12-22  9:18             ` Stan Hoeppner
  2013-12-22  2:35           ` Kevin Richter
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2013-12-21  5:30 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs, xfs

On Fri, Dec 20, 2013 at 06:36:33AM -0600, Stan Hoeppner wrote:
> On 12/20/2013 4:26 AM, Kevin Richter wrote:
> > 'top' while copying with stripe size of 2048 (the source disk is ntfs):
> >> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
> >> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
> >> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
> >> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
> >> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
> >> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
> >> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
> >>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
> >>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
> >> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
> >> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
> >>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
> >>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
> >> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
> >> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1
> 
> Hmm, what's kworker/3:1?  That's not a crypto thread eating 93% of a
> SandyBridge core at only ~180 MB/s throughput is it?

Kworkers are anonymous kernel worker threads that do work that
has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread
on CPU 3 (3:0 is the first). The kworkers form a thread pool that grows
and shrinks according to demand.

As the naming suggests, kworker threads are per-CPU, and there can
be hundreds of them per CPU if there is enough workqueue work that blocks
during execution (e.g. on locks, waiting for IO, etc).
If there is little blocking occurring, there might only be a couple
of kworker threads that do all the work, and hence you see them
consuming huge amounts of CPU on behalf of other subsystems...

XFS uses workqueues for lots of things, so it's not unusual to see
an IO or metadata heavy workload end up with huge numbers of
kworker threads doing work:

<run parallel fsmark test>
....
$ ps -ef |grep kworker |wc -l
91
$
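
If you want to see what one of these busy kworkers is actually doing,
sampling its kernel stack is usually enough. A rough sketch (19524 is the
busy kworker PID from the top output above; substitute the real one, run
as root):

# for i in 1 2 3 4 5; do cat /proc/19524/stack; echo ----; sleep 1; done

If the same crypto or md function keeps showing up, that's your CPU hog.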

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
  2013-12-20 22:43         ` Arkadiusz Miśkiewicz
@ 2013-12-21 11:18             ` Stan Hoeppner
  0 siblings, 0 replies; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-21 11:18 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: linux-raid, xfs

I renamed the subject as your question doesn't really apply to XFS, or
the OP, but to md-RAID.

On 12/20/2013 4:43 PM, Arkadiusz Miśkiewicz wrote:

> I wonder why the kernel is giving defaults that everyone repeatedly
> recommends changing/increasing? Has anyone tried to report a bug for the
> stripe_cache_size case?

The answer is balancing default md-RAID5/6 write performance against
kernel RAM consumption, with more weight given to the latter.  The formula:

((4096*stripe_cache_size)*num_drives)= RAM consumed for stripe cache

High stripe_cache_size values will cause the kernel to eat non trivial
amounts of RAM for the stripe cache buffer.  This table demonstrates the
effect today for typical RAID5/6 disk counts.

stripe_cache_size	drives	RAM consumed
256			 4	  4 MB
			 8	  8 MB
			16	 16 MB
512			 4	  8 MB
			 8	 16 MB
			16	 32 MB
1024			 4	 16 MB
			 8	 32 MB
			16	 64 MB
2048			 4	 32 MB
			 8	 64 MB
			16	128 MB
4096			 4	 64 MB
			 8	128 MB
			16	256 MB
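
For the 5-drive array in this thread the same arithmetic works out to,
e.g. at 2048 (a quick shell check):

$ echo $(( (4096 * 2048 * 5) / 1024 / 1024 ))MB
40MB

and 80MB at 4096.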

The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures.  The default
md-RAID5/6 stripe_cache_size yields 1MB consumed per member device.

With SSDs becoming mainstream, and becoming ever faster, at some point
the md-RAID5/6 architecture will have to be redesigned because of the
memory footprint required for performance.  Currently the required size
of the stripe cache appears directly proportional to the aggregate write
throughput of the RAID devices.  Thus the optimal value will vary
greatly from one system to another depending on the throughput of the
drives.

For example, I assisted a user with 5x Intel SSDs back in January and
his system required 4096, or 80MB of RAM for stripe cache, to reach
maximum write throughput of the devices.  This yielded 600MB/s or 60%
greater throughput than 2048, or 40MB RAM for cache.  In his case 60MB
more RAM than the default was well worth the increase as the machine was
an iSCSI target server with 8GB RAM.

In the previous case with 5x rust RAID6 the 2048 value seemed optimal
(though not yet verified), requiring 40MB less RAM than the 5x Intel
SSDs.  For a 3-drive modern rust RAID5 the default of 256, or 3MB, is close to
optimal but maybe a little low.  Consider that 256 has been the default
for a very long time, and was selected back when average drive
throughput was much much lower, as in 50MB/s or less, SSDs hadn't yet
been invented, and system memories were much smaller.

Due to the massive difference in throughput between rust and SSD, any
meaningful change in the default really requires new code to sniff out
what type of devices constitute the array, if that's possible, and it
probably isn't, and set a lowish default accordingly.  Again, SSDs
didn't exist when md-RAID was coded, nor when this default was set, and
this throws a big monkey wrench into these spokes.
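
(At least the rotational-vs-SSD part is already exposed in sysfs; a rough
sketch of what such code could key off, using the md2 array from this
thread:

$ for d in /sys/block/md2/slaves/*; do echo "$d: rotational=$(cat $d/queue/rotational)"; done

where 1 means rotating rust and 0 means SSD.)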

-- 
Stan

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
  2013-12-21 11:18             ` Stan Hoeppner
@ 2013-12-21 12:20               ` Piergiorgio Sartor
  -1 siblings, 0 replies; 30+ messages in thread
From: Piergiorgio Sartor @ 2013-12-21 12:20 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Arkadiusz Miśkiewicz, linux-raid, xfs

On Sat, Dec 21, 2013 at 05:18:42AM -0600, Stan Hoeppner wrote:
> I renamed the subject as your question doesn't really apply to XFS, or
> the OP, but to md-RAID.
> 
> On 12/20/2013 4:43 PM, Arkadiusz Miśkiewicz wrote:
> 
> > I wonder why the kernel is giving defaults that everyone repeatedly
> > recommends changing/increasing? Has anyone tried to report a bug for the
> > stripe_cache_size case?
> 
> The answer is balancing default md-RAID5/6 write performance against
> kernel RAM consumption, with more weight given to the latter.  The formula:
> 
> ((4096*stripe_cache_size)*num_drives)= RAM consumed for stripe cache
> 
> High stripe_cache_size values will cause the kernel to eat non trivial
> amounts of RAM for the stripe cache buffer.  This table demonstrates the
> effect today for typical RAID5/6 disk counts.
> 
> stripe_cache_size	drives	RAM consumed
> 256			 4	  4 MB
> 			 8	  8 MB
> 			16	 16 MB
> 512			 4	  8 MB
> 			 8	 16 MB
> 			16	 32 MB
> 1024			 4	 16 MB
> 			 8	 32 MB
> 			16	 64 MB
> 2048			 4	 32 MB
> 			 8	 64 MB
> 			16	128 MB
> 4096			 4	 64 MB
> 			 8	128 MB
> 			16	256 MB
> 
> The powers that be, Linus in particular, are not fond of default
> settings that create a lot of kernel memory structures.  The default
> md-RAID5/6 stripe_cache_size yields 1MB consumed per member device.
> 
> With SSDs becoming mainstream, and becoming ever faster, at some point
> the md-RAID5/6 architecture will have to be redesigned because of the
> memory footprint required for performance.  Currently the required size
> of the stripe cache appears directly proportional to the aggregate write
> throughput of the RAID devices.  Thus the optimal value will vary
> greatly from one system to another depending on the throughput of the
> drives.
> 
> For example, I assisted a user with 5x Intel SSDs back in January and
> his system required 4096, or 80MB of RAM for stripe cache, to reach
> maximum write throughput of the devices.  This yielded 600MB/s or 60%
> greater throughput than 2048, or 40MB RAM for cache.  In his case 60MB
> more RAM than the default was well worth the increase as the machine was
> an iSCSI target server with 8GB RAM.
> 
> In the previous case with 5x rust RAID6 the 2048 value seemed optimal
> (though not yet verified), requiring 40MB less RAM than the 5x Intel
> SSDs.  For a 3 modern rust RAID5 the default of 256, or 3MB, is close to
> optimal but maybe a little low.  Consider that 256 has been the default
> for a very long time, and was selected back when average drive
> throughput was much much lower, as in 50MB/s or less, SSDs hadn't yet
> been invented, and system memories were much smaller.
> 
> Due to the massive difference in throughput between rust and SSD, any
> meaningful change in the default really requires new code to sniff out
> what type of devices constitute the array, if that's possible, and it
> probably isn't, and set a lowish default accordingly.  Again, SSDs
> didn't exist when md-RAID was coded, nor when this default was set, and
> this throws a big monkey wrench into these spokes.

Hi Stan,

nice analytical report, as usual...

My dumb suggestion would be to simply use udev to
set up the drives.
Everything, stripe_cache, read_ahead, stcerr, etc.
can be configured, I suppose, by udev rules.
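
A minimal sketch of such a rule (untested; the 2048 value and file name are
just examples), dropped into e.g. /etc/udev/rules.d/60-md-stripe-cache.rules:

SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="2048"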

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
  2013-12-21 11:18             ` Stan Hoeppner
  (?)
  (?)
@ 2013-12-22  1:41             ` Stan Hoeppner
  -1 siblings, 0 replies; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-22  1:41 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: linux-raid, xfs

On 12/21/2013 5:18 AM, Stan Hoeppner wrote:
...
> For example, I assisted a user with 5x Intel SSDs back in January and
> his system required 4096, or 80MB of RAM for stripe cache, to reach
> maximum write throughput of the devices.  This yielded 600MB/s or 60%
> greater throughput than 2048, or 40MB RAM for cache.  In his case 60MB
> more RAM than the default was well worth the increase as the machine was
> an iSCSI target server with 8GB RAM.

Correction here.  I said above that 80MB was 60MB greater than the
default for his 5 drives.  This should have said 75MB greater than the
default which is 1MB per member device, or 5MB for 5 drives.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-20 12:36         ` Stan Hoeppner
  2013-12-21  5:30           ` Dave Chinner
@ 2013-12-22  2:35           ` Kevin Richter
  2013-12-22  8:12             ` Stan Hoeppner
  1 sibling, 1 reply; 30+ messages in thread
From: Kevin Richter @ 2013-12-22  2:35 UTC (permalink / raw)
  To: stan, xfs

Okay, well, I've run your fio config... but with so many results and
abbreviations I currently feel a bit overwhelmed :)
So, please don't take it badly if I just paste the whole results.
I have tried with a stripe cache size of 1024, 2048 and 4096.

Btw, I also have /dev/md1, which is a 15GB unencrypted array using the
same 5 disks as my LUKS-md2 array, so nearly the same just without
this LUKS layer. If helpful, I can execute some fio tests on this
filesystem, too.


So long :)
Kevin


$ echo 1024 > /sys/block/md2/md/stripe_cache_size
> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193  iops] [eta 00m:06s]
> read: (groupid=0, jobs=8): err= 0: pid=12987
>   read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
>     slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
>     clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
>      lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
>     bw (KB/s) : min=    0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
>   cpu          : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=163840/0/0, short=0/0/0
> 
>      lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
>      lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
> write: (groupid=1, jobs=8): err= 0: pid=13202
>   write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
>     slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
>     clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
>      lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
>     bw (KB/s) : min=    0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
>   cpu          : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/163840/0, short=0/0/0
> 
>      lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
>      lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
> 
> Run status group 0 (all jobs):
>    READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
> 
> Disk stats (read/write):
>   dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>     md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
>     sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
>     sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
>     sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
>     sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
>     sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%

$ echo 2048 > /sys/block/md2/md/stripe_cache_size
> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175  iops] [eta 00m:06s]
> read: (groupid=0, jobs=8): err= 0: pid=6392
>   read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
>     slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
>     clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
>      lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
>     bw (KB/s) : min=    0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
>   cpu          : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=163840/0/0, short=0/0/0
> 
>      lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
>      lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
> write: (groupid=1, jobs=8): err= 0: pid=6663
>   write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
>     slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
>     clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
>      lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
>     bw (KB/s) : min=    0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
>   cpu          : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/163840/0, short=0/0/0
> 
>      lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
>      lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
> 
> Run status group 0 (all jobs):
>    READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
> 
> Disk stats (read/write):
>   dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>     md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
>     sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
>     sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
>     sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
>     sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
>     sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%

$ echo 4096 > /sys/block/md2/md/stripe_cache_size
> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182  iops] [eta 00m:00s]
> read: (groupid=0, jobs=8): err= 0: pid=11787
>   read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
>     slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
>     clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
>      lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
>     bw (KB/s) : min=    0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
>   cpu          : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=163840/0/0, short=0/0/0
> 
>      lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
>      lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
> write: (groupid=1, jobs=8): err= 0: pid=12060
>   write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
>     slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
>     clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
>      lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
>     bw (KB/s) : min=    0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
>   cpu          : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/163840/0, short=0/0/0
> 
>      lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
>      lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
> 
> Run status group 0 (all jobs):
>    READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
> 
> Disk stats (read/write):
>   dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>     md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
>     sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
>     sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
>     sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
>     sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
>     sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%





Am 20.12.2013 13:36, schrieb Stan Hoeppner:
> On 12/20/2013 4:26 AM, Kevin Richter wrote:
>> Thanks a lot for your huge replies!
> 
> You're welcome.
> 
>>> Oh, that's quite old. I'd suggest upgrading to a much more recent
>>> kernel as we've fixed lots of issues in this area since then.
>>
>> First I have switched to the newer kernel from Ubuntu Saucy:
>> $ uname -a
>> Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP
>>
>> Thus, it seems that the default scheduler has been changed to deadline.
>> I did not change anything. After a reboot the schedulers of all disks
>> are now deadline.
> 
> Good move Ubuntu.
> 
>>> Model # of the CPUs so I can look up the specs?
>> Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
> 
> Strong CPUs.
> 
>> I have prepared a folder with 60GB for the tests. This is nearly twice
>> of the available memory, so the process should be forced to actually
>> write the stuff to the disk - and not only hold in the memory.
>>
>>> $ echo 256 > /sys/block/md2/md/stripe_cache_size
>>> $ time cp -a /olddisk/testfolder /6tb/foo1/
>>> real    25m38.925s
>>> user    0m0.595s
>>> sys     1m23.182s
>>>
>>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>>> $ time cp -a /olddisk/testfolder /raid/foo2/
>>> real    7m32.824s
>>> user    0m0.438s
>>> sys     1m6.759s
>>>
>>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>>> $ time cp -a /olddisk/testfolder /raid/foo3/
>>> real    5m32.847s
>>> user    0m0.418s
>>> sys     1m5.671s
>>>
>>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>>> $ time cp -a /olddisk/testfolder /raid/foo4/
>>> real    5m54.554s
>>> user    0m0.437s
>>> sys     1m6.268s
>>
>> The difference is really amazing! So 2048 seems to be the best choice.
>> 60GB in 5,5minutes are 180MB/sek. That sounds a bit high, doesnt it?
>> The RAID only consist of 5 SATA disks with 7200rpm.
> 
> A lot of the source data is being cached between runs so these numbers
> aren't accurate.  The throughput of this copy operation will be limited
> by the speed of the single source disk, not the array.  To make the
> elapsed times of this copy test accurate you need to execute something
> like these commands after each run:
> 
> # sync
> # echo 3 > /proc/sys/vm/drop_caches
> 
> But this copy test will not inform you about the potential peak
> performance of your array.  That's why I suggested you test with FIO,
> the flexible IO tester.
> 
> # aptitude install fio
> # man fio
> 
> Sample job file suitable for your system:
> 
> [global]
> directory=/your/XFS/test/directory
> zero_buffers
> numjobs=8
> group_reporting
> blocksize=512k
> ioengine=libaio
> iodepth=16
> direct=1
> size=10g
> 
> [read]
> rw=read
> stonewall
> 
> [write]
> rw=write
> stonewall
> 
> This should give you a relatively accurate picture of the actual
> potential throughput of your array and filesystem.
> 
>> 'top' while copying with stripe size of 2048 (the source disk is ntfs):
>>> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
>>> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
>>> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
>>> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
>>> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
>>> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>>>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>>>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
>>> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
>>> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>>>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>>>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
>>> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
>>> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1
> 
> Hmm, what's kworker/3:1?  That's not a crypto thread eating 93% of a
> SandyBridge core at only ~180 MB/s throughput is it?
> 
>> And the best thing at all:
>> While all of these tests there are no warnings/panics in the syslog.
>>
>> With best regards,
>> Kevin
> 
> Even though XFS wasn't the cause of the problem I'm glad we were able to
> help you fix it nonetheless.  I'm really curious to see what kind of
> throughput you can achieve with FIO, and whether crypto is a bottleneck
> at the 250-350 MB/s your array should be capable of.  It would be great
> if you would play around a bit with FIO and post some numbers.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-22  2:35           ` Kevin Richter
@ 2013-12-22  8:12             ` Stan Hoeppner
  2013-12-22 14:10               ` Kevin Richter
  0 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-22  8:12 UTC (permalink / raw)
  To: xfs, xfs

On 12/21/2013 8:35 PM, Kevin Richter wrote:
> Okay, well, I've run your fio config... but with so many results and

I'm glad you changed to size=1G instead of size=10G; otherwise the test
would have read/written 80GB instead of 8GB and taken forever.  I often
forget the size parameter is per job (thread), not global.

> abbreviations I currently feel a bit overwhelmed :)
> So, please don't take it badly if I just paste the whole results.
> I have tried with a stripe cache size of 1024, 2048 and 4096.

The full output is good.  Going from 1024 to 2048 gained you ~17MB/s
throughput with this job file.  But you didn't run with the default 256
for comparison, which would be very useful to know.  That's ok, as you can
do so with the next set of test runs.  Given the huge latencies, I'm
thinking this config may be seeking the disks too much.  Modify your fio
job file as follows:

numjobs=4
blocksize=1536k
iodepth=4
size=1g

as this may yield slightly better results.  ~190MB/s read and 75MB/s
write is very low for three effective 7.2K RPM SATA spindles, especially
the write throughput.  Compare to a similar FIO job with 2 threads,
iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive, NCQ
disabled, w/deadline elevator:

Run status group 0 (all jobs):
   READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s
Run status group 1 (all jobs):
  WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s

88MB/s read, 108MB/s write.  Write is faster apparently due to the small
16MB drive write cache.  With your eight 2.4GHz SandyBridge cores and
md-RAID6 on much newer larger 7.2K drives w/larger caches, you should
have significantly greater write speed than this old single SATA drive.
 Currently you're showing 30MB/s slower write throughput, some 40%
slower than this single drive.  With aligned writes, thus no RMW, you
should be getting at least 2 spindles worth of throughput, or ~200MB/s.

So, run FIO again with the mods above, once with stripe_cache_size 256
and once with 2048.  Something seems to be seriously dragging down your
throughput.  I'm sure you'd like to know what, and how to fix it.
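
For reference, the earlier sample job file with those modifications folded
in looks like this (adjust the directory to your test mount point as
before):

[global]
directory=/your/XFS/test/directory
zero_buffers
numjobs=4
group_reporting
blocksize=1536k
ioengine=libaio
iodepth=4
direct=1
size=1g

[read]
rw=read
stonewall

[write]
rw=write
stonewall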

> Btw, I also have /dev/md1, which is a 15GB unencrypted array using the
> same 5 disks as my LUKS-md2 array, so nearly the same just without
> this LUKS layer. If helpful, I can execute some fio tests on this
> filesystem, too.

This is also a RAID6 array, without dm-crypt?  If so this will afford a
great comparison, and may be very informative.  Run FIO against this
filesystem (I assume it's also XFS) as well with stripe_cache_size 512
and 2048.  Four FIO tests total including the two on the big array.

It'll be interesting to see where this leads.  At this point the
candidates I see are a single dm-crypt thread pegging a core, dm-crypt
taking aligned writes and misaligning them causing all writes to be
RMWs, or AF drives with misaligned partitions causing every write to be
an internal RMW.  Or maybe something I've not considered.
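
A couple of quick checks for those candidates, as rough sketches (adjust
device and mapping names; I'm assuming the LUKS mapping is still '6tb'):

# mpstat -P ALL 2                        # during the FIO write phase: does one core peg?
# parted /dev/sdc unit s print           # if partitioned: start sectors divisible by 8?
# cryptsetup status 6tb | grep offset    # LUKS payload offset in 512-byte sectors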


> 
> So long :)
> Kevin
> 
> 
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193  iops] [eta 00m:06s]
>> read: (groupid=0, jobs=8): err= 0: pid=12987
>>   read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
>>     slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
>>     clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
>>      lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
>>     bw (KB/s) : min=    0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
>>   cpu          : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
>>      lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
>> write: (groupid=1, jobs=8): err= 0: pid=13202
>>   write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
>>     slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
>>     clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
>>      lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
>>     bw (KB/s) : min=    0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
>>   cpu          : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>
>>      lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
>>      lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
>>
>> Run status group 0 (all jobs):
>>    READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>     md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
>>     sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
>>     sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
>>     sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
>>     sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
>>     sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%
> 
> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175  iops] [eta 00m:06s]
>> read: (groupid=0, jobs=8): err= 0: pid=6392
>>   read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
>>     slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
>>     clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
>>      lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
>>     bw (KB/s) : min=    0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
>>   cpu          : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
>>      lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
>> write: (groupid=1, jobs=8): err= 0: pid=6663
>>   write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
>>     slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
>>     clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
>>      lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
>>     bw (KB/s) : min=    0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
>>   cpu          : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>
>>      lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
>>      lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
>>
>> Run status group 0 (all jobs):
>>    READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>     md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
>>     sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
>>     sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
>>     sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
>>     sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
>>     sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%
> 
> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182  iops] [eta 00m:00s]
>> read: (groupid=0, jobs=8): err= 0: pid=11787
>>   read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
>>     slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
>>     clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
>>      lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
>>     bw (KB/s) : min=    0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
>>   cpu          : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
>>      lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
>> write: (groupid=1, jobs=8): err= 0: pid=12060
>>   write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
>>     slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
>>     clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
>>      lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
>>     bw (KB/s) : min=    0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
>>   cpu          : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>
>>      lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
>>      lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
>>
>> Run status group 0 (all jobs):
>>    READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>     md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
>>     sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
>>     sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
>>     sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
>>     sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
>>     sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%
> 
> 
> 
> 
> 
> On 20.12.2013 13:36, Stan Hoeppner wrote:
>> On 12/20/2013 4:26 AM, Kevin Richter wrote:
>>> Thanks a lot for your huge replies!
>>
>> You're welcome.
>>
>>>> Oh, that's quite old. I'd suggest upgrading to a much more recent
>>>> kernel as we've fixed lots of issues in this area since then.
>>>
>>> First I have switched to the newer kernel from Ubuntu Saucy:
>>> $ uname -a
>>> Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP
>>>
>>> Thus, it seems that the default scheduler has been changed to deadline.
>>> I did not change anything. After a reboot the schedulers of all disks
>>> are now deadline.
>>
>> Good move Ubuntu.
>>
>>>> Model # of the CPUs so I can look up the specs?
>>> Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
>>
>> Strong CPUs.
>>
>>> I have prepared a folder with 60GB for the tests. This is nearly twice
>>> the available memory, so the process should be forced to actually
>>> write the stuff to the disk - and not just hold it in memory.
>>>
>>>> $ echo 256 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /6tb/foo1/
>>>> real    25m38.925s
>>>> user    0m0.595s
>>>> sys     1m23.182s
>>>>
>>>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /raid/foo2/
>>>> real    7m32.824s
>>>> user    0m0.438s
>>>> sys     1m6.759s
>>>>
>>>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /raid/foo3/
>>>> real    5m32.847s
>>>> user    0m0.418s
>>>> sys     1m5.671s
>>>>
>>>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /raid/foo4/
>>>> real    5m54.554s
>>>> user    0m0.437s
>>>> sys     1m6.268s
>>>
>>> The difference is really amazing! So 2048 seems to be the best choice.
>>> 60GB in 5.5 minutes is 180MB/s. That sounds a bit high, doesn't it?
>>> The RAID only consist of 5 SATA disks with 7200rpm.
>>
>> A lot of the source data is being cached between runs so these numbers
>> aren't accurate.  The throughput of this copy operation will be limited
>> by the speed of the single source disk, not the array.  To make the
>> elapsed times of this copy test accurate you need to execute something
>> like these commands after each run:
>>
>> # sync
>> # echo 3 > /proc/sys/vm/drop_caches
>>
>> But this copy test will not inform you about the potential peak
>> performance of your array.  That's why I suggested you test with FIO,
>> the flexible IO tester.
>>
>> # aptitude install fio
>> # man fio
>>
>> Sample job file suitable for your system:
>>
>> [global]
>> directory=/your/XFS/test/directory
>> zero_buffers
>> numjobs=8
>> group_reporting
>> blocksize=512k
>> ioengine=libaio
>> iodepth=16
>> direct=1
>> size=10g
>>
>> [read]
>> rw=read
>> stonewall
>>
>> [write]
>> rw=write
>> stonewall
>>
>> This should give you a relatively accurate picture of the actual
>> potential throughput of your array and filesystem.
>>
>>> 'top' while copying with stripe size of 2048 (the source disk is ntfs):
>>>> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
>>>> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
>>>> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
>>>> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
>>>> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
>>>> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>>>>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>>>>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
>>>> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
>>>> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>>>>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>>>>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
>>>> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
>>>> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1
>>
>> Hmm, what's kworker/3:1?  That's not a crypto thread eating 93% of a
>> SandyBridge core at only ~180 MB/s throughput is it?
>>
>>> And the best thing of all:
>>> During all of these tests there are no warnings/panics in the syslog.
>>>
>>> With best regards,
>>> Kevin
>>
>> Even though XFS wasn't the cause of the problem I'm glad we were able to
>> help you fix it nonetheless.  I'm really curious to see what kind of
>> throughput you can achieve with FIO, and whether crypto is a bottleneck
>> at the 250-350 MB/s your array should be capable of.  It would be great
>> if you would play around a bit with FIO and post some numbers.
>>
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-21  5:30           ` Dave Chinner
@ 2013-12-22  9:18             ` Stan Hoeppner
  2013-12-22 20:14               ` Dave Chinner
  2013-12-22 21:01               ` Michael L. Semon
  0 siblings, 2 replies; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-22  9:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs, xfs

On 12/20/2013 11:30 PM, Dave Chinner wrote:
> On Fri, Dec 20, 2013 at 06:36:33AM -0600, Stan Hoeppner wrote:
>> On 12/20/2013 4:26 AM, Kevin Richter wrote:
>>> 'top' while copying with stripe size of 2048 (the source disk is ntfs):
>>>> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
>>>> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
>>>> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
>>>> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
>>>> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
>>>> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>>>>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>>>>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
>>>> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
>>>> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>>>>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>>>>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
>>>> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
>>>> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1
>>
>> Hmm, what's kworker/3:1?  That's not a crypto thread eating 93% of a
>> SandyBridge core at only ~180 MB/s throughput is it?
> 
> Kworkers are anonymous kernel worker threads that do work that
> has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread
> on CPU 3 (3:0 is the first). The kworker is a thread pool that grows
> and shrinks according to demand.
> 
> As the naming suggests, kworker threads are per-CPU, and there can
> be hundreds of them per CPU if there is enough workqueue work that blocks
> during execution of the work (e.g. on locks, waiting for IO, etc).
> If there is little blocking occurring, there might only be a couple
> of kworker threads that do all the work, and hence you see them
> consuming huge amounts of CPU on behalf of other systems...
> 
> XFS uses workqueues for lots of things, so it's not unusual to see
> an IO or metadata heavy workload end up with huge numbers of
> kworker threads doing work:
> 
> <run parallel fsmark test>
> ....
> $ ps -ef |grep kworker |wc -l
> 91

For the record, again, I've never used dm-crypt.  I'm just trying to
work through the layers to identify the source of Kevin's problems.  So
please don't clobber me too hard for going OT, or trying to speak
(somewhat) intelligently about something I'm just learning about...

So, how does one identify who is doing work inside a worker thread?  In
this case we're trying to figure out if dm-crypt might be eating a CPU
core thus limiting its throughput.  However...
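
(Thinking out loud, a couple of crude ways I'd try to peek -- just a
sketch, I haven't verified this against a dm-crypt stack: take the PID of
the busy kworker from the top output above, e.g. 19524, and either dump
its kernel stack or profile it while the copy is running:

$ cat /proc/19524/stack                    # where the worker is blocked right now
$ perf record -g -p 19524 -- sleep 10      # sample it for ~10 seconds
$ perf report                              # see which kernel symbols dominate

If most samples land in crypto/dm-crypt symbols the worker is doing
encryption; if they land in raid456 or XFS symbols it's something else.)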

I just read a thread on dm-devel about dm-crypt performance dated late
March 2013.  It wasn't clear to me whether, as of kernel 3.11 (which
Kevin just updated to), dm-crypt uses unbound workqueues yet, whether it
dispatches work to multiple queues, or whether it still uses the
hand-built dispatcher.

What was clear is that as of March dm-crypt was not preserving IO
ordering.  If this is still the case, that would imply RMW with md
parity arrays, which would tend to explain Kevin's write throughput
being ~1/3rd of his read, and less than a single disk.

Mikulas Patocka posted 400-620 MB/s dm-crypt throughput with a RAMdisk.
 It was not stated which CPU/clock he was using.  Even if it were a 4GHz
core, Kevin's 2.4GHz SandyBridge should be capable of far more than ~80
MB/s of encryption.

So if dm-crypt is contributing to Kevin's low throughput I'd think it
would be due to the IO ordering causing RMW, not due to CPU starvation.
 Is there an easy way for a mere mortal to peek under the hood and see
if dm-crypt is causing RMW?
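
(The only low-tech check I can think of -- again just a sketch, not
something I've verified here: run the write half of the fio job on its
own and watch the md member disks with iostat.  Aligned full-stripe
writes should generate almost no reads on the members, so a steady
stream of reads on sd[c-g] during a write-only workload would point at
RMW:

$ iostat -x sdc sdd sde sdf sdg md2 5
# during the write-only job the r/s column on the member disks should be
# near zero for full-stripe writes; sustained reads suggest read-modify-write

But maybe there's a proper way to see this that I'm not aware of.)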

Christoph, you were involved in that thread WRT IO ordering.  Can you
shed more light on this, and on whether it may be a factor here?

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-22  8:12             ` Stan Hoeppner
@ 2013-12-22 14:10               ` Kevin Richter
  2013-12-22 17:29                 ` Stan Hoeppner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Richter @ 2013-12-22 14:10 UTC (permalink / raw)
  To: stan, xfs

> I'm glad you changed to size=1G instead of size=10G
No no, I didn't change that. 160GB has been written on these disks: 80
for the read part, 80 for the write part. Each test took around 30 minutes.

> This is also a RAID6 array, without dm-crypt?
Yeah, right. The same 5 disks in RAID6 mode, just using a different
partition on each disk. And yes, of course, XFS.

So, with your new config I have made 7 fio runs:
1) 15GB RAID6 array, stripe 256
2) 15GB RAID6 array, stripe 1024
3) 15GB RAID6 array, stripe 2048
4) 15GB RAID6 array, stripe 4096
5) 3TB LUKS volume (no raid, just one single disk)
6) 6TB RAID6+LUKS array, stripe 256
7) 6TB RAID6+LUKS array, stripe 2048

In the 15GB array a stripe size of 4096 seems to get slightly more
throughput than 2048.

See below for the results :-)

Perhaps I should give you some more information about the server
hardware, which might (but hopefully does not) explain the low performance:
The server is that one:
http://www.supermicro.com/products/system/2u/6027/ssg-6027r-e1r12l.cfm
with a 12-port SAS expander, which is that one:
http://www.thomas-krenn.com/de/wiki/BPN-SAS-826EL1_SAS_Expander_Backplane
But the backplane should have a throughput of 3Gb/s.

However... I am very curious about your comments :-)

Bye,
Kevin


#################################
1) 15GB RAID6 volume, stripe 256:
> read: (groupid=0, jobs=4): err= 0: pid=3959
>   read : io=4092.0MB, bw=217900KB/s, iops=141 , runt= 19230msec
>     slat (usec): min=190 , max=1325 , avg=231.32, stdev=77.13
>     clat (msec): min=7 , max=326 , avg=111.45, stdev=67.48
>      lat (msec): min=7 , max=326 , avg=111.69, stdev=67.48
>     bw (KB/s) : min=29425, max=109280, per=24.98%, avg=54423.27, stdev=15887.11
>   cpu          : usr=0.03%, sys=0.90%, ctx=2614, majf=0, minf=6233
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 10=0.66%, 20=4.84%, 50=16.35%, 100=25.70%, 250=47.91%
>      lat (msec): 500=4.55%
> write: (groupid=1, jobs=4): err= 0: pid=3964
>   write: io=4092.0MB, bw=54044KB/s, iops=35 , runt= 77533msec
>     slat (msec): min=9 , max=339 , avg=113.48, stdev=36.86
>     clat (msec): min=2 , max=744 , avg=339.51, stdev=70.02
>      lat (msec): min=23 , max=927 , avg=453.00, stdev=82.81
>     bw (KB/s) : min= 5475, max=20756, per=24.92%, avg=13467.27, stdev=2107.66
>   cpu          : usr=0.58%, sys=6.26%, ctx=343060, majf=0, minf=550
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 4=0.04%, 10=0.04%, 50=0.11%, 100=0.26%, 250=5.94%
>      lat (msec): 500=91.50%, 750=2.13%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=217899KB/s, minb=223129KB/s, maxb=223129KB/s, mint=19230msec, maxt=19230msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=54044KB/s, minb=55341KB/s, maxb=55341KB/s, mint=77533msec, maxt=77533msec
> 
> Disk stats (read/write):
>   md1: ios=10912/10904, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=22836/29872, aggrmerge=364645/574313, aggrticks=216408/101340, aggrin_queue=317789, aggrutil=69.36%
>     sdc: ios=23542/29233, merge=377270/563331, ticks=217520/152300, in_queue=369860, util=67.99%
>     sdd: ios=23462/28930, merge=378915/557686, ticks=214104/142224, in_queue=356428, util=67.81%
>     sde: ios=22264/29920, merge=355831/576114, ticks=149184/99724, in_queue=248924, util=58.22%
>     sdf: ios=22240/30724, merge=352471/588412, ticks=126740/52916, in_queue=179696, util=55.44%
>     sdg: ios=22676/30557, merge=358741/586025, ticks=374496/59540, in_queue=434040, util=69.36%

#################################
2) 15GB RAID6 array, stripe 1024:
> read: (groupid=0, jobs=4): err= 0: pid=4334
>   read : io=4092.0MB, bw=237702KB/s, iops=154 , runt= 17628msec
>     slat (usec): min=196 , max=1469 , avg=232.84, stdev=88.56
>     clat (msec): min=6 , max=391 , avg=101.62, stdev=56.61
>      lat (msec): min=7 , max=391 , avg=101.85, stdev=56.61
>     bw (KB/s) : min=26482, max=107305, per=24.94%, avg=59289.94, stdev=14153.87
>   cpu          : usr=0.04%, sys=0.98%, ctx=2581, majf=0, minf=6241
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 10=0.22%, 20=3.08%, 50=16.83%, 100=31.01%, 250=47.10%
>      lat (msec): 500=1.76%
> write: (groupid=1, jobs=4): err= 0: pid=4340
>   write: io=4092.0MB, bw=120759KB/s, iops=78 , runt= 34699msec
>     slat (msec): min=1 , max=156 , avg=47.70, stdev=27.94
>     clat (msec): min=15 , max=520 , avg=152.37, stdev=53.34
>      lat (msec): min=28 , max=564 , avg=200.07, stdev=57.63
>     bw (KB/s) : min=    0, max=62270, per=25.04%, avg=30239.54, stdev=5435.51
>   cpu          : usr=1.34%, sys=5.68%, ctx=134184, majf=0, minf=94
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 20=0.04%, 50=1.10%, 100=12.02%, 250=82.18%, 500=4.62%
>      lat (msec): 750=0.04%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=237701KB/s, minb=243406KB/s, maxb=243406KB/s, mint=17628msec, maxt=17628msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=120758KB/s, minb=123656KB/s, maxb=123656KB/s, mint=34699msec, maxt=34699msec
> 
> Disk stats (read/write):
>   md1: ios=10912/10901, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4062/6768, aggrmerge=109645/369712, aggrticks=157952/111924, aggrin_queue=270487, aggrutil=89.91%
>     sdc: ios=4070/6893, merge=109432/378682, ticks=173784/179392, in_queue=353780, util=89.88%
>     sdd: ios=4041/6961, merge=110017/380798, ticks=164960/160236, in_queue=325820, util=89.91%
>     sde: ios=4039/6727, merge=110658/366163, ticks=169804/166052, in_queue=336484, util=88.98%
>     sdf: ios=4080/6591, merge=108922/359343, ticks=39460/24636, in_queue=64820, util=52.33%
>     sdg: ios=4081/6668, merge=109197/363574, ticks=241752/29308, in_queue=271532, util=73.63%

#################################
3) 15GB RAID6 array, stripe 2048:
> read: (groupid=0, jobs=4): err= 0: pid=4102
>   read : io=4092.0MB, bw=207827KB/s, iops=135 , runt= 20162msec
>     slat (usec): min=122 , max=3904 , avg=230.75, stdev=99.04
>     clat (msec): min=5 , max=363 , avg=117.03, stdev=70.26
>      lat (msec): min=5 , max=363 , avg=117.27, stdev=70.25
>     bw (KB/s) : min=26533, max=104239, per=25.00%, avg=51947.24, stdev=13121.27
>   cpu          : usr=0.05%, sys=0.83%, ctx=2640, majf=0, minf=6241
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 10=0.55%, 20=4.99%, 50=14.96%, 100=21.33%, 250=53.01%
>      lat (msec): 500=5.17%
> write: (groupid=1, jobs=4): err= 0: pid=4108
>   write: io=4092.0MB, bw=145833KB/s, iops=94 , runt= 28733msec
>     slat (usec): min=768 , max=94184 , avg=1838.20, stdev=4140.24
>     clat (msec): min=15 , max=1155 , avg=145.02, stdev=128.70
>      lat (msec): min=16 , max=1156 , avg=146.86, stdev=128.77
>     bw (KB/s) : min= 6144, max=104448, per=29.22%, avg=42617.70, stdev=17118.84
>   cpu          : usr=1.63%, sys=2.52%, ctx=4086, majf=0, minf=94
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 20=0.37%, 50=9.02%, 100=38.42%, 250=39.22%, 500=10.12%
>      lat (msec): 750=2.09%, 1000=0.59%, 2000=0.18%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s, mint=20162msec, maxt=20162msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s, mint=28733msec, maxt=28733msec
> 
> Disk stats (read/write):
>   md1: ios=10912/10873, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3018/4990, aggrmerge=79606/371231, aggrticks=152741/83319, aggrin_queue=236284, aggrutil=92.45%
>     sdc: ios=2988/5046, merge=79033/376737, ticks=129600/103552, in_queue=233412, util=84.14%
>     sdd: ios=3010/5140, merge=77101/390954, ticks=119032/103236, in_queue=222536, util=85.87%
>     sde: ios=3039/5009, merge=79935/374563, ticks=196200/159792, in_queue=356236, util=92.45%
>     sdf: ios=3021/4883, merge=80483/356643, ticks=29976/24956, in_queue=55100, util=49.53%
>     sdg: ios=3033/4872, merge=81481/357261, ticks=288900/25060, in_queue=314136, util=76.17%

#################################
4) 15GB RAID6 array, stripe 4096:
> read: (groupid=0, jobs=4): err= 0: pid=4411
>   read : io=4092.0MB, bw=214938KB/s, iops=139 , runt= 19495msec
>     slat (usec): min=192 , max=1468 , avg=226.97, stdev=87.18
>     clat (msec): min=4 , max=339 , avg=112.51, stdev=63.05
>      lat (msec): min=4 , max=339 , avg=112.73, stdev=63.05
>     bw (KB/s) : min=    0, max=96000, per=24.87%, avg=53457.49, stdev=13728.88
>   cpu          : usr=0.02%, sys=0.89%, ctx=2591, majf=0, minf=6241
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 10=0.15%, 20=4.62%, 50=13.56%, 100=26.43%, 250=52.20%
>      lat (msec): 500=3.04%
> write: (groupid=1, jobs=4): err= 0: pid=4419
>   write: io=4092.0MB, bw=146526KB/s, iops=95 , runt= 28597msec
>     slat (usec): min=822 , max=36057 , avg=1585.60, stdev=1174.08
>     clat (msec): min=14 , max=1132 , avg=142.49, stdev=126.05
>      lat (msec): min=15 , max=1134 , avg=144.08, stdev=126.08
>     bw (KB/s) : min=    0, max=83836, per=28.88%, avg=42315.69, stdev=18036.84
>   cpu          : usr=1.42%, sys=2.85%, ctx=2937, majf=0, minf=94
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 20=0.18%, 50=10.08%, 100=37.61%, 250=40.62%, 500=8.50%
>      lat (msec): 750=2.46%, 1000=0.44%, 2000=0.11%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=214937KB/s, minb=220096KB/s, maxb=220096KB/s, mint=19495msec, maxt=19495msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=146526KB/s, minb=150042KB/s, maxb=150042KB/s, mint=28597msec, maxt=28597msec
> 
> Disk stats (read/write):
>   md1: ios=10912/10895, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3013/5053, aggrmerge=78083/371841, aggrticks=151707/88554, aggrin_queue=240554, aggrutil=92.72%
>     sdc: ios=2981/5124, merge=77153/379556, ticks=142800/116248, in_queue=259356, util=87.13%
>     sdd: ios=2990/5182, merge=75443/390196, ticks=128184/107752, in_queue=236264, util=86.87%
>     sde: ios=3034/5057, merge=77882/372755, ticks=185048/159352, in_queue=344636, util=92.72%
>     sdf: ios=3035/4941, merge=79423/356692, ticks=29196/28100, in_queue=57520, util=50.56%
>     sdg: ios=3026/4963, merge=80518/360009, ticks=273308/31320, in_queue=304996, util=76.24%

###################################################
5) 3TB LUKS volume (no raid, just one single disk):
> read: (groupid=0, jobs=4): err= 0: pid=31359
>   read : io=4092.0MB, bw=125032KB/s, iops=81 , runt= 33513msec
>     slat (usec): min=86 , max=1404 , avg=125.10, stdev=97.21
>     clat (msec): min=26 , max=323 , avg=195.52, stdev=39.04
>      lat (msec): min=26 , max=323 , avg=195.65, stdev=39.03
>     bw (KB/s) : min=21982, max=41890, per=25.12%, avg=31405.62, stdev=3662.77
>   cpu          : usr=0.02%, sys=0.31%, ctx=2738, majf=0, minf=6233
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 50=0.51%, 100=2.38%, 250=93.07%, 500=4.03%
> write: (groupid=1, jobs=4): err= 0: pid=31364
>   write: io=4092.0MB, bw=158193KB/s, iops=102 , runt= 26488msec
>     slat (usec): min=181 , max=673 , avg=442.78, stdev=67.26
>     clat (msec): min=21 , max=555 , avg=154.33, stdev=45.35
>      lat (msec): min=21 , max=555 , avg=154.77, stdev=45.35
>     bw (KB/s) : min=19097, max=45988, per=25.04%, avg=39617.62, stdev=5714.79
>   cpu          : usr=1.00%, sys=0.25%, ctx=2737, majf=0, minf=86
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 50=0.44%, 100=1.54%, 250=94.61%, 500=3.15%, 750=0.26%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s, mint=33513msec, maxt=33513msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s, mint=26488msec, maxt=26488msec
> 
> Disk stats (read/write):
>   dm-1: ios=8184/8196, merge=0/0, ticks=1521200/1235264, in_queue=2759104, util=99.58%, aggrios=8184/8225, aggrmerge=0/31, aggrticks=1421660/1137984, aggrin_queue=2559612, aggrutil=99.49%
>     sda: ios=8184/8225, merge=0/31, ticks=1421660/1137984, in_queue=2559612, util=99.49%

##############################
6) 6TB LUKS array, stripe 256:
> read: (groupid=0, jobs=4): err= 0: pid=31121
>   read : io=4092.0MB, bw=137700KB/s, iops=89 , runt= 30430msec
>     slat (usec): min=123 , max=1353 , avg=157.35, stdev=91.53
>     clat (msec): min=15 , max=545 , avg=148.80, stdev=61.30
>      lat (msec): min=15 , max=545 , avg=148.96, stdev=61.29
>     bw (KB/s) : min=    0, max=129024, per=29.75%, avg=40959.72, stdev=21470.47
>   cpu          : usr=0.01%, sys=0.51%, ctx=2797, majf=0, minf=6241
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 20=0.29%, 50=9.90%, 100=12.76%, 250=75.07%, 500=1.58%
>      lat (msec): 750=0.40%
> write: (groupid=1, jobs=4): err= 0: pid=31131
>   write: io=4092.0MB, bw=41225KB/s, iops=26 , runt=101643msec
>     slat (usec): min=197 , max=1309 , avg=557.69, stdev=124.14
>     clat (msec): min=42 , max=2648 , avg=586.07, stdev=393.18
>      lat (msec): min=43 , max=2649 , avg=586.63, stdev=393.19
>     bw (KB/s) : min=    0, max=35583, per=26.36%, avg=10867.00, stdev=5232.75
>   cpu          : usr=0.33%, sys=0.09%, ctx=2699, majf=0, minf=94
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 50=0.07%, 100=1.39%, 250=18.55%, 500=31.82%, 750=20.49%
>      lat (msec): 1000=12.57%, 2000=14.66%, >=2000=0.44%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=137699KB/s, minb=141004KB/s, maxb=141004KB/s, mint=30430msec, maxt=30430msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=41224KB/s, minb=42214KB/s, maxb=42214KB/s, mint=101643msec, maxt=101643msec
> 
> Disk stats (read/write):
>   dm-0: ios=10912/10906, merge=0/0, ticks=1385524/5596828, in_queue=6983364, util=99.89%, aggrios=10912/10940, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>     md2: ios=10912/10940, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20710/27494, aggrmerge=358521/573067, aggrticks=208334/119186, aggrin_queue=327551, aggrutil=61.95%
>     sdc: ios=21005/27276, merge=366645/567533, ticks=224772/179916, in_queue=404728, util=61.95%
>     sdd: ios=21154/26926, merge=369215/564213, ticks=212876/160624, in_queue=373552, util=60.42%
>     sde: ios=20460/27449, merge=354440/571659, ticks=127008/102188, in_queue=229220, util=48.28%
>     sdf: ios=20464/27908, merge=350591/581422, ticks=241608/78180, in_queue=319784, util=55.24%
>     sdg: ios=20470/27915, merge=351718/580509, ticks=235408/75024, in_queue=310472, util=55.91%

###############################
7) 6TB LUKS array, stripe 2048:
> read: (groupid=0, jobs=4): err= 0: pid=30910
>   read : io=4092.0MB, bw=237500KB/s, iops=154 , runt= 17643msec
>     slat (usec): min=134 , max=1511 , avg=165.25, stdev=96.45
>     clat (msec): min=16 , max=859 , avg=101.07, stdev=61.40
>      lat (msec): min=17 , max=859 , avg=101.23, stdev=61.40
>     bw (KB/s) : min=32807, max=97523, per=25.40%, avg=60325.02, stdev=15396.63
>   cpu          : usr=0.02%, sys=0.76%, ctx=2709, majf=0, minf=6233
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=2728/0/0, short=0/0/0
> 
>      lat (msec): 20=0.07%, 50=13.38%, 100=48.90%, 250=34.38%, 500=3.08%
>      lat (msec): 750=0.15%, 1000=0.04%
> write: (groupid=1, jobs=4): err= 0: pid=30915
>   write: io=4092.0MB, bw=90652KB/s, iops=59 , runt= 46223msec
>     slat (usec): min=173 , max=983 , avg=599.60, stdev=141.06
>     clat (msec): min=34 , max=1185 , avg=258.48, stdev=135.97
>      lat (msec): min=34 , max=1185 , avg=259.08, stdev=135.98
>     bw (KB/s) : min=    0, max=88722, per=25.95%, avg=23524.06, stdev=9036.31
>   cpu          : usr=0.77%, sys=0.22%, ctx=2576, majf=0, minf=86
>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/2728/0, short=0/0/0
> 
>      lat (msec): 50=0.48%, 100=7.00%, 250=49.01%, 500=37.72%, 750=5.10%
>      lat (msec): 1000=0.59%, 2000=0.11%
> 
> Run status group 0 (all jobs):
>    READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s, mint=17643msec, maxt=17643msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s, mint=46223msec, maxt=46223msec
> 
> Disk stats (read/write):
>   dm-0: ios=10912/10875, merge=0/0, ticks=544880/2481824, in_queue=3026996, util=99.57%, aggrios=10912/10930, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>     md2: ios=10912/10930, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5058/6798, aggrmerge=153871/404343, aggrticks=195870/113487, aggrin_queue=309447, aggrutil=92.15%
>     sdc: ios=5024/7100, merge=152571/421782, ticks=332612/204936, in_queue=537660, util=92.15%
>     sdd: ios=4961/7109, merge=150591/422166, ticks=257308/161200, in_queue=418608, util=88.11%
>     sde: ios=5019/6697, merge=154781/395583, ticks=226120/152068, in_queue=378296, util=85.49%
>     sdf: ios=5118/6463, merge=156068/385905, ticks=65716/23932, in_queue=89724, util=55.85%
>     sdg: ios=5172/6624, merge=155347/396280, ticks=97596/25300, in_queue=122948, util=60.00%





On 22.12.2013 09:12, Stan Hoeppner wrote:
> On 12/21/2013 8:35 PM, Kevin Richter wrote:
>> Okay, well, I've run your fio config... but with so many results and
> 
> I'm glad you changed to size=1G instead of size=10G lest the test would
> have read/written 80GB instead of 8GB, and taken forever.  I often
> forget the size parameter is per job (thread) not global.
> 
>> abbreviations I currently feel a bit overwhelmed :)
>> So please don't take it badly if I just paste the whole results.
>> I have tried with a stripe cache size of 1024, 2048 and 4096.
> 
> The full output is good.  Going from 1024 to 2048 gained you ~17MB/s
> throughput with this job file.  But you didn't run with the default 256
> for comparison, which is very useful to know.  That's ok, as you can do
> so with the next set of test runs.  Given the huge latencies I'm
> thinking this config may be seeking the disks too much.  Modify your fio
> job file as follows:
> 
> numjobs=4
> blocksize=1536k
> iodepth=4
> size=1g
> 
> as this may yield slightly better results.  ~190MB/s read and 75MB/s
> write is very low for three effective 7.2K RPM SATA spindles, especially
> the write throughput.  Compare to a similar FIO job with 2 threads,
> iodepth 16, direct IO, run against a single 500GB 7.2K WD SATA drive, NCQ
> disabled, w/deadline elevator:
> 
> Run status group 0 (all jobs):
>    READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s
> Run status group 1 (all jobs):
>   WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s
> 
> 88MB/s read, 108MB/s write.  Write is faster apparently due to the small
> 16MB drive write cache.  With your eight 2.4GHz SandyBridge cores and
> md-RAID6 on much newer larger 7.2K drives w/larger caches, you should
> have significantly greater write speed than this old single SATA drive.
>  Currently you're showing 30MB/s slower write throughput, some 40%
> slower than this single drive.  With aligned writes, thus no RMW, you
> should be getting at least 2 spindles worth of throughput, or ~200MB/s.
> 
> So, run FIO again with the mods above, once with stripe_cache_size 256
> and once with 2048.  Something seems to be seriously dragging down your
> throughput.  I'm sure you'd like to know what, and how to fix it.
> 
>> Btw, I also have /dev/md1, which is a 15GB unencrypted array using the
>> same 5 disks as my LUKS-md2 array, so nearly the same just without
>> this LUKS layer. If helpful, I can execute some fio tests on this
>> filesystem, too.
> 
> This is also a RAID6 array, without dm-crypt?  If so this will afford a
> great comparison, and may be very informative.  Run FIO against this
> filesystem (I assume it's also XFS) as well with stripe_cache_size 512
> and 2048.  Four FIO tests total including the two on the big array.
> 
> It'll be interesting to see where this leads.  At this point the
> candidates I see are a single dm-crypt thread peaking a core, dm-crypt
> taking aligned writes and misaligning them causing all writes to be
> RMWs, or AF drives with misaligned partitions causing every write to be
> an internal RMW.  Or maybe something I've not considered.
> 
> 
>>
>> So long :)
>> Kevin
>>
>>
>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>>> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193  iops] [eta 00m:06s]
>>> read: (groupid=0, jobs=8): err= 0: pid=12987
>>>   read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
>>>     slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
>>>     clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
>>>      lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
>>>     bw (KB/s) : min=    0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
>>>   cpu          : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>>
>>>      lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
>>>      lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
>>> write: (groupid=1, jobs=8): err= 0: pid=13202
>>>   write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
>>>     slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
>>>     clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
>>>      lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
>>>     bw (KB/s) : min=    0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
>>>   cpu          : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>>
>>>      lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
>>>      lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
>>>
>>> Run status group 0 (all jobs):
>>>    READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
>>>
>>> Run status group 1 (all jobs):
>>>   WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
>>>
>>> Disk stats (read/write):
>>>   dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>     md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
>>>     sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
>>>     sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
>>>     sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
>>>     sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
>>>     sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%
>>
>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>>> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175  iops] [eta 00m:06s]
>>> read: (groupid=0, jobs=8): err= 0: pid=6392
>>>   read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
>>>     slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
>>>     clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
>>>      lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
>>>     bw (KB/s) : min=    0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
>>>   cpu          : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>>
>>>      lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
>>>      lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
>>> write: (groupid=1, jobs=8): err= 0: pid=6663
>>>   write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
>>>     slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
>>>     clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
>>>      lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
>>>     bw (KB/s) : min=    0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
>>>   cpu          : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>>
>>>      lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
>>>      lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
>>>
>>> Run status group 0 (all jobs):
>>>    READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
>>>
>>> Run status group 1 (all jobs):
>>>   WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
>>>
>>> Disk stats (read/write):
>>>   dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>     md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
>>>     sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
>>>     sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
>>>     sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
>>>     sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
>>>     sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%
>>
>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>>> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182  iops] [eta 00m:00s]
>>> read: (groupid=0, jobs=8): err= 0: pid=11787
>>>   read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
>>>     slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
>>>     clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
>>>      lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
>>>     bw (KB/s) : min=    0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
>>>   cpu          : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>>
>>>      lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
>>>      lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
>>> write: (groupid=1, jobs=8): err= 0: pid=12060
>>>   write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
>>>     slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
>>>     clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
>>>      lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
>>>     bw (KB/s) : min=    0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
>>>   cpu          : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>>
>>>      lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
>>>      lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
>>>
>>> Run status group 0 (all jobs):
>>>    READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
>>>
>>> Run status group 1 (all jobs):
>>>   WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
>>>
>>> Disk stats (read/write):
>>>   dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>     md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
>>>     sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
>>>     sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
>>>     sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
>>>     sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
>>>     sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-22 14:10               ` Kevin Richter
@ 2013-12-22 17:29                 ` Stan Hoeppner
  0 siblings, 0 replies; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-22 17:29 UTC (permalink / raw)
  To: xfs, xfs

On 12/22/2013 8:10 AM, Kevin Richter wrote:
>> I'm glad you changed to size=1G instead of size=10G
> No no, I didn't change that. 160GB has been written on these disks: 80
> for the read part, 80 for the write part. Each test took around 30 minutes.

I read this "read : io=81920MB" without the zero hence my incorrect
assumption.  Again, my apologies for the length test runs.  Use "1g"
from now on as that should be plenty to get good data, and run quickly.

>> This is also a RAID6 array, without dm-crypt?
> Yeah, right. The same 5 disks in RAID6 mode, just using a different
> partition on each disk. And yes, of course, XFS.
> 
> So, with your new config I have made 7 fio runs:
> 1) 15GB RAID6 array, stripe 256
> 2) 15GB RAID6 array, stripe 1024
> 3) 15GB RAID6 array, stripe 2048
> 4) 15GB RAID6 array, stripe 4096
> 5) 3TB LUKS volume (no raid, just one single disk)
> 6) 6TB RAID6+LUKS array, stripe 256
> 7) 6TB RAID6+LUKS array, stripe 2048
> 
> In the 15GB array a stripe size of 4096 seems to get slightly more
> throughput than 2048.
> 
> See below for the results :-)
> 
> Perhaps I should give you some more information about the server
> hardware, which might (but hopefully does not) explain the low performance:
> The server is that one:
> http://www.supermicro.com/products/system/2u/6027/ssg-6027r-e1r12l.cfm
> with a 12-port SAS expander, which is that one:
> http://www.thomas-krenn.com/de/wiki/BPN-SAS-826EL1_SAS_Expander_Backplane
> But the backplane should have a throughput of 3Gb/s.

The hardware is fine.  You have ~1.2GB/s <-> the backplane (4 SAS lanes
at 3Gb/s, ~300MB/s each), far more than five 7.2K drives can stream.

> However... I am very curious about your comments :-)

3) 15GB RAID6 array, stripe 2048:
READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s
WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s

~50MB/s per drive * 5 drives = ~250 MB/s
This may simply be the md-RAID6 max write you'll get from these drives.
 But 145MB/s aggregate write throughput is still decidedly better than
90MB/s, some 62% faster than the encrypted array.


5) 3TB LUKS volume (no raid, just one single disk):
READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s
WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s

~160MB/s per drive, ~160MB/s dm-crypt throughput


7) 6TB LUKS array, stripe 2048:
READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s
WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s

~30MB/s per drive * 5 drives = ~150MB/s dm-crypt throughput.

Write throughput is ~100MB/s lower on your RAID6 w/dm-crypt than
without.  Read throughput however is ~30MB/s higher, though these
differences may not be as great with multiple runs and normalized data.
 Even so, since the read throughput is roughly the same at 71-81MB/s per
drive, both with/out encryption, I don't believe the difference in write
speed is due to platter placement of the filesystems and test files.

So at this point I'd guess we need to look at dm-crypt, which may be
peaking a core or unaligning your IOs causing RMW cycles.  And look at
your partitions under the encrypted RAID6 to see if they're aligned.

We can identify the latter easily if you post 'fdisk -l' output for each
of the 5 drives.  Whether partitions are misaligned or dm-crypt is the
problem, the fix in either case is very destructive and time consuming,
requiring a full backup/restore after redoing the stack.
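
To get a head start, something like this should show the relevant
numbers (just a sketch -- adjust device names as needed; I'm assuming
the LUKS container sits directly on md2, as your disk stats suggest):

$ for d in sdc sdd sde sdf sdg; do fdisk -lu /dev/$d; done
# with AF (4KiB-sector) drives the partition start sectors should be
# divisible by 8, ideally by 2048 (1MiB alignment)

$ mdadm --detail /dev/md2 | grep -i chunk
# chunk size, for reference

$ cryptsetup luksDump /dev/md2 | grep -i offset
# payload offset is in 512-byte sectors; for aligned writes it should be
# a multiple of the chunk size (ideally of the full data stripe width)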

--
Stan



> Bye,
> Kevin
> 
> 
> #################################
> 1) 15GB RAID6 volume, stripe 256:
>> read: (groupid=0, jobs=4): err= 0: pid=3959
>>   read : io=4092.0MB, bw=217900KB/s, iops=141 , runt= 19230msec
>>     slat (usec): min=190 , max=1325 , avg=231.32, stdev=77.13
>>     clat (msec): min=7 , max=326 , avg=111.45, stdev=67.48
>>      lat (msec): min=7 , max=326 , avg=111.69, stdev=67.48
>>     bw (KB/s) : min=29425, max=109280, per=24.98%, avg=54423.27, stdev=15887.11
>>   cpu          : usr=0.03%, sys=0.90%, ctx=2614, majf=0, minf=6233
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.66%, 20=4.84%, 50=16.35%, 100=25.70%, 250=47.91%
>>      lat (msec): 500=4.55%
>> write: (groupid=1, jobs=4): err= 0: pid=3964
>>   write: io=4092.0MB, bw=54044KB/s, iops=35 , runt= 77533msec
>>     slat (msec): min=9 , max=339 , avg=113.48, stdev=36.86
>>     clat (msec): min=2 , max=744 , avg=339.51, stdev=70.02
>>      lat (msec): min=23 , max=927 , avg=453.00, stdev=82.81
>>     bw (KB/s) : min= 5475, max=20756, per=24.92%, avg=13467.27, stdev=2107.66
>>   cpu          : usr=0.58%, sys=6.26%, ctx=343060, majf=0, minf=550
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 4=0.04%, 10=0.04%, 50=0.11%, 100=0.26%, 250=5.94%
>>      lat (msec): 500=91.50%, 750=2.13%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=217899KB/s, minb=223129KB/s, maxb=223129KB/s, mint=19230msec, maxt=19230msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=54044KB/s, minb=55341KB/s, maxb=55341KB/s, mint=77533msec, maxt=77533msec
>>
>> Disk stats (read/write):
>>   md1: ios=10912/10904, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=22836/29872, aggrmerge=364645/574313, aggrticks=216408/101340, aggrin_queue=317789, aggrutil=69.36%
>>     sdc: ios=23542/29233, merge=377270/563331, ticks=217520/152300, in_queue=369860, util=67.99%
>>     sdd: ios=23462/28930, merge=378915/557686, ticks=214104/142224, in_queue=356428, util=67.81%
>>     sde: ios=22264/29920, merge=355831/576114, ticks=149184/99724, in_queue=248924, util=58.22%
>>     sdf: ios=22240/30724, merge=352471/588412, ticks=126740/52916, in_queue=179696, util=55.44%
>>     sdg: ios=22676/30557, merge=358741/586025, ticks=374496/59540, in_queue=434040, util=69.36%
> 
> #################################
> 2) 15GB RAID6 array, stripe 1024:
>> read: (groupid=0, jobs=4): err= 0: pid=4334
>>   read : io=4092.0MB, bw=237702KB/s, iops=154 , runt= 17628msec
>>     slat (usec): min=196 , max=1469 , avg=232.84, stdev=88.56
>>     clat (msec): min=6 , max=391 , avg=101.62, stdev=56.61
>>      lat (msec): min=7 , max=391 , avg=101.85, stdev=56.61
>>     bw (KB/s) : min=26482, max=107305, per=24.94%, avg=59289.94, stdev=14153.87
>>   cpu          : usr=0.04%, sys=0.98%, ctx=2581, majf=0, minf=6241
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.22%, 20=3.08%, 50=16.83%, 100=31.01%, 250=47.10%
>>      lat (msec): 500=1.76%
>> write: (groupid=1, jobs=4): err= 0: pid=4340
>>   write: io=4092.0MB, bw=120759KB/s, iops=78 , runt= 34699msec
>>     slat (msec): min=1 , max=156 , avg=47.70, stdev=27.94
>>     clat (msec): min=15 , max=520 , avg=152.37, stdev=53.34
>>      lat (msec): min=28 , max=564 , avg=200.07, stdev=57.63
>>     bw (KB/s) : min=    0, max=62270, per=25.04%, avg=30239.54, stdev=5435.51
>>   cpu          : usr=1.34%, sys=5.68%, ctx=134184, majf=0, minf=94
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 20=0.04%, 50=1.10%, 100=12.02%, 250=82.18%, 500=4.62%
>>      lat (msec): 750=0.04%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=237701KB/s, minb=243406KB/s, maxb=243406KB/s, mint=17628msec, maxt=17628msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=120758KB/s, minb=123656KB/s, maxb=123656KB/s, mint=34699msec, maxt=34699msec
>>
>> Disk stats (read/write):
>>   md1: ios=10912/10901, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4062/6768, aggrmerge=109645/369712, aggrticks=157952/111924, aggrin_queue=270487, aggrutil=89.91%
>>     sdc: ios=4070/6893, merge=109432/378682, ticks=173784/179392, in_queue=353780, util=89.88%
>>     sdd: ios=4041/6961, merge=110017/380798, ticks=164960/160236, in_queue=325820, util=89.91%
>>     sde: ios=4039/6727, merge=110658/366163, ticks=169804/166052, in_queue=336484, util=88.98%
>>     sdf: ios=4080/6591, merge=108922/359343, ticks=39460/24636, in_queue=64820, util=52.33%
>>     sdg: ios=4081/6668, merge=109197/363574, ticks=241752/29308, in_queue=271532, util=73.63%
> 
> #################################
> 3) 15GB RAID6 array, stripe 2048:
>> read: (groupid=0, jobs=4): err= 0: pid=4102
>>   read : io=4092.0MB, bw=207827KB/s, iops=135 , runt= 20162msec
>>     slat (usec): min=122 , max=3904 , avg=230.75, stdev=99.04
>>     clat (msec): min=5 , max=363 , avg=117.03, stdev=70.26
>>      lat (msec): min=5 , max=363 , avg=117.27, stdev=70.25
>>     bw (KB/s) : min=26533, max=104239, per=25.00%, avg=51947.24, stdev=13121.27
>>   cpu          : usr=0.05%, sys=0.83%, ctx=2640, majf=0, minf=6241
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.55%, 20=4.99%, 50=14.96%, 100=21.33%, 250=53.01%
>>      lat (msec): 500=5.17%
>> write: (groupid=1, jobs=4): err= 0: pid=4108
>>   write: io=4092.0MB, bw=145833KB/s, iops=94 , runt= 28733msec
>>     slat (usec): min=768 , max=94184 , avg=1838.20, stdev=4140.24
>>     clat (msec): min=15 , max=1155 , avg=145.02, stdev=128.70
>>      lat (msec): min=16 , max=1156 , avg=146.86, stdev=128.77
>>     bw (KB/s) : min= 6144, max=104448, per=29.22%, avg=42617.70, stdev=17118.84
>>   cpu          : usr=1.63%, sys=2.52%, ctx=4086, majf=0, minf=94
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 20=0.37%, 50=9.02%, 100=38.42%, 250=39.22%, 500=10.12%
>>      lat (msec): 750=2.09%, 1000=0.59%, 2000=0.18%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s, mint=20162msec, maxt=20162msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s, mint=28733msec, maxt=28733msec
>>
>> Disk stats (read/write):
>>   md1: ios=10912/10873, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3018/4990, aggrmerge=79606/371231, aggrticks=152741/83319, aggrin_queue=236284, aggrutil=92.45%
>>     sdc: ios=2988/5046, merge=79033/376737, ticks=129600/103552, in_queue=233412, util=84.14%
>>     sdd: ios=3010/5140, merge=77101/390954, ticks=119032/103236, in_queue=222536, util=85.87%
>>     sde: ios=3039/5009, merge=79935/374563, ticks=196200/159792, in_queue=356236, util=92.45%
>>     sdf: ios=3021/4883, merge=80483/356643, ticks=29976/24956, in_queue=55100, util=49.53%
>>     sdg: ios=3033/4872, merge=81481/357261, ticks=288900/25060, in_queue=314136, util=76.17%
> 
> #################################
> 4) 15GB RAID6 array, stripe 4096:
>> read: (groupid=0, jobs=4): err= 0: pid=4411
>>   read : io=4092.0MB, bw=214938KB/s, iops=139 , runt= 19495msec
>>     slat (usec): min=192 , max=1468 , avg=226.97, stdev=87.18
>>     clat (msec): min=4 , max=339 , avg=112.51, stdev=63.05
>>      lat (msec): min=4 , max=339 , avg=112.73, stdev=63.05
>>     bw (KB/s) : min=    0, max=96000, per=24.87%, avg=53457.49, stdev=13728.88
>>   cpu          : usr=0.02%, sys=0.89%, ctx=2591, majf=0, minf=6241
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 10=0.15%, 20=4.62%, 50=13.56%, 100=26.43%, 250=52.20%
>>      lat (msec): 500=3.04%
>> write: (groupid=1, jobs=4): err= 0: pid=4419
>>   write: io=4092.0MB, bw=146526KB/s, iops=95 , runt= 28597msec
>>     slat (usec): min=822 , max=36057 , avg=1585.60, stdev=1174.08
>>     clat (msec): min=14 , max=1132 , avg=142.49, stdev=126.05
>>      lat (msec): min=15 , max=1134 , avg=144.08, stdev=126.08
>>     bw (KB/s) : min=    0, max=83836, per=28.88%, avg=42315.69, stdev=18036.84
>>   cpu          : usr=1.42%, sys=2.85%, ctx=2937, majf=0, minf=94
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 20=0.18%, 50=10.08%, 100=37.61%, 250=40.62%, 500=8.50%
>>      lat (msec): 750=2.46%, 1000=0.44%, 2000=0.11%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=214937KB/s, minb=220096KB/s, maxb=220096KB/s, mint=19495msec, maxt=19495msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=146526KB/s, minb=150042KB/s, maxb=150042KB/s, mint=28597msec, maxt=28597msec
>>
>> Disk stats (read/write):
>>   md1: ios=10912/10895, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3013/5053, aggrmerge=78083/371841, aggrticks=151707/88554, aggrin_queue=240554, aggrutil=92.72%
>>     sdc: ios=2981/5124, merge=77153/379556, ticks=142800/116248, in_queue=259356, util=87.13%
>>     sdd: ios=2990/5182, merge=75443/390196, ticks=128184/107752, in_queue=236264, util=86.87%
>>     sde: ios=3034/5057, merge=77882/372755, ticks=185048/159352, in_queue=344636, util=92.72%
>>     sdf: ios=3035/4941, merge=79423/356692, ticks=29196/28100, in_queue=57520, util=50.56%
>>     sdg: ios=3026/4963, merge=80518/360009, ticks=273308/31320, in_queue=304996, util=76.24%
> 
> ###################################################
> 5) 3TB LUKS volume (no raid, just one single disk):
>> read: (groupid=0, jobs=4): err= 0: pid=31359
>>   read : io=4092.0MB, bw=125032KB/s, iops=81 , runt= 33513msec
>>     slat (usec): min=86 , max=1404 , avg=125.10, stdev=97.21
>>     clat (msec): min=26 , max=323 , avg=195.52, stdev=39.04
>>      lat (msec): min=26 , max=323 , avg=195.65, stdev=39.03
>>     bw (KB/s) : min=21982, max=41890, per=25.12%, avg=31405.62, stdev=3662.77
>>   cpu          : usr=0.02%, sys=0.31%, ctx=2738, majf=0, minf=6233
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 50=0.51%, 100=2.38%, 250=93.07%, 500=4.03%
>> write: (groupid=1, jobs=4): err= 0: pid=31364
>>   write: io=4092.0MB, bw=158193KB/s, iops=102 , runt= 26488msec
>>     slat (usec): min=181 , max=673 , avg=442.78, stdev=67.26
>>     clat (msec): min=21 , max=555 , avg=154.33, stdev=45.35
>>      lat (msec): min=21 , max=555 , avg=154.77, stdev=45.35
>>     bw (KB/s) : min=19097, max=45988, per=25.04%, avg=39617.62, stdev=5714.79
>>   cpu          : usr=1.00%, sys=0.25%, ctx=2737, majf=0, minf=86
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 50=0.44%, 100=1.54%, 250=94.61%, 500=3.15%, 750=0.26%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s, mint=33513msec, maxt=33513msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s, mint=26488msec, maxt=26488msec
>>
>> Disk stats (read/write):
>>   dm-1: ios=8184/8196, merge=0/0, ticks=1521200/1235264, in_queue=2759104, util=99.58%, aggrios=8184/8225, aggrmerge=0/31, aggrticks=1421660/1137984, aggrin_queue=2559612, aggrutil=99.49%
>>     sda: ios=8184/8225, merge=0/31, ticks=1421660/1137984, in_queue=2559612, util=99.49%
> 
> ##############################
> 6) 6TB LUKS array, stripe 256:
>> read: (groupid=0, jobs=4): err= 0: pid=31121
>>   read : io=4092.0MB, bw=137700KB/s, iops=89 , runt= 30430msec
>>     slat (usec): min=123 , max=1353 , avg=157.35, stdev=91.53
>>     clat (msec): min=15 , max=545 , avg=148.80, stdev=61.30
>>      lat (msec): min=15 , max=545 , avg=148.96, stdev=61.29
>>     bw (KB/s) : min=    0, max=129024, per=29.75%, avg=40959.72, stdev=21470.47
>>   cpu          : usr=0.01%, sys=0.51%, ctx=2797, majf=0, minf=6241
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 20=0.29%, 50=9.90%, 100=12.76%, 250=75.07%, 500=1.58%
>>      lat (msec): 750=0.40%
>> write: (groupid=1, jobs=4): err= 0: pid=31131
>>   write: io=4092.0MB, bw=41225KB/s, iops=26 , runt=101643msec
>>     slat (usec): min=197 , max=1309 , avg=557.69, stdev=124.14
>>     clat (msec): min=42 , max=2648 , avg=586.07, stdev=393.18
>>      lat (msec): min=43 , max=2649 , avg=586.63, stdev=393.19
>>     bw (KB/s) : min=    0, max=35583, per=26.36%, avg=10867.00, stdev=5232.75
>>   cpu          : usr=0.33%, sys=0.09%, ctx=2699, majf=0, minf=94
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 50=0.07%, 100=1.39%, 250=18.55%, 500=31.82%, 750=20.49%
>>      lat (msec): 1000=12.57%, 2000=14.66%, >=2000=0.44%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=137699KB/s, minb=141004KB/s, maxb=141004KB/s, mint=30430msec, maxt=30430msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=41224KB/s, minb=42214KB/s, maxb=42214KB/s, mint=101643msec, maxt=101643msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=10912/10906, merge=0/0, ticks=1385524/5596828, in_queue=6983364, util=99.89%, aggrios=10912/10940, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>     md2: ios=10912/10940, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20710/27494, aggrmerge=358521/573067, aggrticks=208334/119186, aggrin_queue=327551, aggrutil=61.95%
>>     sdc: ios=21005/27276, merge=366645/567533, ticks=224772/179916, in_queue=404728, util=61.95%
>>     sdd: ios=21154/26926, merge=369215/564213, ticks=212876/160624, in_queue=373552, util=60.42%
>>     sde: ios=20460/27449, merge=354440/571659, ticks=127008/102188, in_queue=229220, util=48.28%
>>     sdf: ios=20464/27908, merge=350591/581422, ticks=241608/78180, in_queue=319784, util=55.24%
>>     sdg: ios=20470/27915, merge=351718/580509, ticks=235408/75024, in_queue=310472, util=55.91%
> 
> ###############################
> 7) 6TB LUKS array, stripe 2048:
>> read: (groupid=0, jobs=4): err= 0: pid=30910
>>   read : io=4092.0MB, bw=237500KB/s, iops=154 , runt= 17643msec
>>     slat (usec): min=134 , max=1511 , avg=165.25, stdev=96.45
>>     clat (msec): min=16 , max=859 , avg=101.07, stdev=61.40
>>      lat (msec): min=17 , max=859 , avg=101.23, stdev=61.40
>>     bw (KB/s) : min=32807, max=97523, per=25.40%, avg=60325.02, stdev=15396.63
>>   cpu          : usr=0.02%, sys=0.76%, ctx=2709, majf=0, minf=6233
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=2728/0/0, short=0/0/0
>>
>>      lat (msec): 20=0.07%, 50=13.38%, 100=48.90%, 250=34.38%, 500=3.08%
>>      lat (msec): 750=0.15%, 1000=0.04%
>> write: (groupid=1, jobs=4): err= 0: pid=30915
>>   write: io=4092.0MB, bw=90652KB/s, iops=59 , runt= 46223msec
>>     slat (usec): min=173 , max=983 , avg=599.60, stdev=141.06
>>     clat (msec): min=34 , max=1185 , avg=258.48, stdev=135.97
>>      lat (msec): min=34 , max=1185 , avg=259.08, stdev=135.98
>>     bw (KB/s) : min=    0, max=88722, per=25.95%, avg=23524.06, stdev=9036.31
>>   cpu          : usr=0.77%, sys=0.22%, ctx=2576, majf=0, minf=86
>>   IO depths    : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w/d: total=0/2728/0, short=0/0/0
>>
>>      lat (msec): 50=0.48%, 100=7.00%, 250=49.01%, 500=37.72%, 750=5.10%
>>      lat (msec): 1000=0.59%, 2000=0.11%
>>
>> Run status group 0 (all jobs):
>>    READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s, mint=17643msec, maxt=17643msec
>>
>> Run status group 1 (all jobs):
>>   WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s, mint=46223msec, maxt=46223msec
>>
>> Disk stats (read/write):
>>   dm-0: ios=10912/10875, merge=0/0, ticks=544880/2481824, in_queue=3026996, util=99.57%, aggrios=10912/10930, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>     md2: ios=10912/10930, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5058/6798, aggrmerge=153871/404343, aggrticks=195870/113487, aggrin_queue=309447, aggrutil=92.15%
>>     sdc: ios=5024/7100, merge=152571/421782, ticks=332612/204936, in_queue=537660, util=92.15%
>>     sdd: ios=4961/7109, merge=150591/422166, ticks=257308/161200, in_queue=418608, util=88.11%
>>     sde: ios=5019/6697, merge=154781/395583, ticks=226120/152068, in_queue=378296, util=85.49%
>>     sdf: ios=5118/6463, merge=156068/385905, ticks=65716/23932, in_queue=89724, util=55.85%
>>     sdg: ios=5172/6624, merge=155347/396280, ticks=97596/25300, in_queue=122948, util=60.00%
> 
> 
> 
> 
> 
> Am 22.12.2013 09:12, schrieb Stan Hoeppner:
>> On 12/21/2013 8:35 PM, Kevin Richter wrote:
>>> Okay, well, I've run your fio config... but with so many results and
>>
>> I'm glad you changed to size=1G instead of size=10G; otherwise the test
>> would have read/written 80GB instead of 8GB and taken forever.  I often
>> forget the size parameter is per job (thread), not global.
>>
>>> abbreviations I currently feel a bit overchallenged :)
>>> So please don't take it badly if I just paste the whole results.
>>> I have tried with a stripe cache size of 1024, 2048 and 4096.
>>
>> The full output is good.  Going from 1024 to 2048 gained you ~17MB/s
>> throughput with this job file.  But you didn't run with the default 256
>> for comparison, which would be very useful to know.  That's OK, as you
>> can do so with the next set of test runs.  Given the huge latencies I'm
>> thinking this config may be seeking the disks too much.  Modify your fio
>> job file as follows:
>>
>> numjobs=4
>> blocksize=1536k
>> iodepth=4
>> size=1g
>>
>> as this may yield slightly better results.  ~190MB/s read and 75MB/s
>> write is very low for three effective 7.2K RPM SATA spindles, especially
>> the write throughput.  Compare to a similar FIO job with 2 threads,
>> iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive, NCQ
>> disabled, w/deadline elevator:
>>
>> Run status group 0 (all jobs):
>>    READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s
>> Run status group 1 (all jobs):
>>   WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s
>>
>> 88MB/s read, 108MB/s write.  Write is faster apparently due to the small
>> 16MB drive write cache.  With your eight 2.4GHz SandyBridge cores and
>> md-RAID6 on much newer larger 7.2K drives w/larger caches, you should
>> have significantly greater write speed than this old single SATA drive.
>>  Currently you're showing 30MB/s slower write throughput, some 40%
>> slower than this single drive.  With aligned writes, thus no RMW, you
>> should be getting at least 2 spindles worth of throughput, or ~200MB/s.
>>
>> So, run FIO again with the mods above, once with stripe_cache_size 256
>> and once with 2048.  Something seems to be seriously dragging down your
>> throughput.  I'm sure you'd like to know what, and how to fix it.
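
(For reference, a complete fio job file with those four settings might
look roughly like the sketch below.  The [global] options beyond the four
above (directory, ioengine, direct) are assumptions for illustration; the
job file actually used in this thread is not shown here.)

    [global]
    directory=/mnt/md2      ; assumed mount point of the filesystem under test
    ioengine=libaio         ; assumed async engine so iodepth takes effect
    direct=1                ; assumed, to bypass the page cache
    numjobs=4
    blocksize=1536k
    iodepth=4
    size=1g
    group_reporting

    [seq-read]
    rw=read

    [seq-write]
    stonewall               ; wait for the read jobs; starts reporting group 1
    rw=write
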
>>
>>> Btw, I also have /dev/md1, which is a 15GB unencrypted array using the
>>> same 5 disks as my LUKS-md2 array, so nearly the same setup just without
>>> the LUKS layer. If helpful, I can execute some fio tests on this
>>> filesystem, too.
>>
>> This is also a RAID6 array, without dm-crypt?  If so this will afford a
>> great comparison, and may be very informative.  Run FIO against this
>> filesystem (I assume it's also XFS) as well with stripe_cache_size 512
>> and 2048.  Four FIO tests total including the two on the big array.
>>
>> It'll be interesting to see where this leads.  At this point the
>> candidates I see are a single dm-crypt thread pegging a core, dm-crypt
>> taking aligned writes and misaligning them causing all writes to be
>> RMWs, or AF drives with misaligned partitions causing every write to be
>> an internal RMW.  Or maybe something I've not considered.
>>
>>
>>>
>>> So long :)
>>> Kevin
>>>
>>>
>>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>>>> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193  iops] [eta 00m:06s]
>>>> read: (groupid=0, jobs=8): err= 0: pid=12987
>>>>   read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
>>>>     slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
>>>>     clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
>>>>      lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
>>>>     bw (KB/s) : min=    0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
>>>>   cpu          : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>>>
>>>>      lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
>>>>      lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
>>>> write: (groupid=1, jobs=8): err= 0: pid=13202
>>>>   write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
>>>>     slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
>>>>     clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
>>>>      lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
>>>>     bw (KB/s) : min=    0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
>>>>   cpu          : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>>>
>>>>      lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
>>>>      lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
>>>>
>>>> Run status group 0 (all jobs):
>>>>    READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
>>>>
>>>> Run status group 1 (all jobs):
>>>>   WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
>>>>
>>>> Disk stats (read/write):
>>>>   dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>>     md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
>>>>     sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
>>>>     sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
>>>>     sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
>>>>     sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
>>>>     sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%
>>>
>>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>>>> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175  iops] [eta 00m:06s]
>>>> read: (groupid=0, jobs=8): err= 0: pid=6392
>>>>   read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
>>>>     slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
>>>>     clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
>>>>      lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
>>>>     bw (KB/s) : min=    0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
>>>>   cpu          : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>>>
>>>>      lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
>>>>      lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
>>>> write: (groupid=1, jobs=8): err= 0: pid=6663
>>>>   write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
>>>>     slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
>>>>     clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
>>>>      lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
>>>>     bw (KB/s) : min=    0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
>>>>   cpu          : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>>>
>>>>      lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
>>>>      lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
>>>>
>>>> Run status group 0 (all jobs):
>>>>    READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
>>>>
>>>> Run status group 1 (all jobs):
>>>>   WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
>>>>
>>>> Disk stats (read/write):
>>>>   dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>>     md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
>>>>     sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
>>>>     sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
>>>>     sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
>>>>     sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
>>>>     sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%
>>>
>>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>>>> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182  iops] [eta 00m:00s]
>>>> read: (groupid=0, jobs=8): err= 0: pid=11787
>>>>   read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
>>>>     slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
>>>>     clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
>>>>      lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
>>>>     bw (KB/s) : min=    0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
>>>>   cpu          : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued r/w/d: total=163840/0/0, short=0/0/0
>>>>
>>>>      lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
>>>>      lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
>>>> write: (groupid=1, jobs=8): err= 0: pid=12060
>>>>   write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
>>>>     slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
>>>>     clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
>>>>      lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
>>>>     bw (KB/s) : min=    0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
>>>>   cpu          : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
>>>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>>>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>      issued r/w/d: total=0/163840/0, short=0/0/0
>>>>
>>>>      lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
>>>>      lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
>>>>
>>>> Run status group 0 (all jobs):
>>>>    READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
>>>>
>>>> Run status group 1 (all jobs):
>>>>   WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
>>>>
>>>> Disk stats (read/write):
>>>>   dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>>>>     md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
>>>>     sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
>>>>     sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
>>>>     sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
>>>>     sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
>>>>     sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-22  9:18             ` Stan Hoeppner
@ 2013-12-22 20:14               ` Dave Chinner
  2013-12-22 21:01               ` Michael L. Semon
  1 sibling, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2013-12-22 20:14 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, xfs, xfs

On Sun, Dec 22, 2013 at 03:18:05AM -0600, Stan Hoeppner wrote:
> On 12/20/2013 11:30 PM, Dave Chinner wrote:
> > On Fri, Dec 20, 2013 at 06:36:33AM -0600, Stan Hoeppner wrote:
> >> On 12/20/2013 4:26 AM, Kevin Richter wrote:
> >>> 'top' while copying with stripe size of 2048 (the source disk is ntfs):
> >>>> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
> >>>> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
> >>>> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
> >>>> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
> >>>> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
> >>>>
> >>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>>> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
> >>>> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
> >>>> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
> >>>>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
> >>>>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
> >>>> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
> >>>> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
> >>>>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
> >>>>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
> >>>> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
> >>>> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1
> >>
> >> Hmm, what's kworker/3:1?  That's not a crypto thread eating 93% of a
> >> SandyBridge core at only ~180 MB/s throughput is it?
> > 
> > Kworkers are anonymous kernel worker threads that do work that
> > has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread
> > on CPU 3 (3:0 is the first). The kworkers form a thread pool that grows
> > and shrinks according to demand.
> > 
> > As the naming suggests, kworker threads are per-CPU, and there can
> > be hundreds of them per CPU if enough of the queued work blocks
> > during execution (e.g. on locks, waiting for IO, etc).
> > If there is little blocking occurring, there might only be a couple
> > of kworker threads that do all the work, and hence you see them
> > consuming huge amounts of CPU on behalf of other subsystems...
> > 
> > XFS uses workqueues for lots of things, so it's not unusual to see
> > an IO or metadata heavy workload end up with huge numbers of
> > kworker threads doing work:
> > 
> > <run parallel fsmark test>
> > ....
> > $ ps -ef |grep kworker |wc -l
> > 91
> 
> For the record, again, I've never used dm-crypt.  I'm just trying to
> work through the layers to identify the source of Kevin's problems.  So
> please don't clobber me too hard for going OT, or trying to speak
> (somewhat) intelligently about something I'm just learning about...
> 
> So, how does one identify who is doing work inside a worker thread?  In
> this case we're trying to figure out if dm-crypt might be eating a CPU
> core thus limiting its throughput.  However...

perf top is your friend for finding out where the kernel is spending
its CPU time. It will tell you which functions (and call chains) are
consuming CPU time rather than what individual threads are doing.

If you really want to know what worker threads are doing, a task
dump via /proc/sysrq-trigger shows each kworker thread with the name
of the workqueue it is doing work for, as well as its stack trace.
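
In practice that looks something like this (a sketch; perf must match
the running kernel, and the sysrq task dump can be very large):

    # kernel CPU consumers, with call chains
    perf top -g

    # dump all task stacks; kworkers are annotated with their workqueue
    echo t > /proc/sysrq-trigger
    dmesg | grep -A20 'kworker/3:1'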

> 
> I just read a thread on dm-devel about dm-crypt performance dated late
> March 2013.  It wasn't clear to me if dm-crypt yet uses unbound
> workqueues, if it dispatches work to multiple queues, or if it still
> uses the hand built dispatcher, as of kernel 3.11, which Kevin just
> updated to.

dm-crypt uses workqueues with a concurrency depth of 1, and kcryptd
is defined as CPU intensive and so is scheduled like a long-running
kernel thread rather than a workqueue. Read decryption is done on
IO completion, so if you've only got one CPU processing IO
completions, then read decryption will effectively be single-threaded
due to the nature of per-cpu work queues and a concurrency
depth of 1.

> What was clear is that as of March dm-crypt was not preserving IO
> ordering.  If this is still the case, that would imply RMW with md
> parity arrays, which would tend to explain Kevin's write throughput
> being ~1/3rd of his read, and less than a single disk.

Encryption adds IO latency, so unless you have lots of concurrency
to hide the latency, dm-crypt will almost always be slower than the
same IO on an equivalent unencrypted disk.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2013-12-22  9:18             ` Stan Hoeppner
  2013-12-22 20:14               ` Dave Chinner
@ 2013-12-22 21:01               ` Michael L. Semon
  1 sibling, 0 replies; 30+ messages in thread
From: Michael L. Semon @ 2013-12-22 21:01 UTC (permalink / raw)
  To: xfs

On 12/22/2013 04:18 AM, Stan Hoeppner wrote:

> For the record, again, I've never used dm-crypt.  I'm just trying to
> work through the layers to identify the source of Kevin's problems.  So
> please don't clobber me too hard for going OT, or trying to speak
> (somewhat) intelligently about something I'm just learning about...

This stuff isn't easy.  There are a lot of security-vs-speed tradeoffs, 
and kernel support varies by processor and architecture.  Take this 
example from x86:

# info from fdisk:
/dev/sda5         8404992    10502143     1048576   83  Linux

# cryptsetup status /dev/mapper/wCrypt
/dev/mapper/wCrypt is active.
  type:    LUKS1
  cipher:  aes-xts-essiv:sha256
  keysize: 512 bits
  device:  /dev/sda5
  offset:  4096 sectors
  size:    2093056 sectors
  mode:    read/write

What this gibberish means from a speed standpoint is this:

aes: had assembly speedups in the kernel; NIST approved;

xts: almost as fast as cbc while being more secure;

essiv: "plain" is faster, but essiv is more secure;

sha256: it seemed like an ok hash;

512 bits: much slower than 256 or 128 bits, but hopefully more secure, fine 
for a small partition like this.
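
If the installed cryptsetup is new enough (1.6.0 or later), its built-in
benchmark is a quick way to see how these choices perform on a given CPU.
A sketch (the cipher spec below is an example, not necessarily Kevin's):

    # quick comparison of common ciphers and key sizes
    cryptsetup benchmark

    # or a single combination, e.g. AES-XTS with a 512-bit key
    cryptsetup benchmark --cipher aes-xts-plain64 --key-size 512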

A similar "cryptsetup status <cryptname>" from Kevin might help you 
benchmark things, and Kevin might explain his rationale without blowing his 
cover ;-)  A `zcat /proc/config.gz | grep CRYPTO` might help you see what 
kernel support he has as well.

Dave's recent post about I/O latency with dm-crypt reads/writes might be 
the answer that you're seeking.  I just wanted to put in my two cents.

Good luck!

Michael

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
  2013-12-21 11:18             ` Stan Hoeppner
                               ` (2 preceding siblings ...)
  (?)
@ 2013-12-26  8:55             ` Christoph Hellwig
  2013-12-26  9:24               ` Stan Hoeppner
  -1 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2013-12-26  8:55 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: linux-raid, xfs

On Sat, Dec 21, 2013 at 05:18:42AM -0600, Stan Hoeppner wrote:
> The powers that be, Linus in particular, are not fond of default
> settings that create a lot of kernel memory structures.  The default
> md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.

The default sizing is stupid, as it basically makes RAID unusable out
of the box; I always have to fix that up, along with setting a somewhat
reasonable chunk size for parity RAID to make it usable.  I'm also
pretty sure I complained about it at least once a while ago, but never
got a reply.
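
For anyone hitting the same defaults, the usual fix-ups look something
like the following sketch; the device names and values are examples in
line with the ones discussed in this thread, not universal
recommendations:

    # raise the stripe cache from the default 256 (pages per member device);
    # this does not persist across reboots
    echo 2048 > /sys/block/md2/md/stripe_cache_size

    # chunk size is chosen at array creation (or reshape) time, e.g. 64KB;
    # --create builds a new array, so only for a fresh setup
    mdadm --create /dev/md2 --level=6 --raid-devices=5 \
          --chunk=64 /dev/sd[cdefg]1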

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
  2013-12-26  8:55             ` Christoph Hellwig
@ 2013-12-26  9:24               ` Stan Hoeppner
  2013-12-26 22:14                 ` NeilBrown
  0 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2013-12-26  9:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-raid, xfs

On 12/26/2013 2:55 AM, Christoph Hellwig wrote:
> On Sat, Dec 21, 2013 at 05:18:42AM -0600, Stan Hoeppner wrote:
>> The powers that be, Linus in particular, are not fond of default
>> settings that create a lot of kernel memory structures.  The default
>> md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
> 
> The default sizing is stupid as it basically makes RAID unusable out
> of the box, I always have to fix that up, as well as a somewhat
> reasonable chunk size for parity RAID to make it usable.  I'm also
> pretty sure I complained about it at least once a while ago, but never
> got a reply.

IIRC you, Dave C., and I all voiced criticism after the default chunk
size was changed from 64KB to 512KB.  I guess we didn't make a strong
enough case to have it reduced, or maybe didn't use the right approach.

Maybe Neil is waiting for patches to be submitted for changing these
defaults, and to argue the merits in that context instead of pure
discussion?  Dunno.  Just guessing.  Maybe he'll read this and jump in.

-- 
Stan


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
  2013-12-26  9:24               ` Stan Hoeppner
@ 2013-12-26 22:14                 ` NeilBrown
  0 siblings, 0 replies; 30+ messages in thread
From: NeilBrown @ 2013-12-26 22:14 UTC (permalink / raw)
  To: stan; +Cc: Christoph Hellwig, linux-raid, xfs


[-- Attachment #1.1: Type: text/plain, Size: 1219 bytes --]

On Thu, 26 Dec 2013 03:24:00 -0600 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 12/26/2013 2:55 AM, Christoph Hellwig wrote:
> > On Sat, Dec 21, 2013 at 05:18:42AM -0600, Stan Hoeppner wrote:
> >> The powers that be, Linus in particular, are not fond of default
> >> settings that create a lot of kernel memory structures.  The default
> >> md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
> > 
> > The default sizing is stupid as it basically makes RAID unusable out
> > of the box, I always have to fix that up, as well as a somewhat
> > reasonable chunk size for parity RAID to make it usable.  I'm also
> > pretty sure I complained about it at least once a while ago, but never
> > got a reply.
> 
> IIRC you Dave C. and myself all voiced criticism after the default chunk
> size was changed from 64KB to 512KB.  I guess we didn't make a strong
> enough case to have it reduced, or maybe didn't use the right approach.
> 
> Maybe Neil is waiting for patches to be submitted for changing these
> defaults, and to argue the merits in that context instead of pure
> discussion?  Dunno.  Just guessing.  Maybe he'll read this and jump in.
> 

Good guess.

NeilBrown

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2012-04-17  8:32     ` Dave Chinner
@ 2012-04-17  9:19       ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-04-17  9:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

Am 17.04.2012 10:32, schrieb Dave Chinner:
>>> What's your storage hardware, application, workload, etc?
>> It's a RAID 10 of 1TB SATA drives in front of a 3ware controller doing
>> heavy rsync backups.
> 
> Hmmm. It sounds like it is just taking a long time for the IO to
> complete, not that there is any other problem. If the IO is really,
> really slow, then a log force is definitely a place it could stall
> for some time.
OK, as I've seen this just once, it might just be that the controller /
disks were too slow in this case.

> Does the RAID controller have a BBWC?
No, but it doesn't matter ;-) If it crashes I can just repair the FS and
restart the backup.

> What IO scheduler are you using?
[root@backup1 ~]# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq

> Does mounting with nobarriers reduce the stall time?
It is already mounted with nobarrier:
/dev/sdc3 on /sdc3 type xfs (rw,noatime,nodiratime,nobarrier,prjquota)

> Can you reduce the IO load so that it doesn't end up in this state?
I can try that by doing fewer backups per time period.

Stefan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2012-04-17  8:19   ` Stefan Priebe - Profihost AG
@ 2012-04-17  8:32     ` Dave Chinner
  2012-04-17  9:19       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2012-04-17  8:32 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

On Tue, Apr 17, 2012 at 10:19:45AM +0200, Stefan Priebe - Profihost AG wrote:
> Am 17.04.2012 01:56, schrieb Dave Chinner:
> > Your first question should be: is this a known problem?
> > And the answer to that is: I don't think so.
> Sorry absolutely correct.
> 
> > So, is the system actually hung?
> No. The kernel has ended the stalled request.
> 
> > Were there IO errors around this time?
> No.
> 
> > did the system recover and continue to work normally?
> Yes.
> 
> > What's your storage hardware, application, workload, etc?
> It's a RAID 10 of 1TB SATA drives in front of a 3ware controller doing
> heavy rsync backups.

Hmmm. It sounds like it is just taking a long time for the IO to
complete, not that there is any other problem. If the IO is really,
really slow, then a log force is definitely a place it could stall
for some time.

Does the RAID controller have a BBWC? What IO scheduler are you
using? Does mounting with nobarriers reduce the stall time? Can you
reduce the IO load so that it doesn't end up in this state?
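
A sketch of how the scheduler and barrier questions can be checked or
toggled (the device and mount point below are taken from the trace and
the reply above):

    # current IO scheduler for the backing device
    cat /sys/block/sdc/queue/scheduler

    # current mount options, look for barrier/nobarrier
    grep sdc3 /proc/mounts

    # temporarily disable barriers on the XFS mount for comparison
    mount -o remount,nobarrier /sdc3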

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2012-04-16 23:56 ` Dave Chinner
@ 2012-04-17  8:19   ` Stefan Priebe - Profihost AG
  2012-04-17  8:32     ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-04-17  8:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

Am 17.04.2012 01:56, schrieb Dave Chinner:
> Your first question should be: is this a known problem?
> And the answer to that is: I don't think so.
Sorry absolutely correct.

> So, is the system actually hung?
No. The kernel has ended the stalled request.

> Were there IO errors around this time?
No.

> did the system recover and continue to work normally?
Yes.

> What's your storage hardware, application, workload, etc?
It's a RAID 10 of 1TB SATA drives in front of a 3ware controller doing
heavy rsync backups.

Greets
Stefan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: XFS blocked task in xlog_cil_force_lsn
  2012-04-16  6:48 Stefan Priebe - Profihost AG
@ 2012-04-16 23:56 ` Dave Chinner
  2012-04-17  8:19   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2012-04-16 23:56 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, stable, xfs

On Mon, Apr 16, 2012 at 08:48:25AM +0200, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> Today I've seen this call trace / block in XFS while running 3.0.27.
> 
> INFO: task xfsaild/sdc3:1818 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfsaild/sdc3    D ffff88041f374dc8     0  1818      2 0x00000000
> ffff88041edd3d30 0000000000000046 ffff8803ecaa8dc0 ffff88041f23fe28
> ffff88041f374860 0000000000010800 ffff88041edd3fd8 ffff88041edd2010
> ffff88041edd3fd8 0000000000010800 ffff88041f15c860 ffff88041f374860
> Call Trace:
>  [<ffffffff8159c94a>] schedule+0x3a/0x60
>  [<ffffffff812763a9>] xlog_cil_force_lsn+0xe9/0x110
>  [<ffffffff8104caf0>] ? try_to_wake_up+0x2a0/0x2a0
>  [<ffffffff81273922>] _xfs_log_force+0x62/0x250
>  [<ffffffff81273d23>] xfs_log_force+0x13/0x40
>  [<ffffffff81281ac7>] ? xfs_trans_ail_cursor_next+0x27/0x30
>  [<ffffffff81282237>] xfsaild+0x447/0x460
>  [<ffffffff81281df0>] ? xfs_trans_ail_init+0xb0/0xb0
>  [<ffffffff8106f646>] kthread+0x96/0xa0
>  [<ffffffff815a0214>] kernel_thread_helper+0x4/0x10
>  [<ffffffff8106f5b0>] ? kthread_worker_fn+0x130/0x130
>  [<ffffffff815a0210>] ? gs_change+0xb/0xb
> 
> Is there a known fix / patch available?

Your first question should be: is this a known problem?

And the answer to that is: I don't think so.

So, is the system actually hung? Were there IO errors around this
time? Did the system recover and continue to work normally? What's
your storage hardware, application, workload, etc?
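
A quick first pass at the IO-error question might look like this (the
device name is taken from the trace; behind a hardware RAID controller,
smartctl may also need a -d option such as -d 3ware,N):

    # any disk or controller errors logged around the stall?
    dmesg | grep -iE 'error|timeout' | tail -50

    # drive-level health for the device under the filesystem
    smartctl -a /dev/sdc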

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* XFS blocked task in xlog_cil_force_lsn
@ 2012-04-16  6:48 Stefan Priebe - Profihost AG
  2012-04-16 23:56 ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-04-16  6:48 UTC (permalink / raw)
  To: stable; +Cc: Christoph Hellwig, Greg KH, Philipp Herz - Profihost AG, xfs

Hi,

Today I've seen this call trace / block in XFS while running 3.0.27.

INFO: task xfsaild/sdc3:1818 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfsaild/sdc3    D ffff88041f374dc8     0  1818      2 0x00000000
ffff88041edd3d30 0000000000000046 ffff8803ecaa8dc0 ffff88041f23fe28
ffff88041f374860 0000000000010800 ffff88041edd3fd8 ffff88041edd2010
ffff88041edd3fd8 0000000000010800 ffff88041f15c860 ffff88041f374860
Call Trace:
 [<ffffffff8159c94a>] schedule+0x3a/0x60
 [<ffffffff812763a9>] xlog_cil_force_lsn+0xe9/0x110
 [<ffffffff8104caf0>] ? try_to_wake_up+0x2a0/0x2a0
 [<ffffffff81273922>] _xfs_log_force+0x62/0x250
 [<ffffffff81273d23>] xfs_log_force+0x13/0x40
 [<ffffffff81281ac7>] ? xfs_trans_ail_cursor_next+0x27/0x30
 [<ffffffff81282237>] xfsaild+0x447/0x460
 [<ffffffff81281df0>] ? xfs_trans_ail_init+0xb0/0xb0
 [<ffffffff8106f646>] kthread+0x96/0xa0
 [<ffffffff815a0214>] kernel_thread_helper+0x4/0x10
 [<ffffffff8106f5b0>] ? kthread_worker_fn+0x130/0x130
 [<ffffffff815a0210>] ? gs_change+0xb/0xb

Is there a known fix / patch available?

Stefan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2013-12-26 22:14 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-18  2:05 XFS blocked task in xlog_cil_force_lsn Kevin Richter
2013-12-18  3:38 ` Stan Hoeppner
2013-12-18 10:27   ` Kevin Richter
2013-12-19 14:11     ` Stan Hoeppner
2013-12-20 10:26       ` Kevin Richter
2013-12-20 12:36         ` Stan Hoeppner
2013-12-21  5:30           ` Dave Chinner
2013-12-22  9:18             ` Stan Hoeppner
2013-12-22 20:14               ` Dave Chinner
2013-12-22 21:01               ` Michael L. Semon
2013-12-22  2:35           ` Kevin Richter
2013-12-22  8:12             ` Stan Hoeppner
2013-12-22 14:10               ` Kevin Richter
2013-12-22 17:29                 ` Stan Hoeppner
2013-12-20 22:43         ` Arkadiusz Miśkiewicz
2013-12-21 11:18           ` md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint Stan Hoeppner
2013-12-21 11:18             ` Stan Hoeppner
2013-12-21 12:20             ` Piergiorgio Sartor
2013-12-21 12:20               ` Piergiorgio Sartor
2013-12-22  1:41             ` Stan Hoeppner
2013-12-26  8:55             ` Christoph Hellwig
2013-12-26  9:24               ` Stan Hoeppner
2013-12-26 22:14                 ` NeilBrown
2013-12-18  8:33 ` XFS blocked task in xlog_cil_force_lsn Stefan Ring
2013-12-18 22:21 ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2012-04-16  6:48 Stefan Priebe - Profihost AG
2012-04-16 23:56 ` Dave Chinner
2012-04-17  8:19   ` Stefan Priebe - Profihost AG
2012-04-17  8:32     ` Dave Chinner
2012-04-17  9:19       ` Stefan Priebe - Profihost AG
