* sleeps and waits during io_submit
@ 2015-11-28  2:43 Glauber Costa
  2015-11-30 14:10 ` Brian Foster
  2015-11-30 23:10 ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: Glauber Costa @ 2015-11-28  2:43 UTC (permalink / raw)
  To: xfs, Avi Kivity, david

[-- Attachment #1: Type: text/plain, Size: 3130 bytes --]

Hello my dear XFSers,

For those of you who don't know, we at ScyllaDB produce a modern NoSQL
data store that, at the moment, runs on top of XFS only. Because of our
thread-per-core architecture, we deal exclusively with asynchronous and
direct IO, and therefore we avoid issuing any operation that will sleep.
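
To make the submission path concrete, here is roughly what each of our
IOs looks like at the syscall level (a minimal, self-contained sketch
using libaio against an O_DIRECT file; the path, sizes and alignment
below are made up for illustration):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* O_DIRECT: buffers, offsets and sizes must be block aligned. */
        int fd = open("/mnt/xfs/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        void *buf;
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 'x', 4096);

        io_context_t ctx = 0;
        io_setup(128, &ctx);                    /* one context per core */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 4096, 0);  /* async 4k write at offset 0 */

        /* This is the call we need to never sleep. */
        io_submit(ctx, 1, cbs);

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);     /* reap the completion */

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}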

While debugging an extreme case of bad performance (most likely
related to a not-so-great disk), I have found a variety of cases in
which XFS blocks. To find those, I have used perf record -e
sched:sched_switch -p <pid_of_db>, and I am attaching the perf report
as xfs-sched_switch.log. Please note that this doesn't tell me for how
long we block, but as mentioned before, blocking operations outside
our control are detrimental to us regardless of the elapsed time.

For those who are not acquainted with our internals, please ignore
everything in that file except the xfs functions. For the xfs symbols,
there are two kinds of events: those that are children of io_submit,
where we don't tolerate blocking, and those that are children of our
helper IO thread, to which we push big operations that we know will
block until we can get rid of them all. We care about the former and
ignore the latter.

Please allow me to ask you a couple of questions about those findings.
If we are doing anything wrong, advice on best practices is truly
welcome.

1) xfs_buf_lock -> xfs_log_force.

I started wondering what would make xfs_log_force sleep, but then I
noticed that xfs_log_force is only called here when a buffer is
marked stale. Most of the time, a buffer seems to be marked stale
due to errors. Although that is not my case (more on that below), it
got me thinking that maybe the right thing to do would be to avoid
hitting this case altogether?

The file example-stale.txt contains a backtrace of the case where we
are being marked as stale. It seems to be happening when we convert
the inode's extents from unwritten to real. Can this case be
avoided? I won't pretend I know the intricacies of this, but couldn't
we be keeping extents from the very beginning to avoid creating stale
buffers?
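
For the sake of discussion, the kind of up-front allocation I have in
mind is something like the sketch below (assuming fallocate() is the
right tool here; my understanding is that on XFS this reserves the
blocks as unwritten extents, so the first direct write into each extent
would presumably still go through the unwritten-to-written conversion):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Reserve 'len' bytes up front without changing the file size.  The
 * reserved extents come back unwritten, so this removes per-write
 * allocation, but (assumption) not the conversion transaction on the
 * first write into each extent. */
static int preallocate_segment(int fd, off_t len)
{
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) < 0) {
                perror("fallocate");
                return -1;
        }
        return 0;
}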

2) xfs_buf_lock -> down

This is one I truly don't understand. What can be causing contention
on this lock? We never have two different cores writing to the same
buffer, nor should we have the same core doing so.

3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time

You guys seem to have an interface to avoid that, by setting the
FMODE_NOCMTIME flag. This is done by issuing the open-by-handle ioctl,
which will set this flag for all regular files. That's great, but that
ioctl requires CAP_SYS_ADMIN, which is a big no for us, since we run
our server as an unprivileged user. I don't understand, however, why
such a strict check is needed. If we have full rights on the
filesystem, why can't we issue this operation? In my view, CAP_FOWNER
should already be enough. I do understand that handles have to be
stable and that a file can have its ownership changed, in which case
the previous owner would keep a valid handle. Is that the reason you
went with the most restrictive capability?
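
For completeness, this is roughly how we would consume that interface
if the capability check allowed it (a sketch using xfsprogs' libhandle
wrappers, which as far as I can tell end up in the same open-by-handle
ioctl; the path and open flags are made up):

#define _GNU_SOURCE
#include <xfs/handle.h>
#include <fcntl.h>
#include <stdio.h>

/* Open a file through its XFS handle.  The returned fd has FMODE_NOCMTIME
 * set, so writes through it skip the mtime/ctime updates.  The ioctl behind
 * open_by_handle() is the one that currently demands CAP_SYS_ADMIN. */
static int open_nocmtime(const char *path)
{
        void *hanp;
        size_t hlen;
        int fd;

        if (path_to_handle((char *)path, &hanp, &hlen) < 0) {
                perror("path_to_handle");
                return -1;
        }
        fd = open_by_handle(hanp, hlen, O_WRONLY | O_DIRECT);
        if (fd < 0)
                perror("open_by_handle");   /* EPERM for an unprivileged user */
        free_handle(hanp, hlen);
        return fd;
}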

[-- Attachment #2: xfs-sched_switch.log --]
[-- Type: text/plain, Size: 171763 bytes --]

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'sched:sched_switch'
# Event count (approx.): 2669
#
# Overhead  Command  Shared Object      Symbol        
# ........  .......  .................  ..............
#
   100.00%  scylla   [kernel.kallsyms]  [k] __schedule
             |
             ---__schedule
                |          
                |--96.18%-- schedule
                |          |          
                |          |--56.14%-- schedule_user
                |          |          |          
                |          |          |--53.30%-- int_careful
                |          |          |          |          
                |          |          |          |--45.05%-- 0x7f4ade6f74ed
                |          |          |          |          reactor_backend_epoll::make_reactor_notifier
                |          |          |          |          |          
                |          |          |          |          |--67.63%-- syscall_work_queue::submit_item
                |          |          |          |          |          |          
                |          |          |          |          |          |--32.05%-- posix_file_impl::truncate
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--65.33%-- _ZN12continuationIZN6futureIJEE4thenIZN19file_data_sink_impl5flushEvEUlvE_S1_EET0_OT_EUlS7_E_JEE3runEv
                |          |          |          |          |          |          |          reactor::del_timer
                |          |          |          |          |          |          |          0x60b0000e2040
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--20.00%-- db::commitlog::segment::flush(unsigned long)::{lambda()#1}::operator()
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |          |--73.33%-- future<>::then<db::commitlog::segment::flush(unsigned long)::{lambda()#1}, future<lw_shared_ptr<db::commitlog::segment> > >
                |          |          |          |          |          |          |          |          _ZN12continuationIZN6futureIJ13lw_shared_ptrIN2db9commitlog7segmentEEEE4thenIZNS4_4syncEvEUlT_E_S6_EET0_OS8_EUlSB_E_JS5_EE3runEv
                |          |          |          |          |          |          |          |          reactor::del_timer
                |          |          |          |          |          |          |          |          0x60e0000e2040
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |           --26.67%-- _ZN12continuationIZN6futureIJEE4thenIZN2db9commitlog7segment5flushEmEUlvE_S0_IJ13lw_shared_ptrIS5_EEEEET0_OT_EUlSC_E_JEE3runEv
                |          |          |          |          |          |          |                     reactor::del_timer
                |          |          |          |          |          |          |                     0x6090000e2040
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--10.67%-- sstables::sstable::seal_sstable
                |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
                |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --4.00%-- sstables::sstable::write_toc
                |          |          |          |          |          |                     sstables::sstable::prepare_write_components
                |          |          |          |          |          |                     |          
                |          |          |          |          |          |                     |--50.00%-- 0x4d3a4f6ec4e8cd75
                |          |          |          |          |          |                     |          
                |          |          |          |          |          |                      --50.00%-- 0x3ebf3dd80e3b174d
                |          |          |          |          |          |          
                |          |          |          |          |          |--23.93%-- posix_file_impl::discard
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--82.14%-- _ZN12continuationIZN6futureIImEE4thenIZN19file_data_sink_impl6do_putEm16temporary_bufferIcEEUlmE_S0_IIEEEET0_OT_EUlSA_E_ImEE3runEv
                |          |          |          |          |          |          |          reactor::del_timer
                |          |          |          |          |          |          |          0x6080000e2040
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --17.86%-- futurize<future<lw_shared_ptr<db::commitlog::segment> > >::apply<db::commitlog::segment_manager::allocate_segment(bool)::{lambda(file)#1}, file>
                |          |          |          |          |          |                     _ZN12continuationIZN6futureIJ4fileEE4thenIZN2db9commitlog15segment_manager16allocate_segmentEbEUlS1_E_S0_IJ13lw_shared_ptrINS5_7segmentEEEEEET0_OT_EUlSE_E_JS1_EE3runEv
                |          |          |          |          |          |          
                |          |          |          |          |          |--20.94%-- reactor::open_file_dma
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--20.41%-- db::commitlog::segment_manager::allocate_segment
                |          |          |          |          |          |          |          db::commitlog::segment_manager::on_timer()::{lambda()#1}::operator()
                |          |          |          |          |          |          |          0xb8c264
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--14.29%-- sstables::sstable::write_simple<(sstables::sstable::component_type)8, sstables::statistics>
                |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
                |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--12.24%-- sstables::write_crc
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |          |--16.67%-- 0x313532343536002f
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |          |--16.67%-- 0x373633323533002f
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |          |--16.67%-- 0x363139333232002f
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |          |--16.67%-- 0x353933303330002f
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |          |--16.67%-- 0x383930383133002f
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |           --16.67%-- 0x323338303037002f
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--12.24%-- sstables::write_digest
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)7, sstables::filter>
                |          |          |          |          |          |          |          sstables::sstable::write_filter
                |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
                |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)4, sstables::summary_ka>
                |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
                |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--10.20%-- 0x78d93b
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--6.12%-- sstables::sstable::open_data
                |          |          |          |          |          |          |          |          
                |          |          |          |          |          |          |           --100.00%-- 0x8000000004000000
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --4.08%-- sstables::sstable::write_toc
                |          |          |          |          |          |                     sstables::sstable::prepare_write_components
                |          |          |          |          |          |                     |          
                |          |          |          |          |          |                      --100.00%-- 0x6100206690ef
                |          |          |          |          |          |          
                |          |          |          |          |          |--18.38%-- syscall_work_queue::submit_item
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--10.00%-- 0x7f4ad89f8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--7.50%-- 0x7f4ad83f8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--7.50%-- 0x7f4ad6bf8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--7.50%-- 0x7f4ad65f8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--5.00%-- 0x60b015e8cd90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--5.00%-- 0x60100acaed90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--5.00%-- 0x607006f04d90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--5.00%-- 0xffffffffffffa5d0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60e01acbed90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60e01acbec60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60a018d7ad90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60a018d7ac60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60b015e8cc60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60900bb8ad60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60100acaec60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60800951dd90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60800951dc60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60d009089d90
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60d009089c60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x607006f04c60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x60f005984d60
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x7f4ad77f8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x7f4adb9f8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x7f4ad9bf8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x7f4ad7df8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.50%-- 0x7f4ad77f8fe0
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --2.50%-- 0x7f4ad5ff8fe0
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.99%-- reactor::open_directory
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--57.14%-- sstables::sstable::filename
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --42.86%-- sstables::sstable::write_toc
                |          |          |          |          |          |                     sstables::sstable::prepare_write_components
                |          |          |          |          |          |                     |          
                |          |          |          |          |          |                     |--50.00%-- 0x4d3a4f6ec4e8cd75
                |          |          |          |          |          |                     |          
                |          |          |          |          |          |                      --50.00%-- 0x3ebf3dd80e3b174d
                |          |          |          |          |          |          
                |          |          |          |          |           --1.71%-- reactor::rename_file
                |          |          |          |          |                     sstables::sstable::seal_sstable
                |          |          |          |          |                     std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
                |          |          |          |          |                     _GLOBAL__sub_I__ZN12app_templateC2Ev
                |          |          |          |          |          
                |          |          |          |           --32.37%-- _ZN12continuationIZN6futureIJEE4thenIZN18syscall_work_queue11submit_itemEPNS3_9work_itemEEUlvE_S1_EET0_OT_EUlS9_E_JEE3runEv
                |          |          |          |                     reactor::del_timer
                |          |          |          |                     0x60d0000e2040
                |          |          |          |          
                |          |          |          |--29.04%-- __vdso_clock_gettime
                |          |          |          |          
                |          |          |          |--19.66%-- 0x7f4ade42b193
                |          |          |          |          reactor_backend_epoll::complete_epoll_event
                |          |          |          |          |          
                |          |          |          |          |--41.61%-- smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process
                |          |          |          |          |          |          
                |          |          |          |          |          |--79.03%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--95.92%-- 0x6070000c3000
                |          |          |          |          |          |          |          
                |          |          |          |          |          |          |--2.04%-- 0x61d0000c1000
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --2.04%-- 0x61d0000c1000
                |          |          |          |          |          |          
                |          |          |          |          |          |--3.23%-- 0x14dd51
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x162a54
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x161dca
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x159c8b
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x1598b5
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x14dd3e
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x14bad8
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x14a880
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x127105
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- 0x6070000e2040
                |          |          |          |          |          |          
                |          |          |          |          |          |--1.61%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
                |          |          |          |          |          |          0x60d0000c3000
                |          |          |          |          |          |          
                |          |          |          |          |           --1.61%-- __vdso_clock_gettime
                |          |          |          |          |                     0x7f4ad77f9160
                |          |          |          |          |          
                |          |          |          |          |--30.20%-- __restore_rt
                |          |          |          |          |          |          
                |          |          |          |          |          |--57.14%-- __vdso_clock_gettime
                |          |          |          |          |          |          0x1d
                |          |          |          |          |          |          
                |          |          |          |          |          |--9.52%-- smp_message_queue::smp_message_queue
                |          |          |          |          |          |          0x6070000c3000
                |          |          |          |          |          |          
                |          |          |          |          |          |--4.76%-- 0x600000357240
                |          |          |          |          |          |          
                |          |          |          |          |          |--4.76%-- 0x60000031a640
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- posix_file_impl::list_directory
                |          |          |          |          |          |          0x609000044730
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x46efbf
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x600000442e40
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x600000376440
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x6000002bac40
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x600000295640
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x600000289e40
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x60000031a640
                |          |          |          |          |          |          
                |          |          |          |          |          |--2.38%-- 0x7f4ade6f74ed
                |          |          |          |          |          |          __libc_siglongjmp
                |          |          |          |          |          |          0x60000047be40
                |          |          |          |          |          |          
                |          |          |          |          |           --2.38%-- 0x7f4adb3f7fd0
                |          |          |          |          |          
                |          |          |          |          |--14.09%-- 0x33
                |          |          |          |          |          
                |          |          |          |          |--12.08%-- promise<temporary_buffer<char> >::promise
                |          |          |          |          |          _ZN6futureIJ16temporary_bufferIcEEE4thenIZN12input_streamIcE12read_exactlyEmEUlT_E_S2_EET0_OS6_
                |          |          |          |          |          |          
                |          |          |          |          |          |--44.44%-- input_stream<char>::read_exactly
                |          |          |          |          |          |          0x8
                |          |          |          |          |          |          
                |          |          |          |          |          |--11.11%-- 0x7f4adb3f8ea0
                |          |          |          |          |          |          
                |          |          |          |          |          |--11.11%-- 0x7f4ad9bf8ea0
                |          |          |          |          |          |          
                |          |          |          |          |          |--11.11%-- 0x7f4ad89f8ea0
                |          |          |          |          |          |          
                |          |          |          |          |          |--11.11%-- 0x7f4ad83f8ea0
                |          |          |          |          |          |          
                |          |          |          |          |          |--5.56%-- 0x7f4ad77f8ea0
                |          |          |          |          |          |          
                |          |          |          |          |           --5.56%-- 0x7f4ad7df8ea0
                |          |          |          |          |          
                |          |          |          |          |--1.34%-- 0x7f4ad6bf8d80
                |          |          |          |          |          
                |          |          |          |           --0.67%-- 0x7f4adadf8d80
                |          |          |          |          
                |          |          |          |--4.43%-- __libc_send
                |          |          |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
                |          |          |          |          |          
                |          |          |          |          |--14.71%-- 0x4
                |          |          |          |          |          
                |          |          |          |          |--11.76%-- 0x7f4ad89f8de0
                |          |          |          |          |          
                |          |          |          |          |--8.82%-- 0x7f4adb3f8de0
                |          |          |          |          |          
                |          |          |          |          |--8.82%-- 0x7f4ad9bf8de0
                |          |          |          |          |          
                |          |          |          |          |--8.82%-- 0x7f4ad77f8de0
                |          |          |          |          |          
                |          |          |          |          |--8.82%-- 0x7f4ad6bf8de0
                |          |          |          |          |          
                |          |          |          |          |--5.88%-- 0x7f4ad83f8de0
                |          |          |          |          |          
                |          |          |          |          |--5.88%-- 0x7f4ad7df8de0
                |          |          |          |          |          
                |          |          |          |          |--5.88%-- 0x7f4ad53f8de0
                |          |          |          |          |          
                |          |          |          |          |--2.94%-- 0x7f4acc9f8de0
                |          |          |          |          |          
                |          |          |          |          |--2.94%-- continuation<future<file>::wait()::{lambda(future_state<file>&&)#1}, file>::~continuation
                |          |          |          |          |          0x611003c8e9b8
                |          |          |          |          |          
                |          |          |          |          |--2.94%-- 0x7f4adb9f8de0
                |          |          |          |          |          
                |          |          |          |          |--2.94%-- 0x7f4ad71f8de0
                |          |          |          |          |          
                |          |          |          |          |--2.94%-- 0x7f4ad65f8de0
                |          |          |          |          |          
                |          |          |          |          |--2.94%-- 0x7f4ad59f8de0
                |          |          |          |          |          
                |          |          |          |           --2.94%-- 0x7f4ad35f8de0
                |          |          |          |          
                |          |          |          |--1.56%-- 0x7f4ade6f754d
                |          |          |          |          reactor::read_some
                |          |          |          |          |          
                |          |          |          |          |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv
                |          |          |          |          |          reactor::del_timer
                |          |          |          |          |          0x6070000e2040
                |          |          |          |          |          
                |          |          |          |          |--8.33%-- _ZN12continuationIZN6futureIIEE4thenIZ5sleepINSt6chrono3_V212system_clockEmSt5ratioILl1ELl1000000EEES1_NS4_8durationIT0_T1_EEEUlvE_S1_EESA_OT_EUlSF_E_IEE3runEv
                |          |          |          |          |          reactor::del_timer
                |          |          |          |          |          0x6080000e2040
                |          |          |          |          |          
                |          |          |          |          |--8.33%-- 0x600000483640
                |          |          |          |          |          
                |          |          |          |          |--8.33%-- 0x600000480440
                |          |          |          |          |          
                |          |          |          |           --8.33%-- 0x36
                |          |          |           --0.26%-- [...]
                |          |          |          
                |          |           --46.70%-- retint_careful
                |          |                     |          
                |          |                     |--6.24%-- posix_file_impl::list_directory
                |          |                     |          |          
                |          |                     |          |--80.00%-- 0x60f0000e2020
                |          |                     |          |          
                |          |                     |          |--5.00%-- 0x601000044730
                |          |                     |          |          
                |          |                     |          |--5.00%-- 0x60e000044720
                |          |                     |          |          
                |          |                     |          |--2.50%-- 0x60f000135500
                |          |                     |          |          
                |          |                     |          |--2.50%-- 0x6190000e2098
                |          |                     |          |          
                |          |                     |          |--2.50%-- 0x60d0000c3000
                |          |                     |          |          
                |          |                     |           --2.50%-- 0x1
                |          |                     |          
                |          |                     |--3.42%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
                |          |                     |          |          
                |          |                     |          |--95.65%-- boost::program_options::variables_map::get
                |          |                     |          |          
                |          |                     |           --4.35%-- 0x618000044680
                |          |                     |          
                |          |                     |--3.12%-- memory::small_pool::add_more_objects
                |          |                     |          |          
                |          |                     |          |--10.53%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::clear_and_release
                |          |                     |          |          mutation_partition::clustered_row
                |          |                     |          |          mutation::set_clustered_cell
                |          |                     |          |          cql3::constants::setter::execute
                |          |                     |          |          cql3::statements::update_statement::add_update_for_key
                |          |                     |          |          _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE
                |          |                     |          |          cql3::statements::modification_statement::get_mutations
                |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
                |          |                     |          |          cql3::query_options::query_options
                |          |                     |          |          |          
                |          |                     |          |          |--50.00%-- 0x7f4ad77f80e0
                |          |                     |          |          |          
                |          |                     |          |           --50.00%-- 0x7f4ad6bf80e0
                |          |                     |          |          
                |          |                     |          |--10.53%-- memory::small_pool::add_more_objects
                |          |                     |          |          |          
                |          |                     |          |          |--50.00%-- 0x60e00015d000
                |          |                     |          |          |          
                |          |                     |          |           --50.00%-- 0x60b00af6c758
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60a018ee3867
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60d00d41f680
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x61400c6bb4d0
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60e007c918d6
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60e0078294ce
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x607006ee4da0
                |          |                     |          |          
                |          |                     |          |--5.26%-- _ZN12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZN7service13storage_proxy22send_to_live_endpointsEmEUlNSt15__exception_ptr13exception_ptrEE0_EES1_OT_EUlSA_E_S1_EET0_SA_EUlSA_E_JEE3runEv
                |          |                     |          |          reactor::del_timer
                |          |                     |          |          0x6030000e2040
                |          |                     |          |          
                |          |                     |          |--5.26%-- service::storage_proxy::mutate_locally
                |          |                     |          |          service::storage_proxy::send_to_live_endpoints
                |          |                     |          |          parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}>
                |          |                     |          |          0x601000136d00
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60a0001900e0
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60e00015d040
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x61300015d000
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60e00013bde0
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x60b00010f308
                |          |                     |          |          
                |          |                     |          |--5.26%-- 0x6010000e4808
                |          |                     |          |          
                |          |                     |           --5.26%-- 0x7f4ad65f7f50
                |          |                     |          
                |          |                     |--2.82%-- std::unique_ptr<reactor::pollfn, std::default_delete<std::unique_ptr> > reactor::make_pollfn<reactor::run()::{lambda()#3}>(reactor::run()::{lambda()#3}&&)::the_pollfn::poll_and_check_more_work
                |          |                     |          |          
                |          |                     |          |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
                |          |                     |          |          boost::program_options::variables_map::get
                |          |                     |          |          
                |          |                     |          |--25.00%-- 0x1
                |          |                     |          |          
                |          |                     |          |--12.50%-- 0x53
                |          |                     |          |          
                |          |                     |          |--12.50%-- 0x3e
                |          |                     |          |          
                |          |                     |          |--12.50%-- 0x24
                |          |                     |          |          
                |          |                     |           --12.50%-- 0xb958000000000000
                |          |                     |          
                |          |                     |--2.67%-- std::_Function_handler<partition_presence_checker_result (partition_key const&), column_family::make_partition_presence_checker(lw_shared_ptr<std::map<long, lw_shared_ptr<sstables::sstable>, std::less<long>, std::allocator<std::pair<long const, lw_shared_ptr<sstables::sstable> > > > >)::{lambda(partition_key const&)#1}>::_M_invoke
                |          |                     |          |          
                |          |                     |          |--66.67%-- 0x1b5c280
                |          |                     |          |          
                |          |                     |          |--27.78%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::resize
                |          |                     |          |          row::apply
                |          |                     |          |          mutation_partition_applier::accept_row_cell
                |          |                     |          |          mutation_partition_view::accept
                |          |                     |          |          
                |          |                     |           --5.56%-- 0x2a4399
                |          |                     |          
                |          |                     |--2.08%-- smp_message_queue::smp_message_queue
                |          |                     |          |          
                |          |                     |          |--60.00%-- 0x60f0000c3000
                |          |                     |          |          
                |          |                     |          |--10.00%-- 0x6000002d7240
                |          |                     |          |          
                |          |                     |          |--10.00%-- 0x19
                |          |                     |          |          
                |          |                     |          |--10.00%-- 0xb
                |          |                     |          |          
                |          |                     |           --10.00%-- 0x7
                |          |                     |          
                |          |                     |--1.93%-- smp_message_queue::process_queue<4ul, smp_message_queue::process_completions()::{lambda(smp_message_queue::work_item*)#1}>
                |          |                     |          
                |          |                     |--1.63%-- __vdso_clock_gettime
                |          |                     |          |          
                |          |                     |           --100.00%-- __clock_gettime
                |          |                     |                     std::chrono::_V2::system_clock::now
                |          |                     |                     0xa63209
                |          |                     |          
                |          |                     |--1.49%-- memory::small_pool::deallocate
                |          |                     |          |          
                |          |                     |          |--40.00%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::emplace_back<atomic_cell_or_collection>
                |          |                     |          |          
                |          |                     |          |--20.00%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase
                |          |                     |          |          service::storage_proxy::got_response
                |          |                     |          |          _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv
                |          |                     |          |          reactor::del_timer
                |          |                     |          |          0x6100000e2040
                |          |                     |          |          
                |          |                     |          |--10.00%-- cql3::statements::modification_statement::get_mutations
                |          |                     |          |          
                |          |                     |          |--10.00%-- cql3::statements::modification_statement::build_partition_keys
                |          |                     |          |          cql3::statements::modification_statement::create_exploded_clustering_prefix
                |          |                     |          |          0x60c014be0b00
                |          |                     |          |          
                |          |                     |          |--10.00%-- mutation_partition::~mutation_partition
                |          |                     |          |          std::vector<mutation, std::allocator<mutation> >::~vector
                |          |                     |          |          service::storage_proxy::mutate_with_triggers
                |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
                |          |                     |          |          cql3::statements::modification_statement::execute
                |          |                     |          |          cql3::query_processor::process_statement
                |          |                     |          |          transport::cql_server::connection::process_execute
                |          |                     |          |          transport::cql_server::connection::process_request_one
                |          |                     |          |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |          |          0x8961de
                |          |                     |          |          
                |          |                     |           --10.00%-- object_deleter_impl<deleter>::~object_deleter_impl
                |          |                     |                     _ZN12continuationIZN6futureIJEE12then_wrappedIZZNS1_7finallyIZ7do_withI11foreign_ptrI10shared_ptrIN9transport10cql_server8responseEEEZZNS8_10connection14write_responseEOSB_ENUlvE_clEvEUlRT_E_EDaOSF_OT0_EUlvE_EES1_SI_ENUlS1_E_clES1_EUlSF_E_S1_EESJ_SI_EUlSI_E_JEED0Ev
                |          |                     |                     0x61a0000c3db0
                |          |                     |          
                |          |                     |--1.34%-- dht::decorated_key::equal
                |          |                     |          |          
                |          |                     |          |--83.33%-- 0x607000138f00
                |          |                     |          |          
                |          |                     |           --16.67%-- 0x60a0000e0f40
                |          |                     |          
                |          |                     |--1.34%-- service::storage_proxy::send_to_live_endpoints
                |          |                     |          
                |          |                     |--1.19%-- transport::cql_server::connection::process_execute
                |          |                     |          transport::cql_server::connection::process_request_one
                |          |                     |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |          |          
                |          |                     |          |--87.50%-- transport::cql_server::connection::process_request
                |          |                     |          |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          0x60e0000c3000
                |          |                     |          |          
                |          |                     |           --12.50%-- 0x8961de
                |          |                     |          
                |          |                     |--1.19%-- reactor::run
                |          |                     |          |          
                |          |                     |          |--87.50%-- smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
                |          |                     |          |          continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
                |          |                     |          |          0x600000043d00
                |          |                     |          |          
                |          |                     |           --12.50%-- app_template::run_deprecated
                |          |                     |                     main
                |          |                     |                     __libc_start_main
                |          |                     |                     _GLOBAL__sub_I__ZN3org6apache9cassandra21g_cassandra_constantsE
                |          |                     |                     0x7f4ae20c9fa0
                |          |                     |          
                |          |                     |--1.04%-- __clock_gettime
                |          |                     |          std::chrono::_V2::system_clock::now
                |          |                     |          |          
                |          |                     |          |--42.86%-- reactor::run
                |          |                     |          |          smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
                |          |                     |          |          continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
                |          |                     |          |          0x600000043d00
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0xa63209
                |          |                     |          |          
                |          |                     |          |--14.29%-- continuation<future<> future<>::finally<auto do_with<std::vector<frozen_mutation, std::allocator<frozen_mutation> >, shared_ptr<service::storage_proxy>, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}>(std::vector<frozen_mutation, std::allocator<frozen_mutation> >&&, shared_ptr<service::storage_proxy>&&, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}&&)::{lambda()#1}>(service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::a
                |          |                     |          |          0x2b7434
                |          |                     |          |          
                |          |                     |          |--14.29%-- _ZN8futurizeI6futureIJSt10unique_ptrIN4cql317update_parametersESt14default_deleteIS3_EEEEE5applyIZNS2_10statements22modification_statement22make_update_parametersERN7seastar7shardedIN7service13storage_proxyEEE13lw_shared_ptrISt6vectorI13partition_keySaISK_EEESI_I26exploded_clustering_prefixERKNS2_13query_optionsEblEUlT_E_JNSt12experimental15fundamentals_v18optionalINS3_13prefetch_dataEEEEEES7_OST_OSt5tupleIJDpT0_EE
                |          |                     |          |          cql3::statements::modification_statement::make_update_parameters
                |          |                     |          |          cql3::statements::modification_statement::get_mutations
                |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
                |          |                     |          |          cql3::query_options::query_options
                |          |                     |          |          0x7f4ad6bf80e0
                |          |                     |          |          
                |          |                     |           --14.29%-- database::apply_in_memory
                |          |                     |                     database::do_apply
                |          |                     |                     _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv
                |          |                     |                     reactor::del_timer
                |          |                     |                     0x6090000e2040
                |          |                     |          
                |          |                     |--1.04%-- memory::small_pool::allocate
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0x5257c379469d9
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0x609002b9fe98
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0x13c8b90
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0x60f000190710
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0x25
                |          |                     |          |          
                |          |                     |          |--14.29%-- 0x7f4ad6bf84c0
                |          |                     |          |          
                |          |                     |           --14.29%-- 0x7f4ad53f81f0
                |          |                     |          
                |          |                     |--0.89%-- db::serializer<atomic_cell_view>::serializer
                |          |                     |          mutation_partition_serializer::write_without_framing
                |          |                     |          frozen_mutation::frozen_mutation
                |          |                     |          frozen_mutation::frozen_mutation
                |          |                     |          
                |          |                     |--0.89%-- do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          0x60f0000c3000
                |          |                     |          
                |          |                     |--0.89%-- futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |          transport::cql_server::connection::process_request
                |          |                     |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          
                |          |                     |          |--83.33%-- 0x6090000c3000
                |          |                     |          |          
                |          |                     |           --16.67%-- 0x600000044400
                |          |                     |          
                |          |                     |--0.89%-- std::_Function_handler<void (), reactor::run()::{lambda()#8}>::_M_invoke
                |          |                     |          |          
                |          |                     |          |--50.00%-- reactor::run
                |          |                     |          |          smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
                |          |                     |          |          continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
                |          |                     |          |          0x600000043d00
                |          |                     |          |          
                |          |                     |           --50.00%-- reactor::signals::signal_handler::signal_handler
                |          |                     |                     0x3e8
                |          |                     |          
                |          |                     |--0.74%-- db::commitlog::segment::allocate
                |          |                     |          |          
                |          |                     |           --100.00%-- db::commitlog::add
                |          |                     |                     database::do_apply
                |          |                     |                     |          
                |          |                     |                     |--75.00%-- database::apply
                |          |                     |                     |          smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process
                |          |                     |                     |          smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
                |          |                     |                     |          boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
                |          |                     |                     |          boost::program_options::variables_map::get
                |          |                     |                     |          
                |          |                     |                      --25.00%-- _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv
                |          |                     |                                reactor::del_timer
                |          |                     |                                0x60b0000e2040
                |          |                     |          
                |          |                     |--0.74%-- service::storage_proxy::create_write_response_handler
                |          |                     |          
                |          |                     |--0.74%-- transport::cql_server::connection::process_request_one
                |          |                     |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |          |          
                |          |                     |          |--80.00%-- transport::cql_server::connection::process_request
                |          |                     |          |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          0x60a0000c3000
                |          |                     |          |          
                |          |                     |           --20.00%-- 0x8961de
                |          |                     |          
                |          |                     |--0.74%-- compound_type<(allow_prefixes)0>::compare
                |          |                     |          |          
                |          |                     |          |--20.00%-- 0x6030056c0f20
                |          |                     |          |          
                |          |                     |          |--20.00%-- boost::intrusive::bstbase2<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::find
                |          |                     |          |          mutation_partition::clustered_row
                |          |                     |          |          mutation::set_clustered_cell
                |          |                     |          |          cql3::constants::setter::execute
                |          |                     |          |          cql3::statements::update_statement::add_update_for_key
                |          |                     |          |          _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE
                |          |                     |          |          cql3::statements::modification_statement::get_mutations
                |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
                |          |                     |          |          cql3::query_options::query_options
                |          |                     |          |          0x7f4adb3f80e0
                |          |                     |          |          
                |          |                     |          |--20.00%-- compound_type<(allow_prefixes)0>::compare
                |          |                     |          |          
                |          |                     |          |--20.00%-- mutation_partition::clustered_row
                |          |                     |          |          boost::intrusive::bstree_impl<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, unsigned long, true, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::insert_unique
                |          |                     |          |          boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node
                |          |                     |          |          0x12d
                |          |                     |          |          
                |          |                     |           --20.00%-- 0x60f00052daf0
                |          |                     |          
                |          |                     |--0.74%-- __memmove_ssse3_back
                |          |                     |          |          
                |          |                     |          |--40.00%-- output_stream<char>::write
                |          |                     |          |          |          
                |          |                     |          |          |--50.00%-- transport::cql_server::response::output
                |          |                     |          |          |          futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}>
                |          |                     |          |          |          
                |          |                     |          |           --50.00%-- 0x7c7fb2
                |          |                     |          |                     0x5257c37847fa0
                |          |                     |          |          
                |          |                     |          |--20.00%-- transport::cql_server::connection::read_short_bytes
                |          |                     |          |          transport::cql_server::connection::process_query
                |          |                     |          |          0x7f4ada7f86f0
                |          |                     |          |          
                |          |                     |          |--20.00%-- transport::cql_server::response::output
                |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}>
                |          |                     |          |          0x2
                |          |                     |          |          
                |          |                     |           --20.00%-- smp_message_queue::flush_response_batch
                |          |                     |                     boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
                |          |                     |                     boost::program_options::variables_map::get
                |          |                     |          
                |          |                     |--0.74%-- syscall_work_queue::work_item_returning<syscall_result_extra<stat>, reactor::file_size(basic_sstring<char, unsigned int, 15u>)::{lambda()#1}>::~work_item_returning
                |          |                     |          |          
                |          |                     |          |--60.00%-- 0x6130000c3000
                |          |                     |          |          
                |          |                     |          |--20.00%-- 0x608001fe59a0
                |          |                     |          |          
                |          |                     |           --20.00%-- 0x16
                |          |                     |          
                |          |                     |--0.74%-- __memset_sse2
                |          |                     |          |          
                |          |                     |          |--40.00%-- std::_Hashtable<range<dht::token>, std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >, std::allocator<std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<range<dht::token> >, std::hash<range<dht::token> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable
                |          |                     |          |          locator::token_metadata::pending_endpoints_for
                |          |                     |          |          service::storage_proxy::create_write_response_handler
                |          |                     |          |          service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
                |          |                     |          |          service::storage_proxy::mutate
                |          |                     |          |          service::storage_proxy::mutate_with_triggers
                |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
                |          |                     |          |          cql3::statements::modification_statement::execute
                |          |                     |          |          cql3::query_processor::process_statement
                |          |                     |          |          transport::cql_server::connection::process_execute
                |          |                     |          |          transport::cql_server::connection::process_request_one
                |          |                     |          |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |          |          transport::cql_server::connection::process_request
                |          |                     |          |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          |          0x6020000c3000
                |          |                     |          |          
                |          |                     |          |--40.00%-- service::digest_read_resolver::~digest_read_resolver
                |          |                     |          |          |          
                |          |                     |          |           --100.00%-- 0x610002612b50
                |          |                     |          |          
                |          |                     |           --20.00%-- std::_Hashtable<basic_sstring<char, unsigned int, 15u>, std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, std::allocator<std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<basic_sstring<char, unsigned int, 15u> >, std::hash<basic_sstring<char, unsigned int, 15u> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable
                |          |                     |                     service::storage_proxy::send_to_live_endpoints
                |          |                     |                     parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}>
                |          |                     |                     service::storage_proxy::mutate
                |          |                     |                     service::storage_proxy::mutate_with_triggers
                |          |                     |                     cql3::statements::modification_statement::execute_without_condition
                |          |                     |                     cql3::statements::modification_statement::execute
                |          |                     |                     cql3::query_processor::process_statement
                |          |                     |                     transport::cql_server::connection::process_execute
                |          |                     |                     transport::cql_server::connection::process_request_one
                |          |                     |                     futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |                     |                     futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |                     futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |                     transport::cql_server::connection::process_request
                |          |                     |                     do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |                     do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |                     do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |                     0x6070000c3000
                |          |                     |          
                |          |                     |--0.74%-- reactor::del_timer
                |          |                     |          |          
                |          |                     |          |--80.00%-- 0x60a0000e2040
                |          |                     |          |          
                |          |                     |           --20.00%-- 0x6080000c3db0
                |          |                     |          
                |          |                     |--0.59%-- unimplemented::operator<<
                |          |                     |          |          
                |          |                     |          |--25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev
                |          |                     |          |          0x600100000008
                |          |                     |          |          
                |          |                     |          |--25.00%-- floating_type_impl<float>::from_string
                |          |                     |          |          
                |          |                     |          |--25.00%-- 0x60e0000e4c10
                |          |                     |          |          
                |          |                     |           --25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev
                |          |                     |                     0x600100000008
                |          |                     |          
                |          |                     |--0.59%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node
                |          |                     |          service::storage_proxy::register_response_handler
                |          |                     |          service::storage_proxy::create_write_response_handler
                |          |                     |          service::storage_proxy::create_write_response_handler
                |          |                     |          service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
                |          |                     |          service::storage_proxy::mutate
                |          |                     |          service::storage_proxy::mutate_with_triggers
                |          |                     |          cql3::statements::modification_statement::execute_without_condition
                |          |                     |          cql3::statements::modification_statement::execute
                |          |                     |          cql3::query_processor::process_statement
                |          |                     |          transport::cql_server::connection::process_execute
                |          |                     |          transport::cql_server::connection::process_request_one
                |          |                     |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |                     |          transport::cql_server::connection::process_request
                |          |                     |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |                     |          0x60b0000c3000
                |          |                     |          
                |          |                     |--0.59%-- mutation::set_clustered_cell
                |          |                     |          |          
                |          |                     |          |--75.00%-- 0xa
                |          |                     |          |          
                |          |                     |           --25.00%-- cql3::constants::setter::execute
                |          |                     |                     cql3::statements::update_statement::add_update_for_key
                |          |                     |                     _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE
                |          |                     |                     cql3::statements::modification_statement::get_mutations
                |          |                     |                     cql3::statements::modification_statement::execute_without_condition
                |          |                     |                     cql3::query_options::query_options
                |          |                     |                     0x7f4ad89f80e0
                |          |                     |          
                |          |                     |--0.59%-- memory::small_pool::small_pool
                |          |                     |          |          
                |          |                     |          |--25.00%-- memory::stats
                |          |                     |          |          boost::program_options::variables_map::get
                |          |                     |          |          
                |          |                     |          |--25.00%-- memory::reclaimer::~reclaimer
                |          |                     |          |          0x1e
                |          |                     |          |          
                |          |                     |          |--25.00%-- memory::allocate_aligned
                |          |                     |          |          
                |          |                     |           --25.00%-- memory::small_pool::add_more_objects
                |          |                     |                     memory::small_pool::add_more_objects
                |          |                     |                     0x6100000e0310
                |          |                     |          
                |          |                     |--0.59%-- __memcpy_sse2_unaligned
                |          |                     |          |          
                |          |                     |          |--50.00%-- mutation_partition_applier::accept_row_cell
                |          |                     |          |          mutation_partition_view::accept
                |          |                     |          |          boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node
                |          |                     |          |          0x12d
                |          |                     |          |          
                |          |                     |          |--25.00%-- scanning_reader::operator()
                |          |                     |          |          sstables::sstable::do_write_components
                |          |                     |          |          sstables::sstable::prepare_write_components
                |          |                     |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
                |          |                     |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
                |          |                     |          |          
                |          |                     |           --25.00%-- memtable::find_or_create_partition_slow
                |          |                     |                     memtable::apply
                |          |                     |                     database::apply_in_memory
                |          |                     |                     database::do_apply
                |          |                     |                     database::apply
                |          |                     |                     smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process
                |          |                     |                     smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
                |          |                     |                     boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
                |          |                     |                     boost::program_options::variables_map::get
                |          |                     |          
                |          |                     |--0.59%-- smp_message_queue::flush_response_batch
                |          |                     |          |          
                |          |                     |          |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
                |          |                     |          |          boost::program_options::variables_map::get
                |          |                     |          |          
                |          |                     |          |--25.00%-- 0x13
                |          |                     |          |          
                |          |                     |          |--25.00%-- 0x7f4ad5ff8f40
                |          |                     |          |          
                |          |                     |           --25.00%-- reactor::run
                |          |                     |                     smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
                |          |                     |                     continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
                |          |                     |                     0x600000043d00
                |          |                      --54.38%-- [...]
                |          |          
                |          |--14.26%-- schedule_timeout
                |          |          |          
                |          |          |--38.52%-- wait_for_completion
                |          |          |          |          
                |          |          |          |--90.07%-- flush_work
                |          |          |          |          xlog_cil_force_lsn
                |          |          |          |          |          
                |          |          |          |          |--96.85%-- _xfs_log_force_lsn
                |          |          |          |          |          |          
                |          |          |          |          |          |--79.67%-- xfs_file_fsync
                |          |          |          |          |          |          vfs_fsync_range
                |          |          |          |          |          |          do_fsync
                |          |          |          |          |          |          sys_fdatasync
                |          |          |          |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          |          |          |          
                |          |          |          |          |          |           --100.00%-- 0x7f4ade4212ad
                |          |          |          |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |          |          |          |                     0x6030000c3ec0
                |          |          |          |          |          |          
                |          |          |          |          |           --20.33%-- xfs_dir_fsync
                |          |          |          |          |                     vfs_fsync_range
                |          |          |          |          |                     do_fsync
                |          |          |          |          |                     sys_fdatasync
                |          |          |          |          |                     entry_SYSCALL_64_fastpath
                |          |          |          |          |                     |          
                |          |          |          |          |                      --100.00%-- 0x7f4ade4212ad
                |          |          |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |          |          |                                0x6040000c3ec0
                |          |          |          |          |          
                |          |          |          |           --3.15%-- _xfs_log_force
                |          |          |          |                     xfs_log_force
                |          |          |          |                     xfs_buf_lock
                |          |          |          |                     _xfs_buf_find
                |          |          |          |                     xfs_buf_get_map
                |          |          |          |                     xfs_trans_get_buf_map
                |          |          |          |                     xfs_btree_get_bufl
                |          |          |          |                     xfs_bmap_extents_to_btree
                |          |          |          |                     xfs_bmap_add_extent_hole_real
                |          |          |          |                     xfs_bmapi_write
                |          |          |          |                     xfs_iomap_write_direct
                |          |          |          |                     __xfs_get_blocks
                |          |          |          |                     xfs_get_blocks_direct
                |          |          |          |                     do_blockdev_direct_IO
                |          |          |          |                     __blockdev_direct_IO
                |          |          |          |                     xfs_vm_direct_IO
                |          |          |          |                     xfs_file_dio_aio_write
                |          |          |          |                     xfs_file_write_iter
                |          |          |          |                     aio_run_iocb
                |          |          |          |                     do_io_submit
                |          |          |          |                     sys_io_submit
                |          |          |          |                     entry_SYSCALL_64_fastpath
                |          |          |          |                     io_submit
                |          |          |          |                     0x46d98a
                |          |          |          |          
                |          |          |           --9.93%-- submit_bio_wait
                |          |          |                     blkdev_issue_flush
                |          |          |                     xfs_blkdev_issue_flush
                |          |          |                     xfs_file_fsync
                |          |          |                     vfs_fsync_range
                |          |          |                     do_fsync
                |          |          |                     sys_fdatasync
                |          |          |                     entry_SYSCALL_64_fastpath
                |          |          |                     |          
                |          |          |                      --100.00%-- 0x7f4ade4212ad
                |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |                                0x6030000c3ec0
                |          |          |          
                |          |          |--32.79%-- io_schedule_timeout
                |          |          |          bit_wait_io
                |          |          |          __wait_on_bit
                |          |          |          |          
                |          |          |          |--51.67%-- wait_on_page_bit
                |          |          |          |          |          
                |          |          |          |          |--95.16%-- filemap_fdatawait_range
                |          |          |          |          |          filemap_write_and_wait_range
                |          |          |          |          |          xfs_file_fsync
                |          |          |          |          |          vfs_fsync_range
                |          |          |          |          |          do_fsync
                |          |          |          |          |          sys_fdatasync
                |          |          |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          |          0x7f4ade4212ad
                |          |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |          |          |          0x60b0000c3ec0
                |          |          |          |          |          
                |          |          |          |           --4.84%-- __migration_entry_wait
                |          |          |          |                     migration_entry_wait
                |          |          |          |                     handle_mm_fault
                |          |          |          |                     __do_page_fault
                |          |          |          |                     do_page_fault
                |          |          |          |                     page_fault
                |          |          |          |                     std::_Function_handler<void (), httpd::http_server::_date_format_timer::{lambda()#1}>::_M_invoke
                |          |          |          |                     |          
                |          |          |          |                      --100.00%-- service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
                |          |          |          |                                service::storage_proxy::mutate
                |          |          |          |                                service::storage_proxy::mutate_with_triggers
                |          |          |          |                                cql3::statements::modification_statement::execute_without_condition
                |          |          |          |                                cql3::statements::modification_statement::execute
                |          |          |          |                                cql3::query_processor::process_statement
                |          |          |          |                                transport::cql_server::connection::process_execute
                |          |          |          |                                transport::cql_server::connection::process_request_one
                |          |          |          |                                futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
                |          |          |          |                                futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
                |          |          |          |                                futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
                |          |          |          |                                transport::cql_server::connection::process_request
                |          |          |          |                                do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
                |          |          |          |                                do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |          |          |                                do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
                |          |          |          |                                0x6140000c3000
                |          |          |          |          
                |          |          |           --48.33%-- out_of_line_wait_on_bit
                |          |          |                     block_truncate_page
                |          |          |                     xfs_setattr_size
                |          |          |                     xfs_vn_setattr
                |          |          |                     notify_change
                |          |          |                     do_truncate
                |          |          |                     do_sys_ftruncate.constprop.15
                |          |          |                     sys_ftruncate
                |          |          |                     entry_SYSCALL_64_fastpath
                |          |          |                     __GI___ftruncate64
                |          |          |                     syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
                |          |          |                     |          
                |          |          |                     |--13.79%-- 0x7f4ad29ff700
                |          |          |                     |          
                |          |          |                     |--13.79%-- 0x7f4acdbff700
                |          |          |                     |          
                |          |          |                     |--12.07%-- 0x7f4ad05ff700
                |          |          |                     |          
                |          |          |                     |--12.07%-- 0x7f4acedff700
                |          |          |                     |          
                |          |          |                     |--10.34%-- 0x7f4ad0bff700
                |          |          |                     |          
                |          |          |                     |--6.90%-- 0x7f4ad2fff700
                |          |          |                     |          
                |          |          |                     |--6.90%-- 0x7f4ad11ff700
                |          |          |                     |          
                |          |          |                     |--6.90%-- 0x7f4acf9ff700
                |          |          |                     |          
                |          |          |                     |--6.90%-- 0x7f4acf3ff700
                |          |          |                     |          
                |          |          |                     |--6.90%-- 0x7f4ace7ff700
                |          |          |                     |          
                |          |          |                     |--1.72%-- 0x7f4ad17ff700
                |          |          |                     |          
                |          |          |                      --1.72%-- 0x7f4aca5ff700
                |          |          |          
                |          |           --28.69%-- __down
                |          |                     down
                |          |                     xfs_buf_lock
                |          |                     _xfs_buf_find
                |          |                     xfs_buf_get_map
                |          |                     |          
                |          |                     |--97.14%-- xfs_buf_read_map
                |          |                     |          xfs_trans_read_buf_map
                |          |                     |          |          
                |          |                     |          |--98.04%-- xfs_read_agf
                |          |                     |          |          xfs_alloc_read_agf
                |          |                     |          |          xfs_alloc_fix_freelist
                |          |                     |          |          |          
                |          |                     |          |          |--93.00%-- xfs_free_extent
                |          |                     |          |          |          xfs_bmap_finish
                |          |                     |          |          |          xfs_itruncate_extents
                |          |                     |          |          |          |          
                |          |                     |          |          |          |--87.10%-- xfs_inactive_truncate
                |          |                     |          |          |          |          xfs_inactive
                |          |                     |          |          |          |          xfs_fs_evict_inode
                |          |                     |          |          |          |          evict
                |          |                     |          |          |          |          iput
                |          |                     |          |          |          |          __dentry_kill
                |          |                     |          |          |          |          dput
                |          |                     |          |          |          |          __fput
                |          |                     |          |          |          |          ____fput
                |          |                     |          |          |          |          task_work_run
                |          |                     |          |          |          |          do_notify_resume
                |          |                     |          |          |          |          int_signal
                |          |                     |          |          |          |          __libc_close
                |          |                     |          |          |          |          std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
                |          |                     |          |          |          |          
                |          |                     |          |          |           --12.90%-- xfs_setattr_size
                |          |                     |          |          |                     xfs_vn_setattr
                |          |                     |          |          |                     notify_change
                |          |                     |          |          |                     do_truncate
                |          |                     |          |          |                     do_sys_ftruncate.constprop.15
                |          |                     |          |          |                     sys_ftruncate
                |          |                     |          |          |                     entry_SYSCALL_64_fastpath
                |          |                     |          |          |                     |          
                |          |                     |          |          |                      --100.00%-- __GI___ftruncate64
                |          |                     |          |          |                                syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--20.00%-- 0x7f4ad0bff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--20.00%-- 0x7f4acedff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--10.00%-- 0x7f4ad2fff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--10.00%-- 0x7f4ad17ff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--10.00%-- 0x7f4ad11ff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--10.00%-- 0x7f4ad05ff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                |--10.00%-- 0x7f4acf3ff700
                |          |                     |          |          |                                |          
                |          |                     |          |          |                                 --10.00%-- 0x7f4acdbff700
                |          |                     |          |          |          
                |          |                     |          |           --7.00%-- xfs_alloc_vextent
                |          |                     |          |                     xfs_bmap_btalloc
                |          |                     |          |                     xfs_bmap_alloc
                |          |                     |          |                     xfs_bmapi_write
                |          |                     |          |                     xfs_iomap_write_direct
                |          |                     |          |                     __xfs_get_blocks
                |          |                     |          |                     xfs_get_blocks_direct
                |          |                     |          |                     do_blockdev_direct_IO
                |          |                     |          |                     __blockdev_direct_IO
                |          |                     |          |                     xfs_vm_direct_IO
                |          |                     |          |                     xfs_file_dio_aio_write
                |          |                     |          |                     xfs_file_write_iter
                |          |                     |          |                     aio_run_iocb
                |          |                     |          |                     do_io_submit
                |          |                     |          |                     sys_io_submit
                |          |                     |          |                     entry_SYSCALL_64_fastpath
                |          |                     |          |                     io_submit
                |          |                     |          |                     0x46d98a
                |          |                     |          |          
                |          |                     |           --1.96%-- xfs_read_agi
                |          |                     |                     xfs_iunlink_remove
                |          |                     |                     xfs_ifree
                |          |                     |                     xfs_inactive_ifree
                |          |                     |                     xfs_inactive
                |          |                     |                     xfs_fs_evict_inode
                |          |                     |                     evict
                |          |                     |                     iput
                |          |                     |                     __dentry_kill
                |          |                     |                     dput
                |          |                     |                     __fput
                |          |                     |                     ____fput
                |          |                     |                     task_work_run
                |          |                     |                     do_notify_resume
                |          |                     |                     int_signal
                |          |                     |                     __libc_close
                |          |                     |                     std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
                |          |                     |          
                |          |                      --2.86%-- xfs_trans_get_buf_map
                |          |                                xfs_btree_get_bufl
                |          |                                xfs_bmap_extents_to_btree
                |          |                                xfs_bmap_add_extent_hole_real
                |          |                                xfs_bmapi_write
                |          |                                xfs_iomap_write_direct
                |          |                                __xfs_get_blocks
                |          |                                xfs_get_blocks_direct
                |          |                                do_blockdev_direct_IO
                |          |                                __blockdev_direct_IO
                |          |                                xfs_vm_direct_IO
                |          |                                xfs_file_dio_aio_write
                |          |                                xfs_file_write_iter
                |          |                                aio_run_iocb
                |          |                                do_io_submit
                |          |                                sys_io_submit
                |          |                                entry_SYSCALL_64_fastpath
                |          |                                io_submit
                |          |                                0x46d98a
                |          |          
                |          |--13.48%-- eventfd_ctx_read
                |          |          eventfd_read
                |          |          __vfs_read
                |          |          vfs_read
                |          |          sys_read
                |          |          entry_SYSCALL_64_fastpath
                |          |          0x7f4ade6f754d
                |          |          smp_message_queue::respond
                |          |          0xffffffffffffffff
                |          |          
                |          |--7.83%-- md_flush_request
                |          |          raid0_make_request
                |          |          md_make_request
                |          |          generic_make_request
                |          |          submit_bio
                |          |          |          
                |          |          |--92.54%-- submit_bio_wait
                |          |          |          blkdev_issue_flush
                |          |          |          xfs_blkdev_issue_flush
                |          |          |          xfs_file_fsync
                |          |          |          vfs_fsync_range
                |          |          |          do_fsync
                |          |          |          sys_fdatasync
                |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          
                |          |          |           --100.00%-- 0x7f4ade4212ad
                |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |                     0x6010000c3ec0
                |          |          |          
                |          |           --7.46%-- _xfs_buf_ioapply
                |          |                     xfs_buf_submit
                |          |                     xlog_bdstrat
                |          |                     xlog_sync
                |          |                     xlog_state_release_iclog
                |          |                     |          
                |          |                     |--73.33%-- _xfs_log_force_lsn
                |          |                     |          xfs_file_fsync
                |          |                     |          vfs_fsync_range
                |          |                     |          do_fsync
                |          |                     |          sys_fdatasync
                |          |                     |          entry_SYSCALL_64_fastpath
                |          |                     |          |          
                |          |                     |           --100.00%-- 0x7f4ade4212ad
                |          |                     |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |                     |                     0x6080000c3ec0
                |          |                     |          
                |          |                      --26.67%-- _xfs_log_force
                |          |                                xfs_log_force
                |          |                                xfs_buf_lock
                |          |                                _xfs_buf_find
                |          |                                xfs_buf_get_map
                |          |                                xfs_trans_get_buf_map
                |          |                                xfs_btree_get_bufl
                |          |                                xfs_bmap_extents_to_btree
                |          |                                xfs_bmap_add_extent_hole_real
                |          |                                xfs_bmapi_write
                |          |                                xfs_iomap_write_direct
                |          |                                __xfs_get_blocks
                |          |                                xfs_get_blocks_direct
                |          |                                do_blockdev_direct_IO
                |          |                                __blockdev_direct_IO
                |          |                                xfs_vm_direct_IO
                |          |                                xfs_file_dio_aio_write
                |          |                                xfs_file_write_iter
                |          |                                aio_run_iocb
                |          |                                do_io_submit
                |          |                                sys_io_submit
                |          |                                entry_SYSCALL_64_fastpath
                |          |                                io_submit
                |          |                                0x46d98a
                |          |          
                |          |--5.53%-- _xfs_log_force_lsn
                |          |          |          
                |          |          |--80.28%-- xfs_file_fsync
                |          |          |          vfs_fsync_range
                |          |          |          do_fsync
                |          |          |          sys_fdatasync
                |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          
                |          |          |           --100.00%-- 0x7f4ade4212ad
                |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |                     |          
                |          |          |                     |--97.92%-- 0x60d0000c3ec0
                |          |          |                     |          
                |          |          |                     |--1.04%-- 0x6020000c3ec0
                |          |          |                     |          
                |          |          |                      --1.04%-- 0x600000557ec0
                |          |          |          
                |          |           --19.72%-- xfs_dir_fsync
                |          |                     vfs_fsync_range
                |          |                     do_fsync
                |          |                     sys_fdatasync
                |          |                     entry_SYSCALL_64_fastpath
                |          |                     |          
                |          |                      --100.00%-- 0x7f4ade4212ad
                |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |                                0x6040000c3ec0
                |          |          
                |          |--1.25%-- rwsem_down_read_failed
                |          |          call_rwsem_down_read_failed
                |          |          |          
                |          |          |--90.62%-- xfs_ilock
                |          |          |          |          
                |          |          |          |--86.21%-- xfs_ilock_data_map_shared
                |          |          |          |          __xfs_get_blocks
                |          |          |          |          xfs_get_blocks_direct
                |          |          |          |          do_blockdev_direct_IO
                |          |          |          |          __blockdev_direct_IO
                |          |          |          |          xfs_vm_direct_IO
                |          |          |          |          xfs_file_dio_aio_write
                |          |          |          |          xfs_file_write_iter
                |          |          |          |          aio_run_iocb
                |          |          |          |          do_io_submit
                |          |          |          |          sys_io_submit
                |          |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          |          
                |          |          |          |           --100.00%-- io_submit
                |          |          |          |                     0x46d98a
                |          |          |          |          
                |          |          |          |--6.90%-- xfs_file_fsync
                |          |          |          |          vfs_fsync_range
                |          |          |          |          do_fsync
                |          |          |          |          sys_fdatasync
                |          |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          0x7f4ade4212ad
                |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |          |          0x6090000c3ec0
                |          |          |          |          
                |          |          |           --6.90%-- xfs_dir_fsync
                |          |          |                     vfs_fsync_range
                |          |          |                     do_fsync
                |          |          |                     sys_fdatasync
                |          |          |                     entry_SYSCALL_64_fastpath
                |          |          |                     0x7f4ade4212ad
                |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |                     0x6070000c3ec0
                |          |          |          
                |          |           --9.38%-- xfs_log_commit_cil
                |          |                     __xfs_trans_commit
                |          |                     xfs_trans_commit
                |          |                     |          
                |          |                     |--33.33%-- xfs_setattr_size
                |          |                     |          xfs_vn_setattr
                |          |                     |          notify_change
                |          |                     |          do_truncate
                |          |                     |          do_sys_ftruncate.constprop.15
                |          |                     |          sys_ftruncate
                |          |                     |          entry_SYSCALL_64_fastpath
                |          |                     |          __GI___ftruncate64
                |          |                     |          syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
                |          |                     |          0x7f4acedff700
                |          |                     |          
                |          |                     |--33.33%-- xfs_vn_update_time
                |          |                     |          file_update_time
                |          |                     |          xfs_file_aio_write_checks
                |          |                     |          xfs_file_dio_aio_write
                |          |                     |          xfs_file_write_iter
                |          |                     |          aio_run_iocb
                |          |                     |          do_io_submit
                |          |                     |          sys_io_submit
                |          |                     |          entry_SYSCALL_64_fastpath
                |          |                     |          io_submit
                |          |                     |          0x46d98a
                |          |                     |          
                |          |                      --33.33%-- xfs_bmap_add_attrfork
                |          |                                xfs_attr_set
                |          |                                xfs_initxattrs
                |          |                                security_inode_init_security
                |          |                                xfs_init_security
                |          |                                xfs_generic_create
                |          |                                xfs_vn_mknod
                |          |                                xfs_vn_create
                |          |                                vfs_create
                |          |                                path_openat
                |          |                                do_filp_open
                |          |                                do_sys_open
                |          |                                sys_open
                |          |                                entry_SYSCALL_64_fastpath
                |          |                                0x7f4ade6f7cdd
                |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, reactor::open_file_dma(basic_sstring<char, unsigned int, 15u>, open_flags, file_open_options)::{lambda()#1}>::process
                |          |                                0xffffffffffffffff
                |          |          
                |          |--0.97%-- rwsem_down_write_failed
                |          |          call_rwsem_down_write_failed
                |          |          xfs_ilock
                |          |          xfs_vn_update_time
                |          |          file_update_time
                |          |          xfs_file_aio_write_checks
                |          |          xfs_file_dio_aio_write
                |          |          xfs_file_write_iter
                |          |          aio_run_iocb
                |          |          do_io_submit
                |          |          sys_io_submit
                |          |          entry_SYSCALL_64_fastpath
                |          |          io_submit
                |          |          0x46d98a
                |          |          
                |          |--0.51%-- xlog_cil_force_lsn
                |          |          |          
                |          |          |--92.31%-- _xfs_log_force_lsn
                |          |          |          |          
                |          |          |          |--91.67%-- xfs_file_fsync
                |          |          |          |          vfs_fsync_range
                |          |          |          |          do_fsync
                |          |          |          |          sys_fdatasync
                |          |          |          |          entry_SYSCALL_64_fastpath
                |          |          |          |          0x7f4ade4212ad
                |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |          |          0x60b0000c3ec0
                |          |          |          |          
                |          |          |           --8.33%-- xfs_dir_fsync
                |          |          |                     vfs_fsync_range
                |          |          |                     do_fsync
                |          |          |                     sys_fdatasync
                |          |          |                     entry_SYSCALL_64_fastpath
                |          |          |                     0x7f4ade4212ad
                |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                |          |          |                     0x60d0000c3ec0
                |          |          |          
                |          |           --7.69%-- _xfs_log_force
                |          |                     xfs_log_force
                |          |                     xfs_buf_lock
                |          |                     _xfs_buf_find
                |          |                     xfs_buf_get_map
                |          |                     xfs_trans_get_buf_map
                |          |                     xfs_btree_get_bufl
                |          |                     xfs_bmap_extents_to_btree
                |          |                     xfs_bmap_add_extent_hole_real
                |          |                     xfs_bmapi_write
                |          |                     xfs_iomap_write_direct
                |          |                     __xfs_get_blocks
                |          |                     xfs_get_blocks_direct
                |          |                     do_blockdev_direct_IO
                |          |                     __blockdev_direct_IO
                |          |                     xfs_vm_direct_IO
                |          |                     xfs_file_dio_aio_write
                |          |                     xfs_file_write_iter
                |          |                     aio_run_iocb
                |          |                     do_io_submit
                |          |                     sys_io_submit
                |          |                     entry_SYSCALL_64_fastpath
                |          |                     io_submit
                |          |                     0x46d98a
                |           --0.04%-- [...]
                |          
                 --3.82%-- preempt_schedule_common
                           |          
                           |--99.02%-- _cond_resched
                           |          |          
                           |          |--41.58%-- wait_for_completion
                           |          |          |          
                           |          |          |--66.67%-- flush_work
                           |          |          |          xlog_cil_force_lsn
                           |          |          |          |          
                           |          |          |          |--96.43%-- _xfs_log_force_lsn
                           |          |          |          |          |          
                           |          |          |          |          |--77.78%-- xfs_file_fsync
                           |          |          |          |          |          vfs_fsync_range
                           |          |          |          |          |          do_fsync
                           |          |          |          |          |          sys_fdatasync
                           |          |          |          |          |          entry_SYSCALL_64_fastpath
                           |          |          |          |          |          0x7f4ade4212ad
                           |          |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                           |          |          |          |          |          0x6030000c3ec0
                           |          |          |          |          |          
                           |          |          |          |           --22.22%-- xfs_dir_fsync
                           |          |          |          |                     vfs_fsync_range
                           |          |          |          |                     do_fsync
                           |          |          |          |                     sys_fdatasync
                           |          |          |          |                     entry_SYSCALL_64_fastpath
                           |          |          |          |                     |          
                           |          |          |          |                      --100.00%-- 0x7f4ade4212ad
                           |          |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                           |          |          |          |                                0x6030000c3ec0
                           |          |          |          |          
                           |          |          |           --3.57%-- _xfs_log_force
                           |          |          |                     xfs_log_force
                           |          |          |                     xfs_buf_lock
                           |          |          |                     _xfs_buf_find
                           |          |          |                     xfs_buf_get_map
                           |          |          |                     xfs_trans_get_buf_map
                           |          |          |                     xfs_btree_get_bufl
                           |          |          |                     xfs_bmap_extents_to_btree
                           |          |          |                     xfs_bmap_add_extent_hole_real
                           |          |          |                     xfs_bmapi_write
                           |          |          |                     xfs_iomap_write_direct
                           |          |          |                     __xfs_get_blocks
                           |          |          |                     xfs_get_blocks_direct
                           |          |          |                     do_blockdev_direct_IO
                           |          |          |                     __blockdev_direct_IO
                           |          |          |                     xfs_vm_direct_IO
                           |          |          |                     xfs_file_dio_aio_write
                           |          |          |                     xfs_file_write_iter
                           |          |          |                     aio_run_iocb
                           |          |          |                     do_io_submit
                           |          |          |                     sys_io_submit
                           |          |          |                     entry_SYSCALL_64_fastpath
                           |          |          |                     io_submit
                           |          |          |                     0x46d98a
                           |          |          |          
                           |          |           --33.33%-- submit_bio_wait
                           |          |                     blkdev_issue_flush
                           |          |                     xfs_blkdev_issue_flush
                           |          |                     xfs_file_fsync
                           |          |                     vfs_fsync_range
                           |          |                     do_fsync
                           |          |                     sys_fdatasync
                           |          |                     entry_SYSCALL_64_fastpath
                           |          |                     |          
                           |          |                      --100.00%-- 0x7f4ade4212ad
                           |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                           |          |                                0x6030000c3ec0
                           |          |          
                           |          |--33.66%-- flush_work
                           |          |          xlog_cil_force_lsn
                           |          |          |          
                           |          |          |--97.06%-- _xfs_log_force_lsn
                           |          |          |          |          
                           |          |          |          |--78.79%-- xfs_file_fsync
                           |          |          |          |          vfs_fsync_range
                           |          |          |          |          do_fsync
                           |          |          |          |          sys_fdatasync
                           |          |          |          |          entry_SYSCALL_64_fastpath
                           |          |          |          |          0x7f4ade4212ad
                           |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                           |          |          |          |          0x6030000c3ec0
                           |          |          |          |          
                           |          |          |           --21.21%-- xfs_dir_fsync
                           |          |          |                     vfs_fsync_range
                           |          |          |                     do_fsync
                           |          |          |                     sys_fdatasync
                           |          |          |                     entry_SYSCALL_64_fastpath
                           |          |          |                     |          
                           |          |          |                      --100.00%-- 0x7f4ade4212ad
                           |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
                           |          |          |                                0x6030000c3ec0
                           |          |          |          
                           |          |           --2.94%-- _xfs_log_force
                           |          |                     xfs_log_force
                           |          |                     xfs_buf_lock
                           |          |                     _xfs_buf_find
                           |          |                     xfs_buf_get_map
                           |          |                     xfs_trans_get_buf_map
                           |          |                     xfs_btree_get_bufl
                           |          |                     xfs_bmap_extents_to_btree
                           |          |                     xfs_bmap_add_extent_hole_real
                           |          |                     xfs_bmapi_write
                           |          |                     xfs_iomap_write_direct
                           |          |                     __xfs_get_blocks
                           |          |                     xfs_get_blocks_direct
                           |          |                     do_blockdev_direct_IO
                           |          |                     __blockdev_direct_IO
                           |          |                     xfs_vm_direct_IO
                           |          |                     xfs_file_dio_aio_write
                           |          |                     xfs_file_write_iter
                           |          |                     aio_run_iocb
                           |          |                     do_io_submit
                           |          |                     sys_io_submit
                           |          |                     entry_SYSCALL_64_fastpath
                           |          |                     io_submit
                           |          |                     0x46d98a
                           |          |          
                           |          |--13.86%-- lock_sock_nested
                           |          |          |          
                           |          |          |--78.57%-- tcp_sendmsg
                           |          |          |          inet_sendmsg
                           |          |          |          sock_sendmsg
                           |          |          |          SYSC_sendto
                           |          |          |          sys_sendto
                           |          |          |          entry_SYSCALL_64_fastpath
                           |          |          |          __libc_send
                           |          |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
                           |          |          |          |          
                           |          |          |          |--36.36%-- 0x7f4ad6bf8de0
                           |          |          |          |          
                           |          |          |          |--9.09%-- 0x4
                           |          |          |          |          
                           |          |          |          |--9.09%-- 0x7f4adadf8de0
                           |          |          |          |          
                           |          |          |          |--9.09%-- 0x7f4ada1f8de0
                           |          |          |          |          
                           |          |          |          |--9.09%-- 0x7f4ad89f8de0
                           |          |          |          |          
                           |          |          |          |--9.09%-- 0x7f4ad83f8de0
                           |          |          |          |          
                           |          |          |          |--9.09%-- 0x7f4ad4df8de0
                           |          |          |          |          
                           |          |          |           --9.09%-- 0x7f4ad35f8de0
                           |          |          |          
                           |          |           --21.43%-- tcp_recvmsg
                           |          |                     inet_recvmsg
                           |          |                     sock_recvmsg
                           |          |                     sock_read_iter
                           |          |                     __vfs_read
                           |          |                     vfs_read
                           |          |                     sys_read
                           |          |                     entry_SYSCALL_64_fastpath
                           |          |                     0x7f4ade6f754d
                           |          |                     reactor::read_some
                           |          |                     |          
                           |          |                     |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv
                           |          |                     |          reactor::del_timer
                           |          |                     |          0x6160000e2040
                           |          |                     |          
                           |          |                      --33.33%-- continuation<future<> future<>::then_wrapped<future<> future<>::finally<auto seastar::with_gate<transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}>(seastar::gate&, transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>(seastar::gate&)::{lambda(future<>)#1}::operator()(future<>)::{lambda(seastar::gate)#1}, future<> >(seastar::gate&)::{lambda(seastar::gate&)#1}>::run
                           |          |                                reactor::del_timer
                           |          |                                0x6030000e2040
                           |          |          
                           |          |--3.96%-- generic_make_request_checks
                           |          |          generic_make_request
                           |          |          submit_bio
                           |          |          do_blockdev_direct_IO
                           |          |          __blockdev_direct_IO
                           |          |          xfs_vm_direct_IO
                           |          |          xfs_file_dio_aio_write
                           |          |          xfs_file_write_iter
                           |          |          aio_run_iocb
                           |          |          do_io_submit
                           |          |          sys_io_submit
                           |          |          entry_SYSCALL_64_fastpath
                           |          |          io_submit
                           |          |          0x46d98a
                           |          |          
                           |          |--3.96%-- kmem_cache_alloc_node
                           |          |          __alloc_skb
                           |          |          sk_stream_alloc_skb
                           |          |          tcp_sendmsg
                           |          |          inet_sendmsg
                           |          |          sock_sendmsg
                           |          |          SYSC_sendto
                           |          |          sys_sendto
                           |          |          entry_SYSCALL_64_fastpath
                           |          |          __libc_send
                           |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
                           |          |          |          
                           |          |          |--25.00%-- 0x7f4ad9bf8de0
                           |          |          |          
                           |          |          |--25.00%-- 0x7f4ad7df8de0
                           |          |          |          
                           |          |          |--25.00%-- 0x7f4ad77f8de0
                           |          |          |          
                           |          |           --25.00%-- 0x7f4ad59f8de0
                           |          |          
                           |          |--0.99%-- unmap_underlying_metadata
                           |          |          do_blockdev_direct_IO
                           |          |          __blockdev_direct_IO
                           |          |          xfs_vm_direct_IO
                           |          |          xfs_file_dio_aio_write
                           |          |          xfs_file_write_iter
                           |          |          aio_run_iocb
                           |          |          do_io_submit
                           |          |          sys_io_submit
                           |          |          entry_SYSCALL_64_fastpath
                           |          |          io_submit
                           |          |          0x46d98a
                           |          |          
                           |          |--0.99%-- __kmalloc_node_track_caller
                           |          |          __kmalloc_reserve.isra.32
                           |          |          __alloc_skb
                           |          |          sk_stream_alloc_skb
                           |          |          tcp_sendmsg
                           |          |          inet_sendmsg
                           |          |          sock_sendmsg
                           |          |          SYSC_sendto
                           |          |          sys_sendto
                           |          |          entry_SYSCALL_64_fastpath
                           |          |          __libc_send
                           |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
                           |          |          0x7f4ad6bf8de0
                           |          |          
                           |           --0.99%-- task_work_run
                           |                     do_notify_resume
                           |                     int_signal
                           |                     __libc_close
                           |                     std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
                           |          
                            --0.98%-- __cond_resched_softirq
                                      release_sock
                                      tcp_sendmsg
                                      inet_sendmsg
                                      sock_sendmsg
                                      SYSC_sendto
                                      sys_sendto
                                      entry_SYSCALL_64_fastpath
                                      __libc_send
                                      _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
                                      0x7f4ada1f8de0



#
# (For a higher level overview, try: perf report --sort comm,dso)
#

[-- Attachment #3: example-stale.txt --]
[-- Type: text/plain, Size: 1700 bytes --]

[164814.835933] CPU: 22 PID: 48042 Comm: scylla Tainted: G            E   4.2.6-200.fc22.x86_64 #1
[164814.835936] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
[164814.835937]  0000000000000000 00000000a8713b7a ffff8802fb977ab8 ffffffff817729ea
[164814.835941]  0000000000000000 ffff88076a69f780 ffff8802fb977ad8 ffffffffa03217a6
[164814.835946]  ffff88077119bcb0 0000000000000000 ffff8802fb977b08 ffffffffa034e749
[164814.835951] Call Trace:
[164814.835954]  [<ffffffff817729ea>] dump_stack+0x45/0x57
[164814.835971]  [<ffffffffa03217a6>] xfs_buf_stale+0x26/0x80 [xfs]
[164814.835989]  [<ffffffffa034e749>] xfs_trans_binval+0x79/0x100 [xfs]
[164814.836001]  [<ffffffffa02f479b>] xfs_bmap_btree_to_extents+0x12b/0x1a0 [xfs]
[164814.836012]  [<ffffffffa02f8977>] xfs_bunmapi+0x967/0x9f0 [xfs]
[164814.836027]  [<ffffffffa0334b9e>] xfs_itruncate_extents+0x10e/0x220 [xfs]
[164814.836044]  [<ffffffffa033f75a>] ? kmem_zone_alloc+0x5a/0xe0 [xfs]
[164814.836084]  [<ffffffffa0334d49>] xfs_inactive_truncate+0x99/0x110 [xfs]
[164814.836120]  [<ffffffffa0335aa2>] xfs_inactive+0x102/0x120 [xfs]
[164814.836135]  [<ffffffffa033a6cf>] xfs_fs_evict_inode+0x6f/0xa0 [xfs]
[164814.836138]  [<ffffffff81238d76>] evict+0xa6/0x170
[164814.836140]  [<ffffffff81239026>] iput+0x196/0x220
[164814.836147]  [<ffffffff81234fe4>] __dentry_kill+0x174/0x1c0
[164814.836150]  [<ffffffff8123514b>] dput+0x11b/0x200
[164814.836155]  [<ffffffff8121fe02>] __fput+0x172/0x1e0
[164814.836158]  [<ffffffff8121febe>] ____fput+0xe/0x10
[164814.836161]  [<ffffffff810bab75>] task_work_run+0x85/0xb0
[164814.836164]  [<ffffffff81014a4d>] do_notify_resume+0x8d/0x90
[164814.836167]  [<ffffffff817795bc>] int_signal+0x12/0x17

[-- Attachment #4: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-28  2:43 sleeps and waits during io_submit Glauber Costa
@ 2015-11-30 14:10 ` Brian Foster
  2015-11-30 14:29   ` Avi Kivity
  2015-11-30 15:49   ` Glauber Costa
  2015-11-30 23:10 ` Dave Chinner
  1 sibling, 2 replies; 58+ messages in thread
From: Brian Foster @ 2015-11-30 14:10 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, xfs

On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote:
> Hello my dear XFSers,
> 
> For those of you who don't know, we at ScyllaDB produce a modern NoSQL
> data store that, at the moment, runs on top of XFS only. We deal
> exclusively with asynchronous and direct IO, due to our
> thread-per-core architecture. Due to that, we avoid issuing any
> operation that will sleep.
> 
> While debugging an extreme case of bad performance (most likely
> related to a not-so-great disk), I have found a variety of cases in
> which XFS blocks. To find those, I have used perf record -e
> sched:sched_switch -p <pid_of_db>, and I am attaching the perf report
> as xfs-sched_switch.log. Please note that this doesn't tell me for how
> long we block, but as mentioned before, blocking operations outside
> our control are detrimental to us regardless of the elapsed time.
> 
> For those who are not acquainted to our internals, please ignore
> everything in that file but the xfs functions. For the xfs symbols,
> there are two kinds of events: the ones that are a children of
> io_submit, where we don't tolerate blocking, and the ones that are
> children of our helper IO thread, to where we push big operations that
> we know will block until we can get rid of them all. We care about the
> former and ignore the latter.
> 
> Please allow me to ask you a couple of questions about those findings.
> If we are doing anything wrong, advise on best practices is truly
> welcome.
> 
> 1) xfs_buf_lock -> xfs_log_force.
> 
> I've started wondering what would make xfs_log_force sleep. But then I
> have noticed that xfs_log_force will only be called when a buffer is
> marked stale. Most of the times a buffer is marked stale seems to be
> due to errors. Although that is not my case (more on that), it got me
> thinking that maybe the right thing to do would be to avoid hitting
> this case altogether?
> 

I'm not following where you get the "only if marked stale" part? It
certainly looks like that's one potential purpose for the call, but it is
called from a variety of other places as well. E.g., forcing the log by
pushing on the AIL (active item list) when it has pinned items is another
case. The AIL push itself can originate from transaction reservation,
etc., when log space is needed. In other words, I'm not sure this is
something that's easily controlled from userspace, if at all. Rather, it
is a significant part of the wider state machine the fs uses to manage
logging.

> The file example-stale.txt contains a backtrace of the case where we
> are being marked as stale. It seems to be happening when we convert
> the the inode's extents from unwritten to real. Can this case be
> avoided? I won't pretend I know the intricacies of this, but couldn't
> we be keeping extents from the very beginning to avoid creating stale
> buffers?
> 

This is down in xfs_fs_evict_inode()->xfs_inactive(), which generally
runs when an inode is evicted from cache. In this case, it looks like the
inode is unlinked (permanently removed), the extents are being removed,
and a bmap btree block is being invalidated as part of that overall
process. I don't think this has anything to do with unwritten extents.
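
For reference, a minimal sketch (paths and sizes purely illustrative, not
an exact reproducer) of the kind of userspace sequence that ends up in
this path. The extent removal runs when the last reference to an
already-unlinked inode is dropped, i.e. at the close(), which is why the
attached trace shows it under task_work_run()/__fput():

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative path and size only. */
	int fd = open("/mnt/xfs/tmpfile", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;

	/* Drop the last link; the inode stays alive via the open fd. */
	unlink("/mnt/xfs/tmpfile");

	/* Give the inode some extents so there is something to free. */
	if (fallocate(fd, 0, 0, 1UL << 30) < 0) {
		close(fd);
		return 1;
	}

	/*
	 * Final reference drop: __fput -> iput -> evict ->
	 * xfs_fs_evict_inode -> xfs_inactive -> xfs_itruncate_extents,
	 * the path example-stale.txt shows invalidating/staling a bmap
	 * btree buffer.
	 */
	close(fd);
	return 0;
}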

> 2) xfs_buf_lock -> down
> This is one I truly don't understand. What can be causing contention
> in this lock? We never have two different cores writing to the same
> buffer, nor should we have the same core doingCAP_FOWNER so.
> 

This is not one single lock. An XFS buffer is the data structure used to
modify/log/read/write on-disk metadata, and each buffer has its own lock
to prevent corruption. Buffer lock contention is possible because the
filesystem has bits of "global" metadata that have to be updated via
buffers.

For example, one usually has multiple allocation groups to maximize
parallelism, but we still have per-AG metadata that has to be tracked
globally with respect to each AG (e.g., free space trees, inode
allocation trees, etc.). Any operation that affects this metadata (e.g.,
block/inode allocation) has to lock the AGI/AGF buffers along with any
buffers associated with the modified btree leaf/node blocks, etc.

One example in your attached perf traces has several threads looking to
acquire the AGF, which is a per-AG data structure for tracking free
space in the AG. One thread looks like the inode eviction case noted
above (freeing blocks), another looks like a file truncate (also freeing
blocks), and yet another is a block allocation due to a direct I/O
write. Were any of these operations directed to an inode in a separate
AG, they would be able to proceed in parallel (but I believe they would
still hit the same codepaths as far as perf can tell).
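
If it helps to reason about how much of that allocation parallelism is
available on a given filesystem, here is a minimal sketch (assuming the
xfsprogs headers are installed; the ioctl and struct come from
<xfs/xfs.h>) that asks for the AG count via XFS_IOC_FSGEOMETRY:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSGEOMETRY, struct xfs_fsop_geom */

int main(int argc, char **argv)
{
	struct xfs_fsop_geom geo;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-or-dir-on-xfs>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
		perror("XFS_IOC_FSGEOMETRY");
		close(fd);
		return 1;
	}

	/* agcount AGs of agblocks filesystem blocks each. */
	printf("agcount=%u agblocks=%u blocksize=%u\n",
	       geo.agcount, geo.agblocks, geo.blocksize);
	close(fd);
	return 0;
}

(xfs_info reports the same numbers; the AG count itself is chosen at mkfs
time, e.g. mkfs.xfs -d agcount=N.)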

> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time
> 
> You guys seem to have an interface to avoid that, by setting the
> FMODE_NOCMTIME flag. This is done by issuing the open by handle ioctl,
> which will set this flag for all regular files. That's great, but that
> ioctl required CAP_SYS_ADMIN, which is a big no for us, since we run
> our server as an unprivileged user. I don't understand, however, why
> such an strict check is needed. If we have full rights on the
> filesystem, why can't we issue this operation? In my view, CAP_FOWNER
> should already be enough.I do understand the handles have to be stable
> and a file can have its ownership changed, in which case the previous
> owner would keep the handle valid. Is that the reason you went with
> the most restrictive capability ?

I'm not familiar enough with the open-by-handle stuff to comment on the
permission constraints. Perhaps Dave or others can comment further on
this bit...
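
For what it's worth, the userspace side of that interface is usually
driven through xfsprogs' libhandle rather than the raw ioctls. A minimal
sketch (assuming <xfs/handle.h> is available and linking with -lhandle)
of how an application obtains such an fd today, CAP_SYS_ADMIN requirement
and all:

#define _GNU_SOURCE		/* O_DIRECT */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <xfs/handle.h>		/* path_to_handle(), open_by_handle() */

/* Returns an fd opened via the handle interface, or -1 on error. */
int open_nocmtime(const char *path)
{
	void *hanp = NULL;
	size_t hlen = 0;
	int fd;

	if (path_to_handle((char *)path, &hanp, &hlen) < 0) {
		perror("path_to_handle");
		return -1;
	}

	/*
	 * This is the call that ends up in the CAP_SYS_ADMIN-gated ioctl;
	 * per the discussion above, for regular files the resulting fd has
	 * FMODE_NOCMTIME set, so writes skip the c/mtime update.
	 */
	fd = open_by_handle(hanp, hlen, O_RDWR | O_DIRECT);
	if (fd < 0)
		perror("open_by_handle");

	free_handle(hanp, hlen);
	return fd;
}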

Brian

> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 2K of event 'sched:sched_switch'
> # Event count (approx.): 2669
> #
> # Overhead  Command  Shared Object      Symbol        
> # ........  .......  .................  ..............
> #
>    100.00%  scylla   [kernel.kallsyms]  [k] __schedule
>              |
>              ---__schedule
>                 |          
>                 |--96.18%-- schedule
>                 |          |          
>                 |          |--56.14%-- schedule_user
>                 |          |          |          
>                 |          |          |--53.30%-- int_careful
>                 |          |          |          |          
>                 |          |          |          |--45.05%-- 0x7f4ade6f74ed
>                 |          |          |          |          reactor_backend_epoll::make_reactor_notifier
>                 |          |          |          |          |          
>                 |          |          |          |          |--67.63%-- syscall_work_queue::submit_item
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--32.05%-- posix_file_impl::truncate
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--65.33%-- _ZN12continuationIZN6futureIJEE4thenIZN19file_data_sink_impl5flushEvEUlvE_S1_EET0_OT_EUlS7_E_JEE3runEv
>                 |          |          |          |          |          |          |          reactor::del_timer
>                 |          |          |          |          |          |          |          0x60b0000e2040
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--20.00%-- db::commitlog::segment::flush(unsigned long)::{lambda()#1}::operator()
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |          |--73.33%-- future<>::then<db::commitlog::segment::flush(unsigned long)::{lambda()#1}, future<lw_shared_ptr<db::commitlog::segment> > >
>                 |          |          |          |          |          |          |          |          _ZN12continuationIZN6futureIJ13lw_shared_ptrIN2db9commitlog7segmentEEEE4thenIZNS4_4syncEvEUlT_E_S6_EET0_OS8_EUlSB_E_JS5_EE3runEv
>                 |          |          |          |          |          |          |          |          reactor::del_timer
>                 |          |          |          |          |          |          |          |          0x60e0000e2040
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |           --26.67%-- _ZN12continuationIZN6futureIJEE4thenIZN2db9commitlog7segment5flushEmEUlvE_S0_IJ13lw_shared_ptrIS5_EEEEET0_OT_EUlSC_E_JEE3runEv
>                 |          |          |          |          |          |          |                     reactor::del_timer
>                 |          |          |          |          |          |          |                     0x6090000e2040
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--10.67%-- sstables::sstable::seal_sstable
>                 |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd
 a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
>                 |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --4.00%-- sstables::sstable::write_toc
>                 |          |          |          |          |          |                     sstables::sstable::prepare_write_components
>                 |          |          |          |          |          |                     |          
>                 |          |          |          |          |          |                     |--50.00%-- 0x4d3a4f6ec4e8cd75
>                 |          |          |          |          |          |                     |          
>                 |          |          |          |          |          |                      --50.00%-- 0x3ebf3dd80e3b174d
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--23.93%-- posix_file_impl::discard
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--82.14%-- _ZN12continuationIZN6futureIImEE4thenIZN19file_data_sink_impl6do_putEm16temporary_bufferIcEEUlmE_S0_IIEEEET0_OT_EUlSA_E_ImEE3runEv
>                 |          |          |          |          |          |          |          reactor::del_timer
>                 |          |          |          |          |          |          |          0x6080000e2040
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --17.86%-- futurize<future<lw_shared_ptr<db::commitlog::segment> > >::apply<db::commitlog::segment_manager::allocate_segment(bool)::{lambda(file)#1}, file>
>                 |          |          |          |          |          |                     _ZN12continuationIZN6futureIJ4fileEE4thenIZN2db9commitlog15segment_manager16allocate_segmentEbEUlS1_E_S0_IJ13lw_shared_ptrINS5_7segmentEEEEEET0_OT_EUlSE_E_JS1_EE3runEv
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--20.94%-- reactor::open_file_dma
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--20.41%-- db::commitlog::segment_manager::allocate_segment
>                 |          |          |          |          |          |          |          db::commitlog::segment_manager::on_timer()::{lambda()#1}::operator()
>                 |          |          |          |          |          |          |          0xb8c264
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--14.29%-- sstables::sstable::write_simple<(sstables::sstable::component_type)8, sstables::statistics>
>                 |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd
 a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
>                 |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--12.24%-- sstables::write_crc
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |          |--16.67%-- 0x313532343536002f
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |          |--16.67%-- 0x373633323533002f
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |          |--16.67%-- 0x363139333232002f
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |          |--16.67%-- 0x353933303330002f
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |          |--16.67%-- 0x383930383133002f
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |           --16.67%-- 0x323338303037002f
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--12.24%-- sstables::write_digest
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)7, sstables::filter>
>                 |          |          |          |          |          |          |          sstables::sstable::write_filter
>                 |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd
 a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
>                 |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)4, sstables::summary_ka>
>                 |          |          |          |          |          |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd
 a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
>                 |          |          |          |          |          |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--10.20%-- 0x78d93b
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--6.12%-- sstables::sstable::open_data
>                 |          |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |           --100.00%-- 0x8000000004000000
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --4.08%-- sstables::sstable::write_toc
>                 |          |          |          |          |          |                     sstables::sstable::prepare_write_components
>                 |          |          |          |          |          |                     |          
>                 |          |          |          |          |          |                      --100.00%-- 0x6100206690ef
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--18.38%-- syscall_work_queue::submit_item
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--10.00%-- 0x7f4ad89f8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--7.50%-- 0x7f4ad83f8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--7.50%-- 0x7f4ad6bf8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--7.50%-- 0x7f4ad65f8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--5.00%-- 0x60b015e8cd90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--5.00%-- 0x60100acaed90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--5.00%-- 0x607006f04d90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--5.00%-- 0xffffffffffffa5d0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60e01acbed90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60e01acbec60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60a018d7ad90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60a018d7ac60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60b015e8cc60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60900bb8ad60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60100acaec60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60800951dd90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60800951dc60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60d009089d90
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60d009089c60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x607006f04c60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x60f005984d60
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x7f4ad77f8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x7f4adb9f8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x7f4ad9bf8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x7f4ad7df8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.50%-- 0x7f4ad77f8fe0
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --2.50%-- 0x7f4ad5ff8fe0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.99%-- reactor::open_directory
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--57.14%-- sstables::sstable::filename
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --42.86%-- sstables::sstable::write_toc
>                 |          |          |          |          |          |                     sstables::sstable::prepare_write_components
>                 |          |          |          |          |          |                     |          
>                 |          |          |          |          |          |                     |--50.00%-- 0x4d3a4f6ec4e8cd75
>                 |          |          |          |          |          |                     |          
>                 |          |          |          |          |          |                      --50.00%-- 0x3ebf3dd80e3b174d
>                 |          |          |          |          |          |          
>                 |          |          |          |          |           --1.71%-- reactor::rename_file
>                 |          |          |          |          |                     sstables::sstable::seal_sstable
>                 |          |          |          |          |                     std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::ty
 pe ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
>                 |          |          |          |          |                     _GLOBAL__sub_I__ZN12app_templateC2Ev
>                 |          |          |          |          |          
>                 |          |          |          |           --32.37%-- _ZN12continuationIZN6futureIJEE4thenIZN18syscall_work_queue11submit_itemEPNS3_9work_itemEEUlvE_S1_EET0_OT_EUlS9_E_JEE3runEv
>                 |          |          |          |                     reactor::del_timer
>                 |          |          |          |                     0x60d0000e2040
>                 |          |          |          |          
>                 |          |          |          |--29.04%-- __vdso_clock_gettime
>                 |          |          |          |          
>                 |          |          |          |--19.66%-- 0x7f4ade42b193
>                 |          |          |          |          reactor_backend_epoll::complete_epoll_event
>                 |          |          |          |          |          
>                 |          |          |          |          |--41.61%-- smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--79.03%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--95.92%-- 0x6070000c3000
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |          |--2.04%-- 0x61d0000c1000
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --2.04%-- 0x61d0000c1000
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--3.23%-- 0x14dd51
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x162a54
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x161dca
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x159c8b
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x1598b5
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x14dd3e
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x14bad8
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x14a880
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x127105
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- 0x6070000e2040
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--1.61%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
>                 |          |          |          |          |          |          0x60d0000c3000
>                 |          |          |          |          |          |          
>                 |          |          |          |          |           --1.61%-- __vdso_clock_gettime
>                 |          |          |          |          |                     0x7f4ad77f9160
>                 |          |          |          |          |          
>                 |          |          |          |          |--30.20%-- __restore_rt
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--57.14%-- __vdso_clock_gettime
>                 |          |          |          |          |          |          0x1d
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--9.52%-- smp_message_queue::smp_message_queue
>                 |          |          |          |          |          |          0x6070000c3000
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--4.76%-- 0x600000357240
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--4.76%-- 0x60000031a640
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- posix_file_impl::list_directory
>                 |          |          |          |          |          |          0x609000044730
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x46efbf
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x600000442e40
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x600000376440
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x6000002bac40
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x600000295640
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x600000289e40
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x60000031a640
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--2.38%-- 0x7f4ade6f74ed
>                 |          |          |          |          |          |          __libc_siglongjmp
>                 |          |          |          |          |          |          0x60000047be40
>                 |          |          |          |          |          |          
>                 |          |          |          |          |           --2.38%-- 0x7f4adb3f7fd0
>                 |          |          |          |          |          
>                 |          |          |          |          |--14.09%-- 0x33
>                 |          |          |          |          |          
>                 |          |          |          |          |--12.08%-- promise<temporary_buffer<char> >::promise
>                 |          |          |          |          |          _ZN6futureIJ16temporary_bufferIcEEE4thenIZN12input_streamIcE12read_exactlyEmEUlT_E_S2_EET0_OS6_
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--44.44%-- input_stream<char>::read_exactly
>                 |          |          |          |          |          |          0x8
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--11.11%-- 0x7f4adb3f8ea0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--11.11%-- 0x7f4ad9bf8ea0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--11.11%-- 0x7f4ad89f8ea0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--11.11%-- 0x7f4ad83f8ea0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--5.56%-- 0x7f4ad77f8ea0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |           --5.56%-- 0x7f4ad7df8ea0
>                 |          |          |          |          |          
>                 |          |          |          |          |--1.34%-- 0x7f4ad6bf8d80
>                 |          |          |          |          |          
>                 |          |          |          |           --0.67%-- 0x7f4adadf8d80
>                 |          |          |          |          
>                 |          |          |          |--4.43%-- __libc_send
>                 |          |          |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
>                 |          |          |          |          |          
>                 |          |          |          |          |--14.71%-- 0x4
>                 |          |          |          |          |          
>                 |          |          |          |          |--11.76%-- 0x7f4ad89f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.82%-- 0x7f4adb3f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.82%-- 0x7f4ad9bf8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.82%-- 0x7f4ad77f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.82%-- 0x7f4ad6bf8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--5.88%-- 0x7f4ad83f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--5.88%-- 0x7f4ad7df8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--5.88%-- 0x7f4ad53f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--2.94%-- 0x7f4acc9f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--2.94%-- continuation<future<file>::wait()::{lambda(future_state<file>&&)#1}, file>::~continuation
>                 |          |          |          |          |          0x611003c8e9b8
>                 |          |          |          |          |          
>                 |          |          |          |          |--2.94%-- 0x7f4adb9f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--2.94%-- 0x7f4ad71f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--2.94%-- 0x7f4ad65f8de0
>                 |          |          |          |          |          
>                 |          |          |          |          |--2.94%-- 0x7f4ad59f8de0
>                 |          |          |          |          |          
>                 |          |          |          |           --2.94%-- 0x7f4ad35f8de0
>                 |          |          |          |          
>                 |          |          |          |--1.56%-- 0x7f4ade6f754d
>                 |          |          |          |          reactor::read_some
>                 |          |          |          |          |          
>                 |          |          |          |          |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv
>                 |          |          |          |          |          reactor::del_timer
>                 |          |          |          |          |          0x6070000e2040
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.33%-- _ZN12continuationIZN6futureIIEE4thenIZ5sleepINSt6chrono3_V212system_clockEmSt5ratioILl1ELl1000000EEES1_NS4_8durationIT0_T1_EEEUlvE_S1_EESA_OT_EUlSF_E_IEE3runEv
>                 |          |          |          |          |          reactor::del_timer
>                 |          |          |          |          |          0x6080000e2040
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.33%-- 0x600000483640
>                 |          |          |          |          |          
>                 |          |          |          |          |--8.33%-- 0x600000480440
>                 |          |          |          |          |          
>                 |          |          |          |           --8.33%-- 0x36
>                 |          |          |           --0.26%-- [...]
>                 |          |          |          
>                 |          |           --46.70%-- retint_careful
>                 |          |                     |          
>                 |          |                     |--6.24%-- posix_file_impl::list_directory
>                 |          |                     |          |          
>                 |          |                     |          |--80.00%-- 0x60f0000e2020
>                 |          |                     |          |          
>                 |          |                     |          |--5.00%-- 0x601000044730
>                 |          |                     |          |          
>                 |          |                     |          |--5.00%-- 0x60e000044720
>                 |          |                     |          |          
>                 |          |                     |          |--2.50%-- 0x60f000135500
>                 |          |                     |          |          
>                 |          |                     |          |--2.50%-- 0x6190000e2098
>                 |          |                     |          |          
>                 |          |                     |          |--2.50%-- 0x60d0000c3000
>                 |          |                     |          |          
>                 |          |                     |           --2.50%-- 0x1
>                 |          |                     |          
>                 |          |                     |--3.42%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
>                 |          |                     |          |          
>                 |          |                     |          |--95.65%-- boost::program_options::variables_map::get
>                 |          |                     |          |          
>                 |          |                     |           --4.35%-- 0x618000044680
>                 |          |                     |          
>                 |          |                     |--3.12%-- memory::small_pool::add_more_objects
>                 |          |                     |          |          
>                 |          |                     |          |--10.53%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::clear_and_release
>                 |          |                     |          |          mutation_partition::clustered_row
>                 |          |                     |          |          mutation::set_clustered_cell
>                 |          |                     |          |          cql3::constants::setter::execute
>                 |          |                     |          |          cql3::statements::update_statement::add_update_for_key
>                 |          |                     |          |          _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE
>                 |          |                     |          |          cql3::statements::modification_statement::get_mutations
>                 |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |          |          cql3::query_options::query_options
>                 |          |                     |          |          |          
>                 |          |                     |          |          |--50.00%-- 0x7f4ad77f80e0
>                 |          |                     |          |          |          
>                 |          |                     |          |           --50.00%-- 0x7f4ad6bf80e0
>                 |          |                     |          |          
>                 |          |                     |          |--10.53%-- memory::small_pool::add_more_objects
>                 |          |                     |          |          |          
>                 |          |                     |          |          |--50.00%-- 0x60e00015d000
>                 |          |                     |          |          |          
>                 |          |                     |          |           --50.00%-- 0x60b00af6c758
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60a018ee3867
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60d00d41f680
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x61400c6bb4d0
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60e007c918d6
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60e0078294ce
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x607006ee4da0
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- _ZN12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZN7service13storage_proxy22send_to_live_endpointsEmEUlNSt15__exception_ptr13exception_ptrEE0_EES1_OT_EUlSA_E_S1_EET0_SA_EUlSA_E_JEE3runEv
>                 |          |                     |          |          reactor::del_timer
>                 |          |                     |          |          0x6030000e2040
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- service::storage_proxy::mutate_locally
>                 |          |                     |          |          service::storage_proxy::send_to_live_endpoints
>                 |          |                     |          |          parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}>
>                 |          |                     |          |          0x601000136d00
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60a0001900e0
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60e00015d040
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x61300015d000
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60e00013bde0
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x60b00010f308
>                 |          |                     |          |          
>                 |          |                     |          |--5.26%-- 0x6010000e4808
>                 |          |                     |          |          
>                 |          |                     |           --5.26%-- 0x7f4ad65f7f50
>                 |          |                     |          
>                 |          |                     |--2.82%-- std::unique_ptr<reactor::pollfn, std::default_delete<std::unique_ptr> > reactor::make_pollfn<reactor::run()::{lambda()#3}>(reactor::run()::{lambda()#3}&&)::the_pollfn::poll_and_check_more_work
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
>                 |          |                     |          |          boost::program_options::variables_map::get
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- 0x1
>                 |          |                     |          |          
>                 |          |                     |          |--12.50%-- 0x53
>                 |          |                     |          |          
>                 |          |                     |          |--12.50%-- 0x3e
>                 |          |                     |          |          
>                 |          |                     |          |--12.50%-- 0x24
>                 |          |                     |          |          
>                 |          |                     |           --12.50%-- 0xb958000000000000
>                 |          |                     |          
>                 |          |                     |--2.67%-- std::_Function_handler<partition_presence_checker_result (partition_key const&), column_family::make_partition_presence_checker(lw_shared_ptr<std::map<long, lw_shared_ptr<sstables::sstable>, std::less<long>, std::allocator<std::pair<long const, lw_shared_ptr<sstables::sstable> > > > >)::{lambda(partition_key const&)#1}>::_M_invoke
>                 |          |                     |          |          
>                 |          |                     |          |--66.67%-- 0x1b5c280
>                 |          |                     |          |          
>                 |          |                     |          |--27.78%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::resize
>                 |          |                     |          |          row::apply
>                 |          |                     |          |          mutation_partition_applier::accept_row_cell
>                 |          |                     |          |          mutation_partition_view::accept
>                 |          |                     |          |          
>                 |          |                     |           --5.56%-- 0x2a4399
>                 |          |                     |          
>                 |          |                     |--2.08%-- smp_message_queue::smp_message_queue
>                 |          |                     |          |          
>                 |          |                     |          |--60.00%-- 0x60f0000c3000
>                 |          |                     |          |          
>                 |          |                     |          |--10.00%-- 0x6000002d7240
>                 |          |                     |          |          
>                 |          |                     |          |--10.00%-- 0x19
>                 |          |                     |          |          
>                 |          |                     |          |--10.00%-- 0xb
>                 |          |                     |          |          
>                 |          |                     |           --10.00%-- 0x7
>                 |          |                     |          
>                 |          |                     |--1.93%-- smp_message_queue::process_queue<4ul, smp_message_queue::process_completions()::{lambda(smp_message_queue::work_item*)#1}>
>                 |          |                     |          
>                 |          |                     |--1.63%-- __vdso_clock_gettime
>                 |          |                     |          |          
>                 |          |                     |           --100.00%-- __clock_gettime
>                 |          |                     |                     std::chrono::_V2::system_clock::now
>                 |          |                     |                     0xa63209
>                 |          |                     |          
>                 |          |                     |--1.49%-- memory::small_pool::deallocate
>                 |          |                     |          |          
>                 |          |                     |          |--40.00%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::emplace_back<atomic_cell_or_collection>
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase
>                 |          |                     |          |          service::storage_proxy::got_response
>                 |          |                     |          |          _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv
>                 |          |                     |          |          reactor::del_timer
>                 |          |                     |          |          0x6100000e2040
>                 |          |                     |          |          
>                 |          |                     |          |--10.00%-- cql3::statements::modification_statement::get_mutations
>                 |          |                     |          |          
>                 |          |                     |          |--10.00%-- cql3::statements::modification_statement::build_partition_keys
>                 |          |                     |          |          cql3::statements::modification_statement::create_exploded_clustering_prefix
>                 |          |                     |          |          0x60c014be0b00
>                 |          |                     |          |          
>                 |          |                     |          |--10.00%-- mutation_partition::~mutation_partition
>                 |          |                     |          |          std::vector<mutation, std::allocator<mutation> >::~vector
>                 |          |                     |          |          service::storage_proxy::mutate_with_triggers
>                 |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |          |          cql3::statements::modification_statement::execute
>                 |          |                     |          |          cql3::query_processor::process_statement
>                 |          |                     |          |          transport::cql_server::connection::process_execute
>                 |          |                     |          |          transport::cql_server::connection::process_request_one
>                 |          |                     |          |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |          |          0x8961de
>                 |          |                     |          |          
>                 |          |                     |           --10.00%-- object_deleter_impl<deleter>::~object_deleter_impl
>                 |          |                     |                     _ZN12continuationIZN6futureIJEE12then_wrappedIZZNS1_7finallyIZ7do_withI11foreign_ptrI10shared_ptrIN9transport10cql_server8responseEEEZZNS8_10connection14write_responseEOSB_ENUlvE_clEvEUlRT_E_EDaOSF_OT0_EUlvE_EES1_SI_ENUlS1_E_clES1_EUlSF_E_S1_EESJ_SI_EUlSI_E_JEED0Ev
>                 |          |                     |                     0x61a0000c3db0
>                 |          |                     |          
>                 |          |                     |--1.34%-- dht::decorated_key::equal
>                 |          |                     |          |          
>                 |          |                     |          |--83.33%-- 0x607000138f00
>                 |          |                     |          |          
>                 |          |                     |           --16.67%-- 0x60a0000e0f40
>                 |          |                     |          
>                 |          |                     |--1.34%-- service::storage_proxy::send_to_live_endpoints
>                 |          |                     |          
>                 |          |                     |--1.19%-- transport::cql_server::connection::process_execute
>                 |          |                     |          transport::cql_server::connection::process_request_one
>                 |          |                     |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |          |          
>                 |          |                     |          |--87.50%-- transport::cql_server::connection::process_request
>                 |          |                     |          |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          0x60e0000c3000
>                 |          |                     |          |          
>                 |          |                     |           --12.50%-- 0x8961de
>                 |          |                     |          
>                 |          |                     |--1.19%-- reactor::run
>                 |          |                     |          |          
>                 |          |                     |          |--87.50%-- smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
>                 |          |                     |          |          continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
>                 |          |                     |          |          0x600000043d00
>                 |          |                     |          |          
>                 |          |                     |           --12.50%-- app_template::run_deprecated
>                 |          |                     |                     main
>                 |          |                     |                     __libc_start_main
>                 |          |                     |                     _GLOBAL__sub_I__ZN3org6apache9cassandra21g_cassandra_constantsE
>                 |          |                     |                     0x7f4ae20c9fa0
>                 |          |                     |          
>                 |          |                     |--1.04%-- __clock_gettime
>                 |          |                     |          std::chrono::_V2::system_clock::now
>                 |          |                     |          |          
>                 |          |                     |          |--42.86%-- reactor::run
>                 |          |                     |          |          smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
>                 |          |                     |          |          continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
>                 |          |                     |          |          0x600000043d00
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0xa63209
>                 |          |                     |          |          
>                 |          |                     |          |          |--14.29%-- continuation<future<> future<>::finally<auto do_with<std::vector<frozen_mutation, std::allocator<frozen_mutation> >, shared_ptr<service::storage_proxy>, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}>(std::vector<frozen_mutation, std::allocator<frozen_mutation> >&&, shared_ptr<service::storage_proxy>&&, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}&&)::{lambda()#1}>(service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::a
>                 |          |                     |          |          0x2b7434
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- _ZN8futurizeI6futureIJSt10unique_ptrIN4cql317update_parametersESt14default_deleteIS3_EEEEE5applyIZNS2_10statements22modification_statement22make_update_parametersERN7seastar7shardedIN7service13storage_proxyEEE13lw_shared_ptrISt6vectorI13partition_keySaISK_EEESI_I26exploded_clustering_prefixERKNS2_13query_optionsEblEUlT_E_JNSt12experimental15fundamentals_v18optionalINS3_13prefetch_dataEEEEEES7_OST_OSt5tupleIJDpT0_EE
>                 |          |                     |          |          cql3::statements::modification_statement::make_update_parameters
>                 |          |                     |          |          cql3::statements::modification_statement::get_mutations
>                 |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |          |          cql3::query_options::query_options
>                 |          |                     |          |          0x7f4ad6bf80e0
>                 |          |                     |          |          
>                 |          |                     |           --14.29%-- database::apply_in_memory
>                 |          |                     |                     database::do_apply
>                 |          |                     |                     _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv
>                 |          |                     |                     reactor::del_timer
>                 |          |                     |                     0x6090000e2040
>                 |          |                     |          
>                 |          |                     |--1.04%-- memory::small_pool::allocate
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0x5257c379469d9
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0x609002b9fe98
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0x13c8b90
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0x60f000190710
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0x25
>                 |          |                     |          |          
>                 |          |                     |          |--14.29%-- 0x7f4ad6bf84c0
>                 |          |                     |          |          
>                 |          |                     |           --14.29%-- 0x7f4ad53f81f0
>                 |          |                     |          
>                 |          |                     |--0.89%-- db::serializer<atomic_cell_view>::serializer
>                 |          |                     |          mutation_partition_serializer::write_without_framing
>                 |          |                     |          frozen_mutation::frozen_mutation
>                 |          |                     |          frozen_mutation::frozen_mutation
>                 |          |                     |          
>                 |          |                     |--0.89%-- do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          0x60f0000c3000
>                 |          |                     |          
>                 |          |                     |--0.89%-- futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |          transport::cql_server::connection::process_request
>                 |          |                     |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          
>                 |          |                     |          |--83.33%-- 0x6090000c3000
>                 |          |                     |          |          
>                 |          |                     |           --16.67%-- 0x600000044400
>                 |          |                     |          
>                 |          |                     |--0.89%-- std::_Function_handler<void (), reactor::run()::{lambda()#8}>::_M_invoke
>                 |          |                     |          |          
>                 |          |                     |          |--50.00%-- reactor::run
>                 |          |                     |          |          smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
>                 |          |                     |          |          continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
>                 |          |                     |          |          0x600000043d00
>                 |          |                     |          |          
>                 |          |                     |           --50.00%-- reactor::signals::signal_handler::signal_handler
>                 |          |                     |                     0x3e8
>                 |          |                     |          
>                 |          |                     |--0.74%-- db::commitlog::segment::allocate
>                 |          |                     |          |          
>                 |          |                     |           --100.00%-- db::commitlog::add
>                 |          |                     |                     database::do_apply
>                 |          |                     |                     |          
>                 |          |                     |                     |--75.00%-- database::apply
>                 |          |                     |                     |          smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process
>                 |          |                     |                     |          smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
>                 |          |                     |                     |          boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
>                 |          |                     |                     |          boost::program_options::variables_map::get
>                 |          |                     |                     |          
>                 |          |                     |                      --25.00%-- _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv
>                 |          |                     |                                reactor::del_timer
>                 |          |                     |                                0x60b0000e2040
>                 |          |                     |          
>                 |          |                     |--0.74%-- service::storage_proxy::create_write_response_handler
>                 |          |                     |          
>                 |          |                     |--0.74%-- transport::cql_server::connection::process_request_one
>                 |          |                     |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |          |          
>                 |          |                     |          |--80.00%-- transport::cql_server::connection::process_request
>                 |          |                     |          |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          0x60a0000c3000
>                 |          |                     |          |          
>                 |          |                     |           --20.00%-- 0x8961de
>                 |          |                     |          
>                 |          |                     |--0.74%-- compound_type<(allow_prefixes)0>::compare
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- 0x6030056c0f20
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- boost::intrusive::bstbase2<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::find
>                 |          |                     |          |          mutation_partition::clustered_row
>                 |          |                     |          |          mutation::set_clustered_cell
>                 |          |                     |          |          cql3::constants::setter::execute
>                 |          |                     |          |          cql3::statements::update_statement::add_update_for_key
>                 |          |                     |          |          _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE
>                 |          |                     |          |          cql3::statements::modification_statement::get_mutations
>                 |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |          |          cql3::query_options::query_options
>                 |          |                     |          |          0x7f4adb3f80e0
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- compound_type<(allow_prefixes)0>::compare
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- mutation_partition::clustered_row
>                 |          |                     |          |          boost::intrusive::bstree_impl<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, unsigned long, true, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::insert_unique
>                 |          |                     |          |          boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node
>                 |          |                     |          |          0x12d
>                 |          |                     |          |          
>                 |          |                     |           --20.00%-- 0x60f00052daf0
>                 |          |                     |          
>                 |          |                     |--0.74%-- __memmove_ssse3_back
>                 |          |                     |          |          
>                 |          |                     |          |--40.00%-- output_stream<char>::write
>                 |          |                     |          |          |          
>                 |          |                     |          |          |--50.00%-- transport::cql_server::response::output
>                 |          |                     |          |          |          futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}>
>                 |          |                     |          |          |          
>                 |          |                     |          |           --50.00%-- 0x7c7fb2
>                 |          |                     |          |                     0x5257c37847fa0
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- transport::cql_server::connection::read_short_bytes
>                 |          |                     |          |          transport::cql_server::connection::process_query
>                 |          |                     |          |          0x7f4ada7f86f0
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- transport::cql_server::response::output
>                 |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}>
>                 |          |                     |          |          0x2
>                 |          |                     |          |          
>                 |          |                     |           --20.00%-- smp_message_queue::flush_response_batch
>                 |          |                     |                     boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
>                 |          |                     |                     boost::program_options::variables_map::get
>                 |          |                     |          
>                 |          |                     |--0.74%-- syscall_work_queue::work_item_returning<syscall_result_extra<stat>, reactor::file_size(basic_sstring<char, unsigned int, 15u>)::{lambda()#1}>::~work_item_returning
>                 |          |                     |          |          
>                 |          |                     |          |--60.00%-- 0x6130000c3000
>                 |          |                     |          |          
>                 |          |                     |          |--20.00%-- 0x608001fe59a0
>                 |          |                     |          |          
>                 |          |                     |           --20.00%-- 0x16
>                 |          |                     |          
>                 |          |                     |--0.74%-- __memset_sse2
>                 |          |                     |          |          
>                 |          |                     |          |--40.00%-- std::_Hashtable<range<dht::token>, std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >, std::allocator<std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<range<dht::token> >, std::hash<range<dht::token> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable
>                 |          |                     |          |          locator::token_metadata::pending_endpoints_for
>                 |          |                     |          |          service::storage_proxy::create_write_response_handler
>                 |          |                     |          |          service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
>                 |          |                     |          |          service::storage_proxy::mutate
>                 |          |                     |          |          service::storage_proxy::mutate_with_triggers
>                 |          |                     |          |          cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |          |          cql3::statements::modification_statement::execute
>                 |          |                     |          |          cql3::query_processor::process_statement
>                 |          |                     |          |          transport::cql_server::connection::process_execute
>                 |          |                     |          |          transport::cql_server::connection::process_request_one
>                 |          |                     |          |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |          |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |          |          transport::cql_server::connection::process_request
>                 |          |                     |          |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          |          0x6020000c3000
>                 |          |                     |          |          
>                 |          |                     |          |--40.00%-- service::digest_read_resolver::~digest_read_resolver
>                 |          |                     |          |          |          
>                 |          |                     |          |           --100.00%-- 0x610002612b50
>                 |          |                     |          |          
>                 |          |                     |           --20.00%-- std::_Hashtable<basic_sstring<char, unsigned int, 15u>, std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, std::allocator<std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<basic_sstring<char, unsigned int, 15u> >, std::hash<basic_sstring<char, unsigned int, 15u> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable
>                 |          |                     |                     service::storage_proxy::send_to_live_endpoints
>                 |          |                     |                     parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}>
>                 |          |                     |                     service::storage_proxy::mutate
>                 |          |                     |                     service::storage_proxy::mutate_with_triggers
>                 |          |                     |                     cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |                     cql3::statements::modification_statement::execute
>                 |          |                     |                     cql3::query_processor::process_statement
>                 |          |                     |                     transport::cql_server::connection::process_execute
>                 |          |                     |                     transport::cql_server::connection::process_request_one
>                 |          |                     |                     futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |                     |                     futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |                     futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |                     transport::cql_server::connection::process_request
>                 |          |                     |                     do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |                     do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |                     do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |                     0x6070000c3000
>                 |          |                     |          
>                 |          |                     |--0.74%-- reactor::del_timer
>                 |          |                     |          |          
>                 |          |                     |          |--80.00%-- 0x60a0000e2040
>                 |          |                     |          |          
>                 |          |                     |           --20.00%-- 0x6080000c3db0
>                 |          |                     |          
>                 |          |                     |--0.59%-- unimplemented::operator<<
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev
>                 |          |                     |          |          0x600100000008
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- floating_type_impl<float>::from_string
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- 0x60e0000e4c10
>                 |          |                     |          |          
>                 |          |                     |           --25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev
>                 |          |                     |                     0x600100000008
>                 |          |                     |          
>                 |          |                     |--0.59%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node
>                 |          |                     |          service::storage_proxy::register_response_handler
>                 |          |                     |          service::storage_proxy::create_write_response_handler
>                 |          |                     |          service::storage_proxy::create_write_response_handler
>                 |          |                     |          service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
>                 |          |                     |          service::storage_proxy::mutate
>                 |          |                     |          service::storage_proxy::mutate_with_triggers
>                 |          |                     |          cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |          cql3::statements::modification_statement::execute
>                 |          |                     |          cql3::query_processor::process_statement
>                 |          |                     |          transport::cql_server::connection::process_execute
>                 |          |                     |          transport::cql_server::connection::process_request_one
>                 |          |                     |          futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |                     |          futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |                     |          transport::cql_server::connection::process_request
>                 |          |                     |          do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |                     |          0x60b0000c3000
>                 |          |                     |          
>                 |          |                     |--0.59%-- mutation::set_clustered_cell
>                 |          |                     |          |          
>                 |          |                     |          |--75.00%-- 0xa
>                 |          |                     |          |          
>                 |          |                     |           --25.00%-- cql3::constants::setter::execute
>                 |          |                     |                     cql3::statements::update_statement::add_update_for_key
>                 |          |                     |                     _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE
>                 |          |                     |                     cql3::statements::modification_statement::get_mutations
>                 |          |                     |                     cql3::statements::modification_statement::execute_without_condition
>                 |          |                     |                     cql3::query_options::query_options
>                 |          |                     |                     0x7f4ad89f80e0
>                 |          |                     |          
>                 |          |                     |--0.59%-- memory::small_pool::small_pool
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- memory::stats
>                 |          |                     |          |          boost::program_options::variables_map::get
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- memory::reclaimer::~reclaimer
>                 |          |                     |          |          0x1e
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- memory::allocate_aligned
>                 |          |                     |          |          
>                 |          |                     |           --25.00%-- memory::small_pool::add_more_objects
>                 |          |                     |                     memory::small_pool::add_more_objects
>                 |          |                     |                     0x6100000e0310
>                 |          |                     |          
>                 |          |                     |--0.59%-- __memcpy_sse2_unaligned
>                 |          |                     |          |          
>                 |          |                     |          |--50.00%-- mutation_partition_applier::accept_row_cell
>                 |          |                     |          |          mutation_partition_view::accept
>                 |          |                     |          |          boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node
>                 |          |                     |          |          0x12d
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- scanning_reader::operator()
>                 |          |                     |          |          sstables::sstable::do_write_components
>                 |          |                     |          |          sstables::sstable::prepare_write_components
>                 |          |                     |          |          std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke
>                 |          |                     |          |          _GLOBAL__sub_I__ZN12app_templateC2Ev
>                 |          |                     |          |          
>                 |          |                     |           --25.00%-- memtable::find_or_create_partition_slow
>                 |          |                     |                     memtable::apply
>                 |          |                     |                     database::apply_in_memory
>                 |          |                     |                     database::do_apply
>                 |          |                     |                     database::apply
>                 |          |                     |                     smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process
>                 |          |                     |                     smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}>
>                 |          |                     |                     boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
>                 |          |                     |                     boost::program_options::variables_map::get
>                 |          |                     |          
>                 |          |                     |--0.59%-- smp_message_queue::flush_response_batch
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop
>                 |          |                     |          |          boost::program_options::variables_map::get
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- 0x13
>                 |          |                     |          |          
>                 |          |                     |          |--25.00%-- 0x7f4ad5ff8f40
>                 |          |                     |          |          
>                 |          |                     |           --25.00%-- reactor::run
>                 |          |                     |                     smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator()
>                 |          |                     |                     continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run
>                 |          |                     |                     0x600000043d00
>                 |          |                      --54.38%-- [...]
>                 |          |          
>                 |          |--14.26%-- schedule_timeout
>                 |          |          |          
>                 |          |          |--38.52%-- wait_for_completion
>                 |          |          |          |          
>                 |          |          |          |--90.07%-- flush_work
>                 |          |          |          |          xlog_cil_force_lsn
>                 |          |          |          |          |          
>                 |          |          |          |          |--96.85%-- _xfs_log_force_lsn
>                 |          |          |          |          |          |          
>                 |          |          |          |          |          |--79.67%-- xfs_file_fsync
>                 |          |          |          |          |          |          vfs_fsync_range
>                 |          |          |          |          |          |          do_fsync
>                 |          |          |          |          |          |          sys_fdatasync
>                 |          |          |          |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          |          |          |          
>                 |          |          |          |          |          |           --100.00%-- 0x7f4ade4212ad
>                 |          |          |          |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |          |          |          |                     0x6030000c3ec0
>                 |          |          |          |          |          |          
>                 |          |          |          |          |           --20.33%-- xfs_dir_fsync
>                 |          |          |          |          |                     vfs_fsync_range
>                 |          |          |          |          |                     do_fsync
>                 |          |          |          |          |                     sys_fdatasync
>                 |          |          |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |          |          |          |                     |          
>                 |          |          |          |          |                      --100.00%-- 0x7f4ade4212ad
>                 |          |          |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |          |          |                                0x6040000c3ec0
>                 |          |          |          |          |          
>                 |          |          |          |           --3.15%-- _xfs_log_force
>                 |          |          |          |                     xfs_log_force
>                 |          |          |          |                     xfs_buf_lock
>                 |          |          |          |                     _xfs_buf_find
>                 |          |          |          |                     xfs_buf_get_map
>                 |          |          |          |                     xfs_trans_get_buf_map
>                 |          |          |          |                     xfs_btree_get_bufl
>                 |          |          |          |                     xfs_bmap_extents_to_btree
>                 |          |          |          |                     xfs_bmap_add_extent_hole_real
>                 |          |          |          |                     xfs_bmapi_write
>                 |          |          |          |                     xfs_iomap_write_direct
>                 |          |          |          |                     __xfs_get_blocks
>                 |          |          |          |                     xfs_get_blocks_direct
>                 |          |          |          |                     do_blockdev_direct_IO
>                 |          |          |          |                     __blockdev_direct_IO
>                 |          |          |          |                     xfs_vm_direct_IO
>                 |          |          |          |                     xfs_file_dio_aio_write
>                 |          |          |          |                     xfs_file_write_iter
>                 |          |          |          |                     aio_run_iocb
>                 |          |          |          |                     do_io_submit
>                 |          |          |          |                     sys_io_submit
>                 |          |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |          |          |                     io_submit
>                 |          |          |          |                     0x46d98a
>                 |          |          |          |          
>                 |          |          |           --9.93%-- submit_bio_wait
>                 |          |          |                     blkdev_issue_flush
>                 |          |          |                     xfs_blkdev_issue_flush
>                 |          |          |                     xfs_file_fsync
>                 |          |          |                     vfs_fsync_range
>                 |          |          |                     do_fsync
>                 |          |          |                     sys_fdatasync
>                 |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |          |                     |          
>                 |          |          |                      --100.00%-- 0x7f4ade4212ad
>                 |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |                                0x6030000c3ec0
>                 |          |          |          
>                 |          |          |--32.79%-- io_schedule_timeout
>                 |          |          |          bit_wait_io
>                 |          |          |          __wait_on_bit
>                 |          |          |          |          
>                 |          |          |          |--51.67%-- wait_on_page_bit
>                 |          |          |          |          |          
>                 |          |          |          |          |--95.16%-- filemap_fdatawait_range
>                 |          |          |          |          |          filemap_write_and_wait_range
>                 |          |          |          |          |          xfs_file_fsync
>                 |          |          |          |          |          vfs_fsync_range
>                 |          |          |          |          |          do_fsync
>                 |          |          |          |          |          sys_fdatasync
>                 |          |          |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          |          0x7f4ade4212ad
>                 |          |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |          |          |          0x60b0000c3ec0
>                 |          |          |          |          |          
>                 |          |          |          |           --4.84%-- __migration_entry_wait
>                 |          |          |          |                     migration_entry_wait
>                 |          |          |          |                     handle_mm_fault
>                 |          |          |          |                     __do_page_fault
>                 |          |          |          |                     do_page_fault
>                 |          |          |          |                     page_fault
>                 |          |          |          |                     std::_Function_handler<void (), httpd::http_server::_date_format_timer::{lambda()#1}>::_M_invoke
>                 |          |          |          |                     |          
>                 |          |          |          |                      --100.00%-- service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
>                 |          |          |          |                                service::storage_proxy::mutate
>                 |          |          |          |                                service::storage_proxy::mutate_with_triggers
>                 |          |          |          |                                cql3::statements::modification_statement::execute_without_condition
>                 |          |          |          |                                cql3::statements::modification_statement::execute
>                 |          |          |          |                                cql3::query_processor::process_statement
>                 |          |          |          |                                transport::cql_server::connection::process_execute
>                 |          |          |          |                                transport::cql_server::connection::process_request_one
>                 |          |          |          |                                futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
>                 |          |          |          |                                futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
>                 |          |          |          |                                futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
>                 |          |          |          |                                transport::cql_server::connection::process_request
>                 |          |          |          |                                do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
>                 |          |          |          |                                do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |          |          |                                do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
>                 |          |          |          |                                0x6140000c3000
>                 |          |          |          |          
>                 |          |          |           --48.33%-- out_of_line_wait_on_bit
>                 |          |          |                     block_truncate_page
>                 |          |          |                     xfs_setattr_size
>                 |          |          |                     xfs_vn_setattr
>                 |          |          |                     notify_change
>                 |          |          |                     do_truncate
>                 |          |          |                     do_sys_ftruncate.constprop.15
>                 |          |          |                     sys_ftruncate
>                 |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |          |                     __GI___ftruncate64
>                 |          |          |                     syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
>                 |          |          |                     |          
>                 |          |          |                     |--13.79%-- 0x7f4ad29ff700
>                 |          |          |                     |          
>                 |          |          |                     |--13.79%-- 0x7f4acdbff700
>                 |          |          |                     |          
>                 |          |          |                     |--12.07%-- 0x7f4ad05ff700
>                 |          |          |                     |          
>                 |          |          |                     |--12.07%-- 0x7f4acedff700
>                 |          |          |                     |          
>                 |          |          |                     |--10.34%-- 0x7f4ad0bff700
>                 |          |          |                     |          
>                 |          |          |                     |--6.90%-- 0x7f4ad2fff700
>                 |          |          |                     |          
>                 |          |          |                     |--6.90%-- 0x7f4ad11ff700
>                 |          |          |                     |          
>                 |          |          |                     |--6.90%-- 0x7f4acf9ff700
>                 |          |          |                     |          
>                 |          |          |                     |--6.90%-- 0x7f4acf3ff700
>                 |          |          |                     |          
>                 |          |          |                     |--6.90%-- 0x7f4ace7ff700
>                 |          |          |                     |          
>                 |          |          |                     |--1.72%-- 0x7f4ad17ff700
>                 |          |          |                     |          
>                 |          |          |                      --1.72%-- 0x7f4aca5ff700
>                 |          |          |          
>                 |          |           --28.69%-- __down
>                 |          |                     down
>                 |          |                     xfs_buf_lock
>                 |          |                     _xfs_buf_find
>                 |          |                     xfs_buf_get_map
>                 |          |                     |          
>                 |          |                     |--97.14%-- xfs_buf_read_map
>                 |          |                     |          xfs_trans_read_buf_map
>                 |          |                     |          |          
>                 |          |                     |          |--98.04%-- xfs_read_agf
>                 |          |                     |          |          xfs_alloc_read_agf
>                 |          |                     |          |          xfs_alloc_fix_freelist
>                 |          |                     |          |          |          
>                 |          |                     |          |          |--93.00%-- xfs_free_extent
>                 |          |                     |          |          |          xfs_bmap_finish
>                 |          |                     |          |          |          xfs_itruncate_extents
>                 |          |                     |          |          |          |          
>                 |          |                     |          |          |          |--87.10%-- xfs_inactive_truncate
>                 |          |                     |          |          |          |          xfs_inactive
>                 |          |                     |          |          |          |          xfs_fs_evict_inode
>                 |          |                     |          |          |          |          evict
>                 |          |                     |          |          |          |          iput
>                 |          |                     |          |          |          |          __dentry_kill
>                 |          |                     |          |          |          |          dput
>                 |          |                     |          |          |          |          __fput
>                 |          |                     |          |          |          |          ____fput
>                 |          |                     |          |          |          |          task_work_run
>                 |          |                     |          |          |          |          do_notify_resume
>                 |          |                     |          |          |          |          int_signal
>                 |          |                     |          |          |          |          __libc_close
>                 |          |                     |          |          |          |          std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
>                 |          |                     |          |          |          |          
>                 |          |                     |          |          |           --12.90%-- xfs_setattr_size
>                 |          |                     |          |          |                     xfs_vn_setattr
>                 |          |                     |          |          |                     notify_change
>                 |          |                     |          |          |                     do_truncate
>                 |          |                     |          |          |                     do_sys_ftruncate.constprop.15
>                 |          |                     |          |          |                     sys_ftruncate
>                 |          |                     |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |                     |          |          |                     |          
>                 |          |                     |          |          |                      --100.00%-- __GI___ftruncate64
>                 |          |                     |          |          |                                syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--20.00%-- 0x7f4ad0bff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--20.00%-- 0x7f4acedff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--10.00%-- 0x7f4ad2fff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--10.00%-- 0x7f4ad17ff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--10.00%-- 0x7f4ad11ff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--10.00%-- 0x7f4ad05ff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                |--10.00%-- 0x7f4acf3ff700
>                 |          |                     |          |          |                                |          
>                 |          |                     |          |          |                                 --10.00%-- 0x7f4acdbff700
>                 |          |                     |          |          |          
>                 |          |                     |          |           --7.00%-- xfs_alloc_vextent
>                 |          |                     |          |                     xfs_bmap_btalloc
>                 |          |                     |          |                     xfs_bmap_alloc
>                 |          |                     |          |                     xfs_bmapi_write
>                 |          |                     |          |                     xfs_iomap_write_direct
>                 |          |                     |          |                     __xfs_get_blocks
>                 |          |                     |          |                     xfs_get_blocks_direct
>                 |          |                     |          |                     do_blockdev_direct_IO
>                 |          |                     |          |                     __blockdev_direct_IO
>                 |          |                     |          |                     xfs_vm_direct_IO
>                 |          |                     |          |                     xfs_file_dio_aio_write
>                 |          |                     |          |                     xfs_file_write_iter
>                 |          |                     |          |                     aio_run_iocb
>                 |          |                     |          |                     do_io_submit
>                 |          |                     |          |                     sys_io_submit
>                 |          |                     |          |                     entry_SYSCALL_64_fastpath
>                 |          |                     |          |                     io_submit
>                 |          |                     |          |                     0x46d98a
>                 |          |                     |          |          
>                 |          |                     |           --1.96%-- xfs_read_agi
>                 |          |                     |                     xfs_iunlink_remove
>                 |          |                     |                     xfs_ifree
>                 |          |                     |                     xfs_inactive_ifree
>                 |          |                     |                     xfs_inactive
>                 |          |                     |                     xfs_fs_evict_inode
>                 |          |                     |                     evict
>                 |          |                     |                     iput
>                 |          |                     |                     __dentry_kill
>                 |          |                     |                     dput
>                 |          |                     |                     __fput
>                 |          |                     |                     ____fput
>                 |          |                     |                     task_work_run
>                 |          |                     |                     do_notify_resume
>                 |          |                     |                     int_signal
>                 |          |                     |                     __libc_close
>                 |          |                     |                     std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
>                 |          |                     |          
>                 |          |                      --2.86%-- xfs_trans_get_buf_map
>                 |          |                                xfs_btree_get_bufl
>                 |          |                                xfs_bmap_extents_to_btree
>                 |          |                                xfs_bmap_add_extent_hole_real
>                 |          |                                xfs_bmapi_write
>                 |          |                                xfs_iomap_write_direct
>                 |          |                                __xfs_get_blocks
>                 |          |                                xfs_get_blocks_direct
>                 |          |                                do_blockdev_direct_IO
>                 |          |                                __blockdev_direct_IO
>                 |          |                                xfs_vm_direct_IO
>                 |          |                                xfs_file_dio_aio_write
>                 |          |                                xfs_file_write_iter
>                 |          |                                aio_run_iocb
>                 |          |                                do_io_submit
>                 |          |                                sys_io_submit
>                 |          |                                entry_SYSCALL_64_fastpath
>                 |          |                                io_submit
>                 |          |                                0x46d98a
>                 |          |          
>                 |          |--13.48%-- eventfd_ctx_read
>                 |          |          eventfd_read
>                 |          |          __vfs_read
>                 |          |          vfs_read
>                 |          |          sys_read
>                 |          |          entry_SYSCALL_64_fastpath
>                 |          |          0x7f4ade6f754d
>                 |          |          smp_message_queue::respond
>                 |          |          0xffffffffffffffff
>                 |          |          
>                 |          |--7.83%-- md_flush_request
>                 |          |          raid0_make_request
>                 |          |          md_make_request
>                 |          |          generic_make_request
>                 |          |          submit_bio
>                 |          |          |          
>                 |          |          |--92.54%-- submit_bio_wait
>                 |          |          |          blkdev_issue_flush
>                 |          |          |          xfs_blkdev_issue_flush
>                 |          |          |          xfs_file_fsync
>                 |          |          |          vfs_fsync_range
>                 |          |          |          do_fsync
>                 |          |          |          sys_fdatasync
>                 |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          
>                 |          |          |           --100.00%-- 0x7f4ade4212ad
>                 |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |                     0x6010000c3ec0
>                 |          |          |          
>                 |          |           --7.46%-- _xfs_buf_ioapply
>                 |          |                     xfs_buf_submit
>                 |          |                     xlog_bdstrat
>                 |          |                     xlog_sync
>                 |          |                     xlog_state_release_iclog
>                 |          |                     |          
>                 |          |                     |--73.33%-- _xfs_log_force_lsn
>                 |          |                     |          xfs_file_fsync
>                 |          |                     |          vfs_fsync_range
>                 |          |                     |          do_fsync
>                 |          |                     |          sys_fdatasync
>                 |          |                     |          entry_SYSCALL_64_fastpath
>                 |          |                     |          |          
>                 |          |                     |           --100.00%-- 0x7f4ade4212ad
>                 |          |                     |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |                     |                     0x6080000c3ec0
>                 |          |                     |          
>                 |          |                      --26.67%-- _xfs_log_force
>                 |          |                                xfs_log_force
>                 |          |                                xfs_buf_lock
>                 |          |                                _xfs_buf_find
>                 |          |                                xfs_buf_get_map
>                 |          |                                xfs_trans_get_buf_map
>                 |          |                                xfs_btree_get_bufl
>                 |          |                                xfs_bmap_extents_to_btree
>                 |          |                                xfs_bmap_add_extent_hole_real
>                 |          |                                xfs_bmapi_write
>                 |          |                                xfs_iomap_write_direct
>                 |          |                                __xfs_get_blocks
>                 |          |                                xfs_get_blocks_direct
>                 |          |                                do_blockdev_direct_IO
>                 |          |                                __blockdev_direct_IO
>                 |          |                                xfs_vm_direct_IO
>                 |          |                                xfs_file_dio_aio_write
>                 |          |                                xfs_file_write_iter
>                 |          |                                aio_run_iocb
>                 |          |                                do_io_submit
>                 |          |                                sys_io_submit
>                 |          |                                entry_SYSCALL_64_fastpath
>                 |          |                                io_submit
>                 |          |                                0x46d98a
>                 |          |          
>                 |          |--5.53%-- _xfs_log_force_lsn
>                 |          |          |          
>                 |          |          |--80.28%-- xfs_file_fsync
>                 |          |          |          vfs_fsync_range
>                 |          |          |          do_fsync
>                 |          |          |          sys_fdatasync
>                 |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          
>                 |          |          |           --100.00%-- 0x7f4ade4212ad
>                 |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |                     |          
>                 |          |          |                     |--97.92%-- 0x60d0000c3ec0
>                 |          |          |                     |          
>                 |          |          |                     |--1.04%-- 0x6020000c3ec0
>                 |          |          |                     |          
>                 |          |          |                      --1.04%-- 0x600000557ec0
>                 |          |          |          
>                 |          |           --19.72%-- xfs_dir_fsync
>                 |          |                     vfs_fsync_range
>                 |          |                     do_fsync
>                 |          |                     sys_fdatasync
>                 |          |                     entry_SYSCALL_64_fastpath
>                 |          |                     |          
>                 |          |                      --100.00%-- 0x7f4ade4212ad
>                 |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |                                0x6040000c3ec0
>                 |          |          
>                 |          |--1.25%-- rwsem_down_read_failed
>                 |          |          call_rwsem_down_read_failed
>                 |          |          |          
>                 |          |          |--90.62%-- xfs_ilock
>                 |          |          |          |          
>                 |          |          |          |--86.21%-- xfs_ilock_data_map_shared
>                 |          |          |          |          __xfs_get_blocks
>                 |          |          |          |          xfs_get_blocks_direct
>                 |          |          |          |          do_blockdev_direct_IO
>                 |          |          |          |          __blockdev_direct_IO
>                 |          |          |          |          xfs_vm_direct_IO
>                 |          |          |          |          xfs_file_dio_aio_write
>                 |          |          |          |          xfs_file_write_iter
>                 |          |          |          |          aio_run_iocb
>                 |          |          |          |          do_io_submit
>                 |          |          |          |          sys_io_submit
>                 |          |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          |          
>                 |          |          |          |           --100.00%-- io_submit
>                 |          |          |          |                     0x46d98a
>                 |          |          |          |          
>                 |          |          |          |--6.90%-- xfs_file_fsync
>                 |          |          |          |          vfs_fsync_range
>                 |          |          |          |          do_fsync
>                 |          |          |          |          sys_fdatasync
>                 |          |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          0x7f4ade4212ad
>                 |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |          |          0x6090000c3ec0
>                 |          |          |          |          
>                 |          |          |           --6.90%-- xfs_dir_fsync
>                 |          |          |                     vfs_fsync_range
>                 |          |          |                     do_fsync
>                 |          |          |                     sys_fdatasync
>                 |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |          |                     0x7f4ade4212ad
>                 |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |                     0x6070000c3ec0
>                 |          |          |          
>                 |          |           --9.38%-- xfs_log_commit_cil
>                 |          |                     __xfs_trans_commit
>                 |          |                     xfs_trans_commit
>                 |          |                     |          
>                 |          |                     |--33.33%-- xfs_setattr_size
>                 |          |                     |          xfs_vn_setattr
>                 |          |                     |          notify_change
>                 |          |                     |          do_truncate
>                 |          |                     |          do_sys_ftruncate.constprop.15
>                 |          |                     |          sys_ftruncate
>                 |          |                     |          entry_SYSCALL_64_fastpath
>                 |          |                     |          __GI___ftruncate64
>                 |          |                     |          syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
>                 |          |                     |          0x7f4acedff700
>                 |          |                     |          
>                 |          |                     |--33.33%-- xfs_vn_update_time
>                 |          |                     |          file_update_time
>                 |          |                     |          xfs_file_aio_write_checks
>                 |          |                     |          xfs_file_dio_aio_write
>                 |          |                     |          xfs_file_write_iter
>                 |          |                     |          aio_run_iocb
>                 |          |                     |          do_io_submit
>                 |          |                     |          sys_io_submit
>                 |          |                     |          entry_SYSCALL_64_fastpath
>                 |          |                     |          io_submit
>                 |          |                     |          0x46d98a
>                 |          |                     |          
>                 |          |                      --33.33%-- xfs_bmap_add_attrfork
>                 |          |                                xfs_attr_set
>                 |          |                                xfs_initxattrs
>                 |          |                                security_inode_init_security
>                 |          |                                xfs_init_security
>                 |          |                                xfs_generic_create
>                 |          |                                xfs_vn_mknod
>                 |          |                                xfs_vn_create
>                 |          |                                vfs_create
>                 |          |                                path_openat
>                 |          |                                do_filp_open
>                 |          |                                do_sys_open
>                 |          |                                sys_open
>                 |          |                                entry_SYSCALL_64_fastpath
>                 |          |                                0x7f4ade6f7cdd
>                 |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, reactor::open_file_dma(basic_sstring<char, unsigned int, 15u>, open_flags, file_open_options)::{lambda()#1}>::process
>                 |          |                                0xffffffffffffffff
>                 |          |          
>                 |          |--0.97%-- rwsem_down_write_failed
>                 |          |          call_rwsem_down_write_failed
>                 |          |          xfs_ilock
>                 |          |          xfs_vn_update_time
>                 |          |          file_update_time
>                 |          |          xfs_file_aio_write_checks
>                 |          |          xfs_file_dio_aio_write
>                 |          |          xfs_file_write_iter
>                 |          |          aio_run_iocb
>                 |          |          do_io_submit
>                 |          |          sys_io_submit
>                 |          |          entry_SYSCALL_64_fastpath
>                 |          |          io_submit
>                 |          |          0x46d98a
>                 |          |          
>                 |          |--0.51%-- xlog_cil_force_lsn
>                 |          |          |          
>                 |          |          |--92.31%-- _xfs_log_force_lsn
>                 |          |          |          |          
>                 |          |          |          |--91.67%-- xfs_file_fsync
>                 |          |          |          |          vfs_fsync_range
>                 |          |          |          |          do_fsync
>                 |          |          |          |          sys_fdatasync
>                 |          |          |          |          entry_SYSCALL_64_fastpath
>                 |          |          |          |          0x7f4ade4212ad
>                 |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |          |          0x60b0000c3ec0
>                 |          |          |          |          
>                 |          |          |           --8.33%-- xfs_dir_fsync
>                 |          |          |                     vfs_fsync_range
>                 |          |          |                     do_fsync
>                 |          |          |                     sys_fdatasync
>                 |          |          |                     entry_SYSCALL_64_fastpath
>                 |          |          |                     0x7f4ade4212ad
>                 |          |          |                     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                 |          |          |                     0x60d0000c3ec0
>                 |          |          |          
>                 |          |           --7.69%-- _xfs_log_force
>                 |          |                     xfs_log_force
>                 |          |                     xfs_buf_lock
>                 |          |                     _xfs_buf_find
>                 |          |                     xfs_buf_get_map
>                 |          |                     xfs_trans_get_buf_map
>                 |          |                     xfs_btree_get_bufl
>                 |          |                     xfs_bmap_extents_to_btree
>                 |          |                     xfs_bmap_add_extent_hole_real
>                 |          |                     xfs_bmapi_write
>                 |          |                     xfs_iomap_write_direct
>                 |          |                     __xfs_get_blocks
>                 |          |                     xfs_get_blocks_direct
>                 |          |                     do_blockdev_direct_IO
>                 |          |                     __blockdev_direct_IO
>                 |          |                     xfs_vm_direct_IO
>                 |          |                     xfs_file_dio_aio_write
>                 |          |                     xfs_file_write_iter
>                 |          |                     aio_run_iocb
>                 |          |                     do_io_submit
>                 |          |                     sys_io_submit
>                 |          |                     entry_SYSCALL_64_fastpath
>                 |          |                     io_submit
>                 |          |                     0x46d98a
>                 |           --0.04%-- [...]
>                 |          
>                  --3.82%-- preempt_schedule_common
>                            |          
>                            |--99.02%-- _cond_resched
>                            |          |          
>                            |          |--41.58%-- wait_for_completion
>                            |          |          |          
>                            |          |          |--66.67%-- flush_work
>                            |          |          |          xlog_cil_force_lsn
>                            |          |          |          |          
>                            |          |          |          |--96.43%-- _xfs_log_force_lsn
>                            |          |          |          |          |          
>                            |          |          |          |          |--77.78%-- xfs_file_fsync
>                            |          |          |          |          |          vfs_fsync_range
>                            |          |          |          |          |          do_fsync
>                            |          |          |          |          |          sys_fdatasync
>                            |          |          |          |          |          entry_SYSCALL_64_fastpath
>                            |          |          |          |          |          0x7f4ade4212ad
>                            |          |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                            |          |          |          |          |          0x6030000c3ec0
>                            |          |          |          |          |          
>                            |          |          |          |           --22.22%-- xfs_dir_fsync
>                            |          |          |          |                     vfs_fsync_range
>                            |          |          |          |                     do_fsync
>                            |          |          |          |                     sys_fdatasync
>                            |          |          |          |                     entry_SYSCALL_64_fastpath
>                            |          |          |          |                     |          
>                            |          |          |          |                      --100.00%-- 0x7f4ade4212ad
>                            |          |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                            |          |          |          |                                0x6030000c3ec0
>                            |          |          |          |          
>                            |          |          |           --3.57%-- _xfs_log_force
>                            |          |          |                     xfs_log_force
>                            |          |          |                     xfs_buf_lock
>                            |          |          |                     _xfs_buf_find
>                            |          |          |                     xfs_buf_get_map
>                            |          |          |                     xfs_trans_get_buf_map
>                            |          |          |                     xfs_btree_get_bufl
>                            |          |          |                     xfs_bmap_extents_to_btree
>                            |          |          |                     xfs_bmap_add_extent_hole_real
>                            |          |          |                     xfs_bmapi_write
>                            |          |          |                     xfs_iomap_write_direct
>                            |          |          |                     __xfs_get_blocks
>                            |          |          |                     xfs_get_blocks_direct
>                            |          |          |                     do_blockdev_direct_IO
>                            |          |          |                     __blockdev_direct_IO
>                            |          |          |                     xfs_vm_direct_IO
>                            |          |          |                     xfs_file_dio_aio_write
>                            |          |          |                     xfs_file_write_iter
>                            |          |          |                     aio_run_iocb
>                            |          |          |                     do_io_submit
>                            |          |          |                     sys_io_submit
>                            |          |          |                     entry_SYSCALL_64_fastpath
>                            |          |          |                     io_submit
>                            |          |          |                     0x46d98a
>                            |          |          |          
>                            |          |           --33.33%-- submit_bio_wait
>                            |          |                     blkdev_issue_flush
>                            |          |                     xfs_blkdev_issue_flush
>                            |          |                     xfs_file_fsync
>                            |          |                     vfs_fsync_range
>                            |          |                     do_fsync
>                            |          |                     sys_fdatasync
>                            |          |                     entry_SYSCALL_64_fastpath
>                            |          |                     |          
>                            |          |                      --100.00%-- 0x7f4ade4212ad
>                            |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                            |          |                                0x6030000c3ec0
>                            |          |          
>                            |          |--33.66%-- flush_work
>                            |          |          xlog_cil_force_lsn
>                            |          |          |          
>                            |          |          |--97.06%-- _xfs_log_force_lsn
>                            |          |          |          |          
>                            |          |          |          |--78.79%-- xfs_file_fsync
>                            |          |          |          |          vfs_fsync_range
>                            |          |          |          |          do_fsync
>                            |          |          |          |          sys_fdatasync
>                            |          |          |          |          entry_SYSCALL_64_fastpath
>                            |          |          |          |          0x7f4ade4212ad
>                            |          |          |          |          syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                            |          |          |          |          0x6030000c3ec0
>                            |          |          |          |          
>                            |          |          |           --21.21%-- xfs_dir_fsync
>                            |          |          |                     vfs_fsync_range
>                            |          |          |                     do_fsync
>                            |          |          |                     sys_fdatasync
>                            |          |          |                     entry_SYSCALL_64_fastpath
>                            |          |          |                     |          
>                            |          |          |                      --100.00%-- 0x7f4ade4212ad
>                            |          |          |                                syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
>                            |          |          |                                0x6030000c3ec0
>                            |          |          |          
>                            |          |           --2.94%-- _xfs_log_force
>                            |          |                     xfs_log_force
>                            |          |                     xfs_buf_lock
>                            |          |                     _xfs_buf_find
>                            |          |                     xfs_buf_get_map
>                            |          |                     xfs_trans_get_buf_map
>                            |          |                     xfs_btree_get_bufl
>                            |          |                     xfs_bmap_extents_to_btree
>                            |          |                     xfs_bmap_add_extent_hole_real
>                            |          |                     xfs_bmapi_write
>                            |          |                     xfs_iomap_write_direct
>                            |          |                     __xfs_get_blocks
>                            |          |                     xfs_get_blocks_direct
>                            |          |                     do_blockdev_direct_IO
>                            |          |                     __blockdev_direct_IO
>                            |          |                     xfs_vm_direct_IO
>                            |          |                     xfs_file_dio_aio_write
>                            |          |                     xfs_file_write_iter
>                            |          |                     aio_run_iocb
>                            |          |                     do_io_submit
>                            |          |                     sys_io_submit
>                            |          |                     entry_SYSCALL_64_fastpath
>                            |          |                     io_submit
>                            |          |                     0x46d98a
>                            |          |          
>                            |          |--13.86%-- lock_sock_nested
>                            |          |          |          
>                            |          |          |--78.57%-- tcp_sendmsg
>                            |          |          |          inet_sendmsg
>                            |          |          |          sock_sendmsg
>                            |          |          |          SYSC_sendto
>                            |          |          |          sys_sendto
>                            |          |          |          entry_SYSCALL_64_fastpath
>                            |          |          |          __libc_send
>                            |          |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
>                            |          |          |          |          
>                            |          |          |          |--36.36%-- 0x7f4ad6bf8de0
>                            |          |          |          |          
>                            |          |          |          |--9.09%-- 0x4
>                            |          |          |          |          
>                            |          |          |          |--9.09%-- 0x7f4adadf8de0
>                            |          |          |          |          
>                            |          |          |          |--9.09%-- 0x7f4ada1f8de0
>                            |          |          |          |          
>                            |          |          |          |--9.09%-- 0x7f4ad89f8de0
>                            |          |          |          |          
>                            |          |          |          |--9.09%-- 0x7f4ad83f8de0
>                            |          |          |          |          
>                            |          |          |          |--9.09%-- 0x7f4ad4df8de0
>                            |          |          |          |          
>                            |          |          |           --9.09%-- 0x7f4ad35f8de0
>                            |          |          |          
>                            |          |           --21.43%-- tcp_recvmsg
>                            |          |                     inet_recvmsg
>                            |          |                     sock_recvmsg
>                            |          |                     sock_read_iter
>                            |          |                     __vfs_read
>                            |          |                     vfs_read
>                            |          |                     sys_read
>                            |          |                     entry_SYSCALL_64_fastpath
>                            |          |                     0x7f4ade6f754d
>                            |          |                     reactor::read_some
>                            |          |                     |          
>                            |          |                     |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv
>                            |          |                     |          reactor::del_timer
>                            |          |                     |          0x6160000e2040
>                            |          |                     |          
>                            |          |                      --33.33%-- continuation<future<> future<>::then_wrapped<future<> future<>::finally<auto seastar::with_gate<transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}>(seastar::gate&, transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>(seastar::gate&)::{lambda(future<>)#1}::operator()(future<>)::{lambda(seastar::gate)#1}, future<> >(seastar::gate&)::{lambda(seastar::gate&)#1}>::run
>                            |          |                                reactor::del_timer
>                            |          |                                0x6030000e2040
>                            |          |          
>                            |          |--3.96%-- generic_make_request_checks
>                            |          |          generic_make_request
>                            |          |          submit_bio
>                            |          |          do_blockdev_direct_IO
>                            |          |          __blockdev_direct_IO
>                            |          |          xfs_vm_direct_IO
>                            |          |          xfs_file_dio_aio_write
>                            |          |          xfs_file_write_iter
>                            |          |          aio_run_iocb
>                            |          |          do_io_submit
>                            |          |          sys_io_submit
>                            |          |          entry_SYSCALL_64_fastpath
>                            |          |          io_submit
>                            |          |          0x46d98a
>                            |          |          
>                            |          |--3.96%-- kmem_cache_alloc_node
>                            |          |          __alloc_skb
>                            |          |          sk_stream_alloc_skb
>                            |          |          tcp_sendmsg
>                            |          |          inet_sendmsg
>                            |          |          sock_sendmsg
>                            |          |          SYSC_sendto
>                            |          |          sys_sendto
>                            |          |          entry_SYSCALL_64_fastpath
>                            |          |          __libc_send
>                            |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
>                            |          |          |          
>                            |          |          |--25.00%-- 0x7f4ad9bf8de0
>                            |          |          |          
>                            |          |          |--25.00%-- 0x7f4ad7df8de0
>                            |          |          |          
>                            |          |          |--25.00%-- 0x7f4ad77f8de0
>                            |          |          |          
>                            |          |           --25.00%-- 0x7f4ad59f8de0
>                            |          |          
>                            |          |--0.99%-- unmap_underlying_metadata
>                            |          |          do_blockdev_direct_IO
>                            |          |          __blockdev_direct_IO
>                            |          |          xfs_vm_direct_IO
>                            |          |          xfs_file_dio_aio_write
>                            |          |          xfs_file_write_iter
>                            |          |          aio_run_iocb
>                            |          |          do_io_submit
>                            |          |          sys_io_submit
>                            |          |          entry_SYSCALL_64_fastpath
>                            |          |          io_submit
>                            |          |          0x46d98a
>                            |          |          
>                            |          |--0.99%-- __kmalloc_node_track_caller
>                            |          |          __kmalloc_reserve.isra.32
>                            |          |          __alloc_skb
>                            |          |          sk_stream_alloc_skb
>                            |          |          tcp_sendmsg
>                            |          |          inet_sendmsg
>                            |          |          sock_sendmsg
>                            |          |          SYSC_sendto
>                            |          |          sys_sendto
>                            |          |          entry_SYSCALL_64_fastpath
>                            |          |          __libc_send
>                            |          |          _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
>                            |          |          0x7f4ad6bf8de0
>                            |          |          
>                            |           --0.99%-- task_work_run
>                            |                     do_notify_resume
>                            |                     int_signal
>                            |                     __libc_close
>                            |                     std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
>                            |          
>                             --0.98%-- __cond_resched_softirq
>                                       release_sock
>                                       tcp_sendmsg
>                                       inet_sendmsg
>                                       sock_sendmsg
>                                       SYSC_sendto
>                                       sys_sendto
>                                       entry_SYSCALL_64_fastpath
>                                       __libc_send
>                                       _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv
>                                       0x7f4ada1f8de0
> 
> 
> 
> #
> # (For a higher level overview, try: perf report --sort comm,dso)
> #

> [164814.835933] CPU: 22 PID: 48042 Comm: scylla Tainted: G            E   4.2.6-200.fc22.x86_64 #1
> [164814.835936] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
> [164814.835937]  0000000000000000 00000000a8713b7a ffff8802fb977ab8 ffffffff817729ea
> [164814.835941]  0000000000000000 ffff88076a69f780 ffff8802fb977ad8 ffffffffa03217a6
> [164814.835946]  ffff88077119bcb0 0000000000000000 ffff8802fb977b08 ffffffffa034e749
> [164814.835951] Call Trace:
> [164814.835954]  [<ffffffff817729ea>] dump_stack+0x45/0x57
> [164814.835971]  [<ffffffffa03217a6>] xfs_buf_stale+0x26/0x80 [xfs]
> [164814.835989]  [<ffffffffa034e749>] xfs_trans_binval+0x79/0x100 [xfs]
> [164814.836001]  [<ffffffffa02f479b>] xfs_bmap_btree_to_extents+0x12b/0x1a0 [xfs]
> [164814.836012]  [<ffffffffa02f8977>] xfs_bunmapi+0x967/0x9f0 [xfs]
> [164814.836027]  [<ffffffffa0334b9e>] xfs_itruncate_extents+0x10e/0x220 [xfs]
> [164814.836044]  [<ffffffffa033f75a>] ? kmem_zone_alloc+0x5a/0xe0 [xfs]
> [164814.836084]  [<ffffffffa0334d49>] xfs_inactive_truncate+0x99/0x110 [xfs]
> [164814.836120]  [<ffffffffa0335aa2>] xfs_inactive+0x102/0x120 [xfs]
> [164814.836135]  [<ffffffffa033a6cf>] xfs_fs_evict_inode+0x6f/0xa0 [xfs]
> [164814.836138]  [<ffffffff81238d76>] evict+0xa6/0x170
> [164814.836140]  [<ffffffff81239026>] iput+0x196/0x220
> [164814.836147]  [<ffffffff81234fe4>] __dentry_kill+0x174/0x1c0
> [164814.836150]  [<ffffffff8123514b>] dput+0x11b/0x200
> [164814.836155]  [<ffffffff8121fe02>] __fput+0x172/0x1e0
> [164814.836158]  [<ffffffff8121febe>] ____fput+0xe/0x10
> [164814.836161]  [<ffffffff810bab75>] task_work_run+0x85/0xb0
> [164814.836164]  [<ffffffff81014a4d>] do_notify_resume+0x8d/0x90
> [164814.836167]  [<ffffffff817795bc>] int_signal+0x12/0x17

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-30 14:10 ` Brian Foster
@ 2015-11-30 14:29   ` Avi Kivity
  2015-11-30 16:14     ` Brian Foster
  2015-11-30 15:49   ` Glauber Costa
  1 sibling, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-11-30 14:29 UTC (permalink / raw)
  To: Brian Foster, Glauber Costa; +Cc: xfs



On 11/30/2015 04:10 PM, Brian Foster wrote:
>> 2) xfs_buf_lock -> down
>> This is one I truly don't understand. What can be causing contention
>> in this lock? We never have two different cores writing to the same
> >> buffer, nor should we have the same core doing so.
>>
> This is not one single lock. An XFS buffer is the data structure used to
> modify/log/read-write metadata on-disk and each buffer has its own lock
> to prevent corruption. Buffer lock contention is possible because the
> filesystem has bits of "global" metadata that has to be updated via
> buffers.
>
> For example, usually one has multiple allocation groups to maximize
> parallelism, but we still have per-ag metadata that has to be tracked
> globally with respect to each AG (e.g., free space trees, inode
> allocation trees, etc.). Any operation that affects this metadata (e.g.,
> block/inode allocation) has to lock the agi/agf buffers along with any
> buffers associated with the modified btree leaf/node blocks, etc.
>
> One example in your attached perf traces has several threads looking to
> acquire the AGF, which is a per-AG data structure for tracking free
> space in the AG. One thread looks like the inode eviction case noted
> above (freeing blocks), another looks like a file truncate (also freeing
> blocks), and yet another is a block allocation due to a direct I/O
> write. Were any of these operations directed to an inode in a separate
> AG, they would be able to proceed in parallel (but I believe they would
> still hit the same codepaths as far as perf can tell).

I guess we can mitigate (but not eliminate) this by creating more 
allocation groups.  What is the default value for agsize?  Are there any 
downsides to decreasing it, besides consuming more memory?

Are those locks held around I/O, or just CPU operations, or a mix?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-30 14:10 ` Brian Foster
  2015-11-30 14:29   ` Avi Kivity
@ 2015-11-30 15:49   ` Glauber Costa
  2015-12-01 13:11     ` Brian Foster
  1 sibling, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2015-11-30 15:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: Avi Kivity, xfs

Hi Brian

>> 1) xfs_buf_lock -> xfs_log_force.
>>
>> I've started wondering what would make xfs_log_force sleep. But then I
>> have noticed that xfs_log_force will only be called when a buffer is
>> marked stale. Most of the times a buffer is marked stale seems to be
>> due to errors. Although that is not my case (more on that), it got me
>> thinking that maybe the right thing to do would be to avoid hitting
>> this case altogether?
>>
>
> I'm not following where you get the "only if marked stale" part..? It
> certainly looks like that's one potential purpose for the call, but this
> is called in a variety of other places as well. E.g., forcing the log
> via pushing on the ail when it has pinned items is another case. The ail
> push itself can originate from transaction reservation, etc., when log
> space is needed. In other words, I'm not sure this is something that's
> easily controlled from userspace, if at all. Rather, it's a significant
> part of the wider state machine the fs uses to manage logging.

I understand that in general xfs_log_force can be called from many
places. But in our traces the ones we see sleeping are coming from
xfs_buf_lock. The code for xfs_buf_lock reads:

    if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
        xfs_log_force(bp->b_target->bt_mount, 0);


which, if I read correctly, will be called only for stale buffers. True,
they happen to be pinned as well, but somehow the stale part caught my
attention. It seemed to me from a brief look that the stale condition
was a more "avoidable" one. (Keep in mind I am not an awesome XFSer; I
may be missing something.)

>
>> The file example-stale.txt contains a backtrace of the case where we
>> are being marked as stale. It seems to be happening when we convert
> >> the inode's extents from unwritten to real. Can this case be
>> avoided? I won't pretend I know the intricacies of this, but couldn't
>> we be keeping extents from the very beginning to avoid creating stale
>> buffers?
>>
>
> This is down in xfs_fs_evict_inode()->xfs_inactive(), which is generally
> when an inode is evicted from cache. In this case, it looks like the
> inode is unlinked (permanently removed), the extents are being removed
> and a bmap btree block is being invalidated as part of that overall
> process. I don't think this has anything to do with unwritten extents.
>

Cool. If the inode is indeed unlinked, could that still be triggering
that condition in xfs_buf_lock? I am not even close to fully
understanding how XFS manages and/or recycles buffers, but it seems to
me that if an inode is going away, there isn't really any reason to
contend for its buffers.

>> 2) xfs_buf_lock -> down
>> This is one I truly don't understand. What can be causing contention
>> in this lock? We never have two different cores writing to the same
> >> buffer, nor should we have the same core doing so.
>>
>
> This is not one single lock. An XFS buffer is the data structure used to
> modify/log/read-write metadata on-disk and each buffer has its own lock
> to prevent corruption. Buffer lock contention is possible because the
> filesystem has bits of "global" metadata that has to be updated via
> buffers.

I see. Since I hate guessing, is there any way you would recommend for
us to probe the system to determine if this contention scenario is
indeed the one we are seeing?

We usually open a file, write to it from a single core only,
sequentially, direct IO only, as well behavedly as we can, with all
the effort in the world to be good kids to the extent Santa will bring
us presents without us even asking.

So we were very puzzled to see contention. Contention for global
metadata updates is the best explanation we've had so far, and would
be great if we could verify it is indeed the case.

>
> For example, usually one has multiple allocation groups to maximize
> parallelism, but we still have per-ag metadata that has to be tracked
> globally with respect to each AG (e.g., free space trees, inode
> allocation trees, etc.). Any operation that affects this metadata (e.g.,
> block/inode allocation) has to lock the agi/agf buffers along with any
> buffers associated with the modified btree leaf/node blocks, etc.
>
> One example in your attached perf traces has several threads looking to
> acquire the AGF, which is a per-AG data structure for tracking free
> space in the AG. One thread looks like the inode eviction case noted
> above (freeing blocks), another looks like a file truncate (also freeing
> blocks), and yet another is a block allocation due to a direct I/O
> write. Were any of these operations directed to an inode in a separate
> AG, they would be able to proceed in parallel (but I believe they would
> still hit the same codepaths as far as perf can tell).

This is great, great, awesome info Brian. Thanks. We are so far
allocating inodes and truncating them when we need a new one, but
maybe there is some allocation pattern that is friendlier to the AG? I
understand that with such a data structure it may very well be
impossible to get rid of all waiting, but we will certainly do all we
can to mitigate it.

>
>> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time
>>
>> You guys seem to have an interface to avoid that, by setting the
>> FMODE_NOCMTIME flag. This is done by issuing the open by handle ioctl,
>> which will set this flag for all regular files. That's great, but that
>> ioctl required CAP_SYS_ADMIN, which is a big no for us, since we run
>> our server as an unprivileged user. I don't understand, however, why
> >> such a strict check is needed. If we have full rights on the
>> filesystem, why can't we issue this operation? In my view, CAP_FOWNER
> >> should already be enough. I do understand the handles have to be stable
>> and a file can have its ownership changed, in which case the previous
>> owner would keep the handle valid. Is that the reason you went with
>> the most restrictive capability ?
>
> I'm not familiar enough with the open-by-handle stuff to comment on the
> permission constraints. Perhaps Dave or others can comment further on
> this bit...
>
> Brian

Thanks again Brian. The pointer to the AG stuff was really helpful.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-30 14:29   ` Avi Kivity
@ 2015-11-30 16:14     ` Brian Foster
  2015-12-01  9:08       ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-11-30 16:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> 
> 
> On 11/30/2015 04:10 PM, Brian Foster wrote:
> >>2) xfs_buf_lock -> down
> >>This is one I truly don't understand. What can be causing contention
> >>in this lock? We never have two different cores writing to the same
> >>buffer, nor should we have the same core doing so.
> >>
> >This is not one single lock. An XFS buffer is the data structure used to
> >modify/log/read-write metadata on-disk and each buffer has its own lock
> >to prevent corruption. Buffer lock contention is possible because the
> >filesystem has bits of "global" metadata that has to be updated via
> >buffers.
> >
> >For example, usually one has multiple allocation groups to maximize
> >parallelism, but we still have per-ag metadata that has to be tracked
> >globally with respect to each AG (e.g., free space trees, inode
> >allocation trees, etc.). Any operation that affects this metadata (e.g.,
> >block/inode allocation) has to lock the agi/agf buffers along with any
> >buffers associated with the modified btree leaf/node blocks, etc.
> >
> >One example in your attached perf traces has several threads looking to
> >acquire the AGF, which is a per-AG data structure for tracking free
> >space in the AG. One thread looks like the inode eviction case noted
> >above (freeing blocks), another looks like a file truncate (also freeing
> >blocks), and yet another is a block allocation due to a direct I/O
> >write. Were any of these operations directed to an inode in a separate
> >AG, they would be able to proceed in parallel (but I believe they would
> >still hit the same codepaths as far as perf can tell).
> 
> I guess we can mitigate (but not eliminate) this by creating more allocation
> groups.  What is the default value for agsize?  Are there any downsides to
> decreasing it, besides consuming more memory?
> 

I suppose so, but I would be careful to check that you actually see
contention and test that increasing agcount actually helps. As
mentioned, I'm not sure off hand if the perf trace alone would look any
different if you have multiple metadata operations in progress on
separate AGs.

My understanding is that there are diminishing returns to high AG counts
and usually 32-64 is sufficient for most storage. Dave might be able to
elaborate more on that... (I think this would make a good FAQ entry,
actually).

The agsize/agcount mkfs-time heuristics change depending on the type of
storage. A single AG can be up to 1TB and if the fs is not considered
"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
default up to 4TB. If a stripe unit is set, the agsize/agcount is
adjusted depending on the size of the overall volume (see
xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
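
If it helps to sanity-check what mkfs actually picked on an existing
filesystem, something like the sketch below reads the AG geometry back
from userspace via the XFS_IOC_FSGEOMETRY ioctl (this assumes the
xfsprogs development headers are installed; xfs_info reports the same
numbers from the command line):

/*
 * Sketch only: print the AG geometry of a mounted XFS filesystem.
 * Assumes <xfs/xfs.h> from the xfsprogs headers is available.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

int main(int argc, char **argv)
{
	struct xfs_fsop_geom geo;
	int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY);

	if (fd < 0 || ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
		perror("XFS_IOC_FSGEOMETRY");
		return 1;
	}
	printf("agcount=%u agblocks=%u blocksize=%u (~%llu MiB per AG)\n",
	       geo.agcount, geo.agblocks, geo.blocksize,
	       (unsigned long long)geo.agblocks * geo.blocksize >> 20);
	close(fd);
	return 0;
}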

> Are those locks held around I/O, or just CPU operations, or a mix?

I believe it's a mix of modifications and I/O, though it looks like some
of the I/O cases don't necessarily wait on the lock. E.g., the AIL
pushing case will trylock and defer to the next list iteration if the
buffer is busy.

Brian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-28  2:43 sleeps and waits during io_submit Glauber Costa
  2015-11-30 14:10 ` Brian Foster
@ 2015-11-30 23:10 ` Dave Chinner
  2015-11-30 23:51   ` Glauber Costa
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-11-30 23:10 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, xfs

On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote:
> Hello my dear XFSers,
> 
> For those of you who don't know, we at ScyllaDB produce a modern NoSQL
> data store that, at the moment, runs on top of XFS only. We deal
> exclusively with asynchronous and direct IO, due to our
> thread-per-core architecture. Due to that, we avoid issuing any
> operation that will sleep.
> 
> While debugging an extreme case of bad performance (most likely
> related to a not-so-great disk), I have found a variety of cases in
> which XFS blocks. To find those, I have used perf record -e
> sched:sched_switch -p <pid_of_db>, and I am attaching the perf report
> as xfs-sched_switch.log. Please note that this doesn't tell me for how
> long we block, but as mentioned before, blocking operations outside
> our control are detrimental to us regardless of the elapsed time.
> 
> For those who are not acquainted to our internals, please ignore
> everything in that file but the xfs functions. For the xfs symbols,
> there are two kinds of events: the ones that are a children of
> io_submit, where we don't tolerate blocking, and the ones that are
> children of our helper IO thread, to where we push big operations that
> we know will block until we can get rid of them all. We care about the
> former and ignore the latter.
> 
> Please allow me to ask you a couple of questions about those findings.
> If we are doing anything wrong, advise on best practices is truly
> welcome.
> 
> 1) xfs_buf_lock -> xfs_log_force.
> 
> I've started wondering what would make xfs_log_force sleep. But then I
> have noticed that xfs_log_force will only be called when a buffer is
> marked stale. Most of the times a buffer is marked stale seems to be
> due to errors. Although that is not my case (more on that), it got me
> thinking that maybe the right thing to do would be to avoid hitting
> this case altogether?

The buffer is stale because it has recently been freed, and we
cannot re-use it until the transaction that freed it has been
committed to the journal. e.g. this trace:

   --3.15%-- _xfs_log_force
             xfs_log_force
             xfs_buf_lock
             _xfs_buf_find
             xfs_buf_get_map
             xfs_trans_get_buf_map
             xfs_btree_get_bufl
             xfs_bmap_extents_to_btree
             xfs_bmap_add_extent_hole_real
             xfs_bmapi_write
             xfs_iomap_write_direct
             __xfs_get_blocks
             xfs_get_blocks_direct
             do_blockdev_direct_IO
             __blockdev_direct_IO
             xfs_vm_direct_IO
             xfs_file_dio_aio_write
             xfs_file_write_iter
             aio_run_iocb
             do_io_submit
             sys_io_submit
             entry_SYSCALL_64_fastpath
             io_submit
             0x46d98a

implies something like this has happened:

truncate
  free extent
    extent list now fits inline in inode
    btree-to-extent format change
      free last bmbt block X
        mark block contents stale
	add block to busy extent list
	place block on AGFL

AIO write
  allocate extent
    inline extents full
    extent-to-btree conversion
        allocate bmbt block
	  grab block X from free list
	  get locked buffer for block X
	    xfs_buf_lock
	      buffer stale
	        log force to commit previous free transaction to disk
		.....
		log force completes
		  buffer removed from busy extent list
		  buffer no longer stale
	 add bmbt record to new block
	 update btree indexes
	 ....


And, looking at the trace you attached, we see this:

dump_stack
xfs_buf_stale
xfs_trans_binval
xfs_bmap_btree_to_extents		<<<<<<<<<
xfs_bunmapi
xfs_itruncate_extents

So I'd say this is a pretty clear indication that we're immediately
recycling freed blocks from the free list here....
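
In userspace terms, the pattern boils down to something like the sketch
below (hypothetical file name, libaio, error handling trimmed). It is
not a guaranteed reproducer - the file needs enough extents that the
truncate frees a bmbt block and the following write needs one again -
but it is the shape of the truncate-then-rewrite sequence above:

/*
 * Sketch: truncate frees extents (the freed bmbt block goes stale/busy),
 * then an immediate AIO DIO write may try to allocate that same block
 * and stall in xfs_buf_lock -> xfs_log_force.  Build with -laio.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("datafile", O_CREAT | O_RDWR | O_DIRECT, 0644);

	if (fd < 0 || io_setup(8, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);

	ftruncate(fd, 0);		/* free extents: bmbt block marked stale */

	io_prep_pwrite(&cb, fd, buf, 4096, 0);
	io_submit(ctx, 1, cbs);		/* allocation may want that block back */
	io_getevents(ctx, 1, 1, &ev, NULL);

	io_destroy(ctx);
	close(fd);
	return 0;
}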

> The file example-stale.txt contains a backtrace of the case where we
> are being marked as stale. It seems to be happening when we convert
> the inode's extents from unwritten to real. Can this case be
> avoided? I won't pretend I know the intricacies of this, but couldn't
> we be keeping extents from the very beginning to avoid creating stale
> buffers?
> 
> 2) xfs_buf_lock -> down
> This is one I truly don't understand. What can be causing contention
> in this lock? We never have two different cores writing to the same
> buffer, nor should we have the same core doing so.

As Brian pointed out, this is probably AGF or AGI contention -
attempting to allocate/free extents or inodes in the same AG at the
same time will show this sort of pattern. This trace shows AGF
contention:

  down
  xfs_buf_lock
  _xfs_buf_find
  xfs_buf_get_map
  xfs_buf_read_map
  xfs_trans_read_buf_map
  xfs_read_agf
  .....

This trace shows AGI contention:

  down
  xfs_buf_lock
  _xfs_buf_find
  xfs_buf_get_map
  xfs_buf_read_map
  xfs_trans_read_buf_map
  xfs_read_agi
  ....

> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time

This trace?:

 rwsem_down_write_failed
 call_rwsem_down_write_failed
 xfs_ilock
 xfs_vn_update_time
 file_update_time
 xfs_file_aio_write_checks
 xfs_file_dio_aio_write
 xfs_file_write_iter
 aio_run_iocb
 do_io_submit
 sys_io_submit
 entry_SYSCALL_64_fastpath
 io_submit
 0x46d98a

Which is an mtime timestamp update racing with another operation
that takes the internal metadata lock (e.g. block mapping/allocation
for that inode).

> You guys seem to have an interface to avoid that, by setting the
> FMODE_NOCMTIME flag.

We've talked about exposing this through open() for Ceph.

http://www.kernelhub.org/?p=2&msg=744325
https://lkml.org/lkml/2015/5/15/671

Read the first thread for why it's problematic to expose this to
userspace - I won't repeat it all here.
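
For reference, the privileged route that exists today looks roughly like
the sketch below, using xfsprogs' libhandle wrappers around the
open-by-handle ioctl (link with -lhandle); as noted, the ioctl itself is
gated on CAP_SYS_ADMIN, which is exactly the problem for an unprivileged
server:

/*
 * Sketch only: open a file by handle so XFS sets FMODE_NOCMTIME on the
 * resulting fd, skipping the c/mtime updates in the write path.
 * Requires CAP_SYS_ADMIN for the underlying open-by-handle ioctl.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <xfs/handle.h>

int open_nocmtime(const char *path)
{
	void *hanp;
	size_t hlen;
	int fd;

	if (path_to_handle((char *)path, &hanp, &hlen) < 0)
		return -1;
	fd = open_by_handle(hanp, hlen, O_RDWR);
	free_handle(hanp, hlen);
	return fd;
}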

As it is, there was some code recently hacked into ext4 to reduce
mtime overhead - the MS_LAZYTIME superblock option. What it does is
prevent the inode from being marked dirty when timestamps are
updated and hence the timestamps are never journalled or written
until something else marks the inode metadata dirty (e.g. block
allocation). ext4 gets away with this because it doesn't actually
journal timestamp changes - they get captured in the journal by
other modifications that are journalled, but still rely on the inode
being marked dirty for fsync, writeback and inode cache eviction
doing the right thing.

The ext4 implementation looks like the timestamp updates can be
thrown away, as the inodes are not marked dirty and so on memory
pressure they will simply be reclaimed without writing back the
updated timestamps that are held in memory. I suspect fsync will
also have problems on ext4 as the inode is not metadata dirty or
journalled, and hence the timestamp changes will never get written
back.

And, IMO, the worst part of the ext4 implementation is that the
inode buffer writeback code in ext4 now checks to see if any of the
other inodes in the buffer being written back need to have their
inode timestamps updated. IOWs, ext4 now does writeback of
/unjournalled metadata/ to inodes that are purely timestamp dirty.

We /used/ to do shit like this in XFS. We got rid of it in
preference of journalling everything because the corner cases in log
recovery meant that after a crash the inodes were in inconsistent
states, and that meant we had unexpected, unpredictable recovery
behaviour where files weren't the expected size and/or didn't
contain the expected data. Hence going back to the bad old days of
hacking around the journal "for speed" doesn't exactly fill me with
joy.

Let me have a think about how we can implement lazytime in a sane
way, such that fsync() works correctly, we don't throw away
timestamp changes in memory reclaim and we don't write unlogged
changes to the on-disk locations....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-30 23:10 ` Dave Chinner
@ 2015-11-30 23:51   ` Glauber Costa
  2015-12-01 20:30     ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2015-11-30 23:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Avi Kivity, xfs

Hi Dave


On Mon, Nov 30, 2015 at 6:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote:
>> Hello my dear XFSers,
>>
>> For those of you who don't know, we at ScyllaDB produce a modern NoSQL
>> data store that, at the moment, runs on top of XFS only. We deal
>> exclusively with asynchronous and direct IO, due to our
>> thread-per-core architecture. Due to that, we avoid issuing any
>> operation that will sleep.
>>
>> While debugging an extreme case of bad performance (most likely
>> related to a not-so-great disk), I have found a variety of cases in
>> which XFS blocks. To find those, I have used perf record -e
>> sched:sched_switch -p <pid_of_db>, and I am attaching the perf report
>> as xfs-sched_switch.log. Please note that this doesn't tell me for how
>> long we block, but as mentioned before, blocking operations outside
>> our control are detrimental to us regardless of the elapsed time.
>>
>> For those who are not acquainted to our internals, please ignore
>> everything in that file but the xfs functions. For the xfs symbols,
>> there are two kinds of events: the ones that are a children of
>> io_submit, where we don't tolerate blocking, and the ones that are
>> children of our helper IO thread, to where we push big operations that
>> we know will block until we can get rid of them all. We care about the
>> former and ignore the latter.
>>
>> Please allow me to ask you a couple of questions about those findings.
>> If we are doing anything wrong, advise on best practices is truly
>> welcome.
>>
>> 1) xfs_buf_lock -> xfs_log_force.
>>
>> I've started wondering what would make xfs_log_force sleep. But then I
>> have noticed that xfs_log_force will only be called when a buffer is
>> marked stale. Most of the times a buffer is marked stale seems to be
>> due to errors. Although that is not my case (more on that), it got me
>> thinking that maybe the right thing to do would be to avoid hitting
>> this case altogether?
>
> The buffer is stale because it has recently been freed, and we
> cannot re-use it until the transaction that freed it has been
> committed to the journal. e.g. this trace:
>
>    --3.15%-- _xfs_log_force
>              xfs_log_force
>              xfs_buf_lock
>              _xfs_buf_find
>              xfs_buf_get_map
>              xfs_trans_get_buf_map
>              xfs_btree_get_bufl
>              xfs_bmap_extents_to_btree
>              xfs_bmap_add_extent_hole_real
>              xfs_bmapi_write
>              xfs_iomap_write_direct
>              __xfs_get_blocks
>              xfs_get_blocks_direct
>              do_blockdev_direct_IO
>              __blockdev_direct_IO
>              xfs_vm_direct_IO
>              xfs_file_dio_aio_write
>              xfs_file_write_iter
>              aio_run_iocb
>              do_io_submit
>              sys_io_submit
>              entry_SYSCALL_64_fastpath
>              io_submit
>              0x46d98a
>
> implies something like this has happened:
>
> truncate
>   free extent
>     extent list now fits inline in inode
>     btree-to-extent format change
>       free last bmbt block X
>         mark block contents stale
>         add block to busy extent list
>         place block on AGFL
>
> AIO write
>   allocate extent
>     inline extents full
>     extent-to-btree conversion
>         allocate bmbt block
>           grab block X from free list
>           get locked buffer for block X
>             xfs_buf_lock
>               buffer stale
>                 log force to commit previous free transaction to disk
>                 .....
>                 log force completes
>                   buffer removed from busy extent list
>                   buffer no longer stale
>          add bmbt record to new block
>          update btree indexes
>          ....
>
>
> And, looking at the trace you attached, we see this:
>
> dump_stack
> xfs_buf_stale
> xfs_trans_binval
> xfs_bmap_btree_to_extents               <<<<<<<<<
> xfs_bunmapi
> xfs_itruncate_extents
>
> So I'd say this is a pretty clear indication that we're immediately
> recycling freed blocks from the free list here....
>
>> The file example-stale.txt contains a backtrace of the case where we
>> are being marked as stale. It seems to be happening when we convert
>> the inode's extents from unwritten to real. Can this case be
>> avoided? I won't pretend I know the intricacies of this, but couldn't
>> we be keeping extents from the very beginning to avoid creating stale
>> buffers?
>>
>> 2) xfs_buf_lock -> down
>> This is one I truly don't understand. What can be causing contention
>> in this lock? We never have two different cores writing to the same
>> buffer, nor should we have the same core doing so.
>
> As Brian pointed out, this is probably AGF or AGI contention -
> attempting to allocate/free extents or inodes in the same AG at the
> same time will show this sort of pattern. This trace shows AGF
> contention:
>
>   down
>   xfs_buf_lock
>   _xfs_buf_find
>   xfs_buf_get_map
>   xfs_buf_read_map
>   xfs_trans_read_buf_map
>   xfs_read_agf
>   .....
>
> This trace shows AGI contention:
>
>   down
>   xfs_buf_lock
>   _xfs_buf_find
>   xfs_buf_get_map
>   xfs_buf_read_map
>   xfs_trans_read_buf_map
>   xfs_read_agi
>   ....
>

Great. I will take a look at how we can mitigate those on our side. I
will need some time to understand all of that better, so for now I'd
just leave you guys with a big thank you.

>> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time
>
> This trace?:
>
>  rwsem_down_write_failed
>  call_rwsem_down_write_failed
>  xfs_ilock
>  xfs_vn_update_time
>  file_update_time
>  xfs_file_aio_write_checks
>  xfs_file_dio_aio_write
>  xfs_file_write_iter
>  aio_run_iocb
>  do_io_submit
>  sys_io_submit
>  entry_SYSCALL_64_fastpath
>  io_submit
>  0x46d98a
>
> Which is an mtime timestamp update racing with another operation
> that takes the internal metadata lock (e.g. block mapping/allocation
> for that inode).
>
>> You guys seem to have an interface to avoid that, by setting the
>> FMODE_NOCMTIME flag.
>
> We've talked about exposing this through open() for Ceph.
>
> http://www.kernelhub.org/?p=2&msg=744325
> https://lkml.org/lkml/2015/5/15/671
>
> Read the first thread for why it's problematic to expose this to
> userspace - I won't repeat it all here.
>
> As it is, there was some code recently hacked into ext4 to reduce
> mtime overhead - the MS_LAZYTIME superblock option. What it does is
> prevent the inode from being marked dirty when timestamps are
> updated and hence the timestamps are never journalled or written
> until something else marks the inode metadata dirty (e.g. block
> allocation). ext4 gets away with this because it doesn't actually
> journal timestamp changes - they get captured in the journal by
> other modifications that are journalled, but still rely on the inode
> being marked dirty for fsync, writeback and inode cache eviction
> doing the right thing.
>
> The ext4 implementation looks like the timestamp updates can be
> thrown away, as the inodes are not marked dirty and so on memory
> pressure they will simply be reclaimed without writing back the
> updated timestamps that are held in memory. I suspect fsync will
> also have problems on ext4 as the inode is not metadata dirty or
> journalled, and hence the timestamp changes will never get written
> back.
>
> And, IMO, the worst part of the ext4 implementation is that the
> inode buffer writeback code in ext4 now checks to see if any of the
> other inodes in the buffer being written back need to have their
> inode timestamps updated. IOWs, ext4 now does writeback of
> /unjournalled metadata/ to inodes that are purely timestamp dirty.
>
> We /used/ to do shit like this in XFS. We got rid of it in
> preference of journalling everything because the corner cases in log
> recovery meant that after a crash the inodes were in inconsistent
> states, and that meant we had unexpected, unpredictable recovery
> behaviour where files weren't the expected size and/or didn't
> contain the expected data. Hence going back to the bad old days of
> hacking around the journal "for speed" doesn't exactly fill me with
> joy.
>
> Let me have a think about how we can implement lazytime in a sane
> way, such that fsync() works correctly, we don't throw away
> timestamp changes in memory reclaim and we don't write unlogged
> changes to the on-disk locations....

I trust you fully for matters related to speed.

Keep in mind, though, that at least for us the fact that it blocks is
a lot worse than the fact that it is slow. We can work around slow,
but blocking basically means that we won't have any more work to push
- since we don't do threading. The processor that stalls just sits
idle until the lock is released. So any non-blocking solution to this
would already be a win for us.



>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-30 16:14     ` Brian Foster
@ 2015-12-01  9:08       ` Avi Kivity
  2015-12-01 13:11         ` Brian Foster
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-01  9:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs

On 11/30/2015 06:14 PM, Brian Foster wrote:
> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>
>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>>> 2) xfs_buf_lock -> down
>>>> This is one I truly don't understand. What can be causing contention
>>>> in this lock? We never have two different cores writing to the same
> >>>> buffer, nor should we have the same core doing so.
>>>>
>>> This is not one single lock. An XFS buffer is the data structure used to
>>> modify/log/read-write metadata on-disk and each buffer has its own lock
>>> to prevent corruption. Buffer lock contention is possible because the
>>> filesystem has bits of "global" metadata that has to be updated via
>>> buffers.
>>>
>>> For example, usually one has multiple allocation groups to maximize
>>> parallelism, but we still have per-ag metadata that has to be tracked
>>> globally with respect to each AG (e.g., free space trees, inode
>>> allocation trees, etc.). Any operation that affects this metadata (e.g.,
>>> block/inode allocation) has to lock the agi/agf buffers along with any
>>> buffers associated with the modified btree leaf/node blocks, etc.
>>>
>>> One example in your attached perf traces has several threads looking to
>>> acquire the AGF, which is a per-AG data structure for tracking free
>>> space in the AG. One thread looks like the inode eviction case noted
>>> above (freeing blocks), another looks like a file truncate (also freeing
>>> blocks), and yet another is a block allocation due to a direct I/O
>>> write. Were any of these operations directed to an inode in a separate
>>> AG, they would be able to proceed in parallel (but I believe they would
>>> still hit the same codepaths as far as perf can tell).
>> I guess we can mitigate (but not eliminate) this by creating more allocation
>> groups.  What is the default value for agsize?  Are there any downsides to
>> decreasing it, besides consuming more memory?
>>
> I suppose so, but I would be careful to check that you actually see
> contention and test that increasing agcount actually helps. As
> mentioned, I'm not sure off hand if the perf trace alone would look any
> different if you have multiple metadata operations in progress on
> separate AGs.
>
> My understanding is that there are diminishing returns to high AG counts
> and usually 32-64 is sufficient for most storage. Dave might be able to
> elaborate more on that... (I think this would make a good FAQ entry,
> actually).
>
> The agsize/agcount mkfs-time heuristics change depending on the type of
> storage. A single AG can be up to 1TB and if the fs is not considered
> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
> default up to 4TB. If a stripe unit is set, the agsize/agcount is
> adjusted depending on the size of the overall volume (see
> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).

We'll experiment with this.  Surely it depends on more than the amount 
of storage?  If you have a high op rate you'll be more likely to excite 
contention, no?

>
>> Are those locks held around I/O, or just CPU operations, or a mix?
> I believe it's a mix of modifications and I/O, though it looks like some
> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
> pushing case will trylock and defer to the next list iteration if the
> buffer is busy.
>

Ok.  For us sleeping in io_submit() is death because we have no other 
thread on that core to take its place.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01  9:08       ` Avi Kivity
@ 2015-12-01 13:11         ` Brian Foster
  2015-12-01 13:58           ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-12-01 13:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> On 11/30/2015 06:14 PM, Brian Foster wrote:
> >On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>
> >>On 11/30/2015 04:10 PM, Brian Foster wrote:
...
> >The agsize/agcount mkfs-time heuristics change depending on the type of
> >storage. A single AG can be up to 1TB and if the fs is not considered
> >"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
> >default up to 4TB. If a stripe unit is set, the agsize/agcount is
> >adjusted depending on the size of the overall volume (see
> >xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
> 
> We'll experiment with this.  Surely it depends on more than the amount of
> storage?  If you have a high op rate you'll be more likely to excite
> contention, no?
> 

Sure. The absolute optimal configuration for your workload probably
depends on more than storage size, but mkfs doesn't have that
information. In general, it tries to use the most reasonable
configuration based on the storage and expected workload. If you want to
tweak it beyond that, indeed, the best bet is to experiment with what
works.

> >
> >>Are those locks held around I/O, or just CPU operations, or a mix?
> >I believe it's a mix of modifications and I/O, though it looks like some
> >of the I/O cases don't necessarily wait on the lock. E.g., the AIL
> >pushing case will trylock and defer to the next list iteration if the
> >buffer is busy.
> >
> 
> Ok.  For us sleeping in io_submit() is death because we have no other thread
> on that core to take its place.
> 

The above is with regard to metadata I/O, whereas io_submit() is
obviously for user I/O. io_submit() can probably block in a variety of
places afaict... it might have to read in the inode extent map, allocate
blocks, take inode/ag locks, reserve log space for transactions, etc.

It sounds to me that first and foremost you want to make sure you don't
have however many parallel operations you typically have running
contending on the same inodes or AGs. Hint: creating files under
separate subdirectories is a quick and easy way to allocate inodes under
separate AGs (the agno is encoded into the upper bits of the inode
number). Reducing the frequency of block allocation/frees might also
help (e.g., preallocate and reuse files, 'mount -o ikeep,'
etc.). Beyond that, you probably want to make sure the log is large
enough to support all concurrent operations. See the xfs_log_grant_*
tracepoints for a window into if/how long transaction reservations might
be waiting on the log.
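
As a rough illustration of those two hints (hypothetical paths, and only
a sketch - not how your files are actually laid out): one subdirectory
per shard so the inodes land in different AGs, plus an up-front
preallocation so the hot io_submit() path rarely needs to allocate:

/*
 * Sketch only: XFS rotates new directories across AGs and allocates a
 * file's inode/blocks in its parent directory's AG, so per-shard
 * subdirectories spread metadata traffic across AGs.  Preallocating up
 * front means later AIO DIO writes mostly hit already-reserved
 * (unwritten) extents rather than taking block allocation paths.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int open_shard_file(const char *base, int shard, off_t prealloc_bytes)
{
	char path[512];
	int fd;

	snprintf(path, sizeof(path), "%s/shard-%d", base, shard);
	mkdir(path, 0755);			/* EEXIST ignored for brevity */

	snprintf(path, sizeof(path), "%s/shard-%d/data", base, shard);
	fd = open(path, O_CREAT | O_RDWR | O_DIRECT, 0644);
	if (fd < 0)
		return -1;

	if (posix_fallocate(fd, 0, prealloc_bytes) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}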

Brian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-11-30 15:49   ` Glauber Costa
@ 2015-12-01 13:11     ` Brian Foster
  2015-12-01 13:39       ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-12-01 13:11 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, xfs

On Mon, Nov 30, 2015 at 10:49:27AM -0500, Glauber Costa wrote:
> Hi Brian
> 
> >> 1) xfs_buf_lock -> xfs_log_force.
> >>
> >> I've started wondering what would make xfs_log_force sleep. But then I
> >> have noticed that xfs_log_force will only be called when a buffer is
> >> marked stale. Most of the times a buffer is marked stale seems to be
> >> due to errors. Although that is not my case (more on that), it got me
> >> thinking that maybe the right thing to do would be to avoid hitting
> >> this case altogether?
> >>
> >
> > I'm not following where you get the "only if marked stale" part..? It
> > certainly looks like that's one potential purpose for the call, but this
> > is called in a variety of other places as well. E.g., forcing the log
> > via pushing on the ail when it has pinned items is another case. The ail
> > push itself can originate from transaction reservation, etc., when log
> > space is needed. In other words, I'm not sure this is something that's
> > easily controlled from userspace, if at all. Rather, it's a significant
> > part of the wider state machine the fs uses to manage logging.
> 
> I understand that in general xfs_log_force can be called from many
> places. But in our traces the ones we see sleeping are coming from
> xfs_buf_lock. The code for xfs_buf_lock reads:
> 
>     if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
>         xfs_log_force(bp->b_target->bt_mount, 0);
> 
> 
> which, if I read it correctly, will be called only for stale buffers.
> True, they happen to be pinned as well, but somehow the stale part
> caught my attention. It seemed to me from briefly looking that the
> stale condition was a more "avoidable" one. (Keep in mind I am not an
> awesome XFSer; I may be missing something.)
> 

It's not really avoidable. It's an expected buffer state when metadata
blocks/buffers are freed, since actions must be taken if they are reused.
Dave's breakdown describes how you might be hitting this based on your
traces.

> >
> >> The file example-stale.txt contains a backtrace of the case where we
> >> are being marked as stale. It seems to be happening when we convert
> >> the inode's extents from unwritten to real. Can this case be
> >> avoided? I won't pretend I know the intricacies of this, but couldn't
> >> we be keeping extents from the very beginning to avoid creating stale
> >> buffers?
> >>
> >
> > This is down in xfs_fs_evict_inode()->xfs_inactive(), which is generally
> > when an inode is evicted from cache. In this case, it looks like the
> > inode is unlinked (permanently removed), the extents are being removed
> > and a bmap btree block is being invalidated as part of that overall
> > process. I don't think this has anything to do with unwritten extents.
> >
> 
> Cool. If the inode is indeed unlinked, could that still be triggering
> that condition in xfs_buf_lock? I am not even close to fully
> understanding how XFS manages and/or recycles buffers, but it seems to
> me that if an inode is going away, there isn't really any reason to
> contend for its buffers.
> 

I think so.. the inode removal will free various metadata blocks and
they could still be in the stale state by the time something else comes
along and allocates them (re: Dave's example covers this).

> >> 2) xfs_buf_lock -> down
> >> This is one I truly don't understand. What can be causing contention
> >> in this lock? We never have two different cores writing to the same
> >> buffer, nor should we have the same core doing so.
> >>
> >
> > This is not one single lock. An XFS buffer is the data structure used to
> > modify/log/read-write metadata on-disk and each buffer has its own lock
> > to prevent corruption. Buffer lock contention is possible because the
> > filesystem has bits of "global" metadata that has to be updated via
> > buffers.
> 
> I see. Since I hate guessing, is there any way you would recommend for
> us to probe the system to determine if this contention scenario is
> indeed the one we are seeing?
> 

I'd probably use perf as you are; I'm just not sure if there's any real
way to tell which threads are contending on which AGs. I'm not terribly
experienced with perf. I suppose that if the AGF/AGI read/lock traces
are high up on the list, the chances are higher you're spending a lot of
time waiting on AGs. It's relatively easy to increase the AG count and
allocate inodes under separate AGs (see my previous mail) as an
experiment to see if such contention is reduced.

> We usually open a file, write to it from a single core only,
> sequentially, direct IO only, as well-behaved as we can, with all
> the effort in the world to be good kids to the extent Santa will bring
> us presents without us even asking.
> 
> So we were very puzzled to see contention. Contention for global
> metadata updates is the best explanation we've had so far, and would
> be great if we could verify it is indeed the case.
> 
> >
> > For example, usually one has multiple allocation groups to maximize
> > parallelism, but we still have per-ag metadata that has to be tracked
> > globally with respect to each AG (e.g., free space trees, inode
> > allocation trees, etc.). Any operation that affects this metadata (e.g.,
> > block/inode allocation) has to lock the agi/agf buffers along with any
> > buffers associated with the modified btree leaf/node blocks, etc.
> >
> > One example in your attached perf traces has several threads looking to
> > acquire the AGF, which is a per-AG data structure for tracking free
> > space in the AG. One thread looks like the inode eviction case noted
> > above (freeing blocks), another looks like a file truncate (also freeing
> > blocks), and yet another is a block allocation due to a direct I/O
> > write. Were any of these operations directed to an inode in a separate
> > AG, they would be able to proceed in parallel (but I believe they would
> > still hit the same codepaths as far as perf can tell).
> 
> This is great, great, awesome info Brian. Thanks. We are so far
> allocating inodes and truncating them when we need a new one, but
> maybe there is some allocation pattern that is friendlier to the AG? I
> understand that with such a data structure it may very well be
> impossible to get rid of all waiting, but we will certainly do all we
> can to mitigate it.
> 

The truncate will free blocks and require block allocation on subsequent
writes. That might be something you could look into trying to avoid
(e.g., keeping files around and reusing space), but that depends on your
application design. Inode chunks are allocated and freed dynamically by
default as well. The 'ikeep' mount option keeps inode chunks around
indefinitely (even if individual inodes are all freed) if you wanted to
avoid inode chunk reallocation and know you have a fairly stable working
set of inodes. Per-inode extent size hints might be another option to
increase the size of allocations and perhaps reduce the number of them.
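
For reference, a sketch of setting that hint from userspace (assuming the
xfsprogs headers are installed; the hint size is only an example):

    #include <sys/ioctl.h>
    #include <xfs/xfs_fs.h>     /* struct fsxattr, XFS_IOC_FS[GS]ETXATTR */

    /* Ask for allocations on this file in roughly 'bytes'-sized extents. */
    int set_extsize_hint(int fd, unsigned int bytes)
    {
        struct fsxattr fsx;

        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
            return -1;
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;   /* enable the per-inode hint */
        fsx.fsx_extsize = bytes;               /* e.g. 32 * 1024 * 1024 */
        return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
    }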

Brian

> >
> >> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time
> >>
> >> You guys seem to have an interface to avoid that, by setting the
> >> FMODE_NOCMTIME flag. This is done by issuing the open by handle ioctl,
> >> which will set this flag for all regular files. That's great, but that
> >> ioctl requires CAP_SYS_ADMIN, which is a big no for us, since we run
> >> our server as an unprivileged user. I don't understand, however, why
> >> such a strict check is needed. If we have full rights on the
> >> filesystem, why can't we issue this operation? In my view, CAP_FOWNER
> >> should already be enough. I do understand the handles have to be stable
> >> and a file can have its ownership changed, in which case the previous
> >> owner would keep the handle valid. Is that the reason you went with
> >> the most restrictive capability?
> >
> > I'm not familiar enough with the open-by-handle stuff to comment on the
> > permission constraints. Perhaps Dave or others can comment further on
> > this bit...
> >
> > Brian
> 
> Thanks again Brian. The pointer to the AG stuff was really helpful.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 13:11     ` Brian Foster
@ 2015-12-01 13:39       ` Glauber Costa
  2015-12-01 14:02         ` Brian Foster
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2015-12-01 13:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: Avi Kivity, xfs

>
> The truncate will free blocks and require block allocation on subsequent
> writes. That might be something you could look into trying to avoid
> (e.g., keeping files around and reusing space), but that depends on your
> application design.


This one is a bit hard. We have a journal-like structure for the
modifications issued to the data store, which dominates most of our
write workloads (including this one that I am discussing here). We
could keep them around by renaming them outside of user visibility and
then renaming them back, but that would mean that we are now using
twice as much space. Perhaps we could use a pool that can at least
guarantee one or two allocations from a pre-existing file. I am
assuming here that renaming the file won't block. If it does, we are
better off not doing so.
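
Roughly what I have in mind is the sketch below (the "recycle" directory
and the single-spare policy are made up, and I am assuming rename() stays
cheap):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Take a commitlog segment from a small recycle pool if one is there,
     * otherwise create a fresh file. The pool directory must already exist. */
    int take_segment(const char *newname)
    {
        if (rename("recycle/segment", newname) == 0)
            return open(newname, O_WRONLY | O_DIRECT);
        if (errno == ENOENT)    /* pool empty: fall back to a new file */
            return open(newname, O_CREAT | O_WRONLY | O_DIRECT, 0644);
        return -1;
    }

    /* Park a finished segment in the pool (replacing any older spare)
     * instead of unlinking it, so its blocks stay allocated for reuse. */
    int retire_segment(const char *name)
    {
        return rename(name, "recycle/segment");
    }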

> Inodes chunks are allocated and freed dynamically by
> default as well. The 'ikeep' mount option keeps inode chunks around
> indefinitely (even if individual inodes are all freed) if you wanted to
> avoid inode chunk reallocation and know you have a fairly stable working
> set of inodes.

I believe we do have a fairly stable inode working set, even though
that depends a bit on what's considered stable. For our journal-like
structure, we will keep them around until we are sure the information
is safe and then delete them - creating new ones as we receive more
data. But that's always bounded in size.

Am I correct to understand that, with ikeep passed, new allocations
would just reuse space from the empty chunks on disk?


> Per-inode extent size hints might be another option to
> increase the size of allocations and perhaps reduce the number of them.
>

That's absolutely greatastic. Our files for that journal are all more
or less the same size. That's a great candidate for a hint.

> Brian

Thanks again, Brian

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 13:11         ` Brian Foster
@ 2015-12-01 13:58           ` Avi Kivity
  2015-12-01 14:01             ` Glauber Costa
  2015-12-01 14:56             ` Brian Foster
  0 siblings, 2 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 13:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs



On 12/01/2015 03:11 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
> ...
>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>> adjusted depending on the size of the overall volume (see
>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>> We'll experiment with this.  Surely it depends on more than the amount of
>> storage?  If you have a high op rate you'll be more likely to excite
>> contention, no?
>>
> Sure. The absolute optimal configuration for your workload probably
> depends on more than storage size, but mkfs doesn't have that
> information. In general, it tries to use the most reasonable
> configuration based on the storage and expected workload. If you want to
> tweak it beyond that, indeed, the best bet is to experiment with what
> works.

We will do that.

>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>> I believe it's a mix of modifications and I/O, though it looks like some
>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>> pushing case will trylock and defer to the next list iteration if the
>>> buffer is busy.
>>>
>> Ok.  For us sleeping in io_submit() is death because we have no other thread
>> on that core to take its place.
>>
> The above is with regard to metadata I/O, whereas io_submit() is
> obviously for user I/O.

Won't io_submit() also trigger metadata I/O?  Or is that all deferred to 
async tasks?  I don't mind them blocking each other as long as they let 
my io_submit alone.

>   io_submit() can probably block in a variety of
> places afaict... it might have to read in the inode extent map, allocate
> blocks, take inode/ag locks, reserve log space for transactions, etc.

Any chance of changing all that to be asynchronous?  Doesn't sound too 
hard, if somebody else has to do it.

>
> It sounds to me that first and foremost you want to make sure you don't
> have however many parallel operations you typically have running
> contending on the same inodes or AGs. Hint: creating files under
> separate subdirectories is a quick and easy way to allocate inodes under
> separate AGs (the agno is encoded into the upper bits of the inode
> number).

Unfortunately our directory layout cannot be changed.  And doesn't this 
require having agcount == O(number of active files)?  That is easily in 
the thousands.

>   Reducing the frequency of block allocation/frees might also be
> another help (e.g., preallocate and reuse files,

Isn't that discouraged for SSDs?

We can do that for a subset of our files.

We do use XFS_IOC_FSSETXATTR though.

> 'mount -o ikeep,'

Interesting.  Our files are large so we could try this.

> etc.). Beyond that, you probably want to make sure the log is large
> enough to support all concurrent operations. See the xfs_log_grant_*
> tracepoints for a window into if/how long transaction reservations might
> be waiting on the log.

I see that on a 400G fs, the log is 180MB.  Seems plenty large for 
write operations that are mostly large sequential, though I've no real 
feel for the numbers.  Will keep an eye on this.

Thanks for all the info.

> Brian
>
>> _______________________________________________
>> xfs mailing list
>> xfs@oss.sgi.com
>> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 13:58           ` Avi Kivity
@ 2015-12-01 14:01             ` Glauber Costa
  2015-12-01 14:37               ` Avi Kivity
  2015-12-01 20:45               ` Dave Chinner
  2015-12-01 14:56             ` Brian Foster
  1 sibling, 2 replies; 58+ messages in thread
From: Glauber Costa @ 2015-12-01 14:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, xfs

On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote:
>
>
> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>
>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>
>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>
>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>
>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>
>> ...
>>>>
>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>> adjusted depending on the size of the overall volume (see
>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>>>
>>> We'll experiment with this.  Surely it depends on more than the amount of
>>> storage?  If you have a high op rate you'll be more likely to excite
>>> contention, no?
>>>
>> Sure. The absolute optimal configuration for your workload probably
>> depends on more than storage size, but mkfs doesn't have that
>> information. In general, it tries to use the most reasonable
>> configuration based on the storage and expected workload. If you want to
>> tweak it beyond that, indeed, the best bet is to experiment with what
>> works.
>
>
> We will do that.
>
>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>
>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>> pushing case will trylock and defer to the next list iteration if the
>>>> buffer is busy.
>>>>
>>> Ok.  For us sleeping in io_submit() is death because we have no other
>>> thread
>>> on that core to take its place.
>>>
>> The above is with regard to metadata I/O, whereas io_submit() is
>> obviously for user I/O.
>
>
> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
> async tasks?  I don't mind them blocking each other as long as they let my
> io_submit alone.
>
>>   io_submit() can probably block in a variety of
>> places afaict... it might have to read in the inode extent map, allocate
>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>
>
> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> if somebody else has to do it.
>
>>
>> It sounds to me that first and foremost you want to make sure you don't
>> have however many parallel operations you typically have running
>> contending on the same inodes or AGs. Hint: creating files under
>> separate subdirectories is a quick and easy way to allocate inodes under
>> separate AGs (the agno is encoded into the upper bits of the inode
>> number).
>
>
> Unfortunately our directory layout cannot be changed.  And doesn't this
> require having agcount == O(number of active files)?  That is easily in the
> thousands.

Actually, wouldn't agcount == O(nr_cpus) be good enough?

>
>>   Reducing the frequency of block allocation/frees might also be
>> another help (e.g., preallocate and reuse files,
>
>
> Isn't that discouraged for SSDs?
>
> We can do that for a subset of our files.
>
> We do use XFS_IOC_FSSETXATTR though.
>
>> 'mount -o ikeep,'
>
>
> Interesting.  Our files are large so we could try this.
>
>> etc.). Beyond that, you probably want to make sure the log is large
>> enough to support all concurrent operations. See the xfs_log_grant_*
>> tracepoints for a window into if/how long transaction reservations might
>> be waiting on the log.
>
>
> I see that on an 400G fs, the log is 180MB.  Seems plenty large for write
> operations that are mostly large sequential, though I've no real feel for
> the numbers.  Will keep an eye on this.
>
> Thanks for all the info.
>
>
>> Brian
>>
>>> _______________________________________________
>>> xfs mailing list
>>> xfs@oss.sgi.com
>>> http://oss.sgi.com/mailman/listinfo/xfs
>
>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 13:39       ` Glauber Costa
@ 2015-12-01 14:02         ` Brian Foster
  0 siblings, 0 replies; 58+ messages in thread
From: Brian Foster @ 2015-12-01 14:02 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, xfs

On Tue, Dec 01, 2015 at 08:39:06AM -0500, Glauber Costa wrote:
> >
> > The truncate will free blocks and require block allocation on subsequent
> > writes. That might be something you could look into trying to avoid
> > (e.g., keeping files around and reusing space), but that depends on your
> > application design.
> 
> 
> This one is a bit hard. We have a journal-like structure for the
> modifications issued to the data store, which dominates most of our
> write workloads (including this one that I am discussing here). We
> could keep they around by renaming them outside of user visibility and
> then renaming them back, but that would mean that we are now using
> twice as much space. Perhaps we could use a pool that can at least
> guarantee one or two allocations from a pre-existing file. I am
> assuming here that renaming the file won't block. If it does, we are
> better off not doing so.
> 
> > Inodes chunks are allocated and freed dynamically by
> > default as well. The 'ikeep' mount option keeps inode chunks around
> > indefinitely (even if individual inodes are all freed) if you wanted to
> > avoid inode chunk reallocation and know you have a fairly stable working
> > set of inodes.
> 
> I believe we do have a fairly stable inode working set, even though
> that depends a bit on what's considered stable. For our journal-like
> structure, we will keep them around until we are sure the information
> is safe and them delete them - creating new ones as we receive more
> data. But that's always bounded in size.
> 
> Am I correct to understand that ikeep being passed, new allocations
> would just reuse space from the empty chunks on disk?
> 

Yes.. current behavior is that inodes are allocated and freed in chunks
of 64. When the entire chunk of inodes is freed from the namespace, the
chunk is freed (i.e., it is now free space). With ikeep, inode chunks
are never freed. When an individual inode allocation request is made,
the inode is allocated from one of the existing inode chunks before a
new chunk is allocated.

The tradeoff is that you could consume a significant amount of space
with inodes, free a bunch of them, and never get that space back as
free space. So that is something to be aware of for your use case,
particularly if the fs has other uses besides your journaling mechanism
described above, because it affects the entire fs.

> 
> > Per-inode extent size hints might be another option to
> > increase the size of allocations and perhaps reduce the number of them.
> >
> 
> That's absolutely greatastic. Our files for that journal are all more
> or less the same size. That's a great candidate for a hint.
> 

You could consider preallocation (fallocate()) as well if you know the
full size in advance.
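
For example (a sketch; the size constant is hypothetical and would be
whatever your application knows the final file size to be):

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define SEGMENT_SIZE (32 * 1024 * 1024)   /* example size only */

    /* Reserve all the blocks for a new file up front so later writes do
     * not have to allocate an extent at a time. */
    int prealloc_segment(int fd)
    {
        return fallocate(fd, 0, 0, SEGMENT_SIZE);
        /* posix_fallocate(fd, 0, SEGMENT_SIZE) is the portable variant */
    }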

Brian

> > Brian
> 
> Thanks again, Brian

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 14:01             ` Glauber Costa
@ 2015-12-01 14:37               ` Avi Kivity
  2015-12-01 20:45               ` Dave Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 14:37 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Brian Foster, xfs



On 12/01/2015 04:01 PM, Glauber Costa wrote:
> On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote:
>>
>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>> ...
>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>>> adjusted depending on the size of the overall volume (see
>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>>>> We'll experiment with this.  Surely it depends on more than the amount of
>>>> storage?  If you have a high op rate you'll be more likely to excite
>>>> contention, no?
>>>>
>>> Sure. The absolute optimal configuration for your workload probably
>>> depends on more than storage size, but mkfs doesn't have that
>>> information. In general, it tries to use the most reasonable
>>> configuration based on the storage and expected workload. If you want to
>>> tweak it beyond that, indeed, the best bet is to experiment with what
>>> works.
>>
>> We will do that.
>>
>>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>>> pushing case will trylock and defer to the next list iteration if the
>>>>> buffer is busy.
>>>>>
>>>> Ok.  For us sleeping in io_submit() is death because we have no other
>>>> thread
>>>> on that core to take its place.
>>>>
>>> The above is with regard to metadata I/O, whereas io_submit() is
>>> obviously for user I/O.
>>
>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>> async tasks?  I don't mind them blocking each other as long as they let my
>> io_submit alone.
>>
>>>    io_submit() can probably block in a variety of
>>> places afaict... it might have to read in the inode extent map, allocate
>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>
>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>> if somebody else has to do it.
>>
>>> It sounds to me that first and foremost you want to make sure you don't
>>> have however many parallel operations you typically have running
>>> contending on the same inodes or AGs. Hint: creating files under
>>> separate subdirectories is a quick and easy way to allocate inodes under
>>> separate AGs (the agno is encoded into the upper bits of the inode
>>> number).
>>
>> Unfortunately our directory layout cannot be changed.  And doesn't this
>> require having agcount == O(number of active files)?  That is easily in the
>> thousands.
> Actually, wouldn't agcount == O(nr_cpus) be good enough?

Depends on whether the locks are around I/O or cpu access only.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 13:58           ` Avi Kivity
  2015-12-01 14:01             ` Glauber Costa
@ 2015-12-01 14:56             ` Brian Foster
  2015-12-01 15:22               ` Avi Kivity
  1 sibling, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-12-01 14:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> 
> 
> On 12/01/2015 03:11 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
> >...
> >>>The agsize/agcount mkfs-time heuristics change depending on the type of
> >>>storage. A single AG can be up to 1TB and if the fs is not considered
> >>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
> >>>default up to 4TB. If a stripe unit is set, the agsize/agcount is
> >>>adjusted depending on the size of the overall volume (see
> >>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
> >>We'll experiment with this.  Surely it depends on more than the amount of
> >>storage?  If you have a high op rate you'll be more likely to excite
> >>contention, no?
> >>
> >Sure. The absolute optimal configuration for your workload probably
> >depends on more than storage size, but mkfs doesn't have that
> >information. In general, it tries to use the most reasonable
> >configuration based on the storage and expected workload. If you want to
> >tweak it beyond that, indeed, the best bet is to experiment with what
> >works.
> 
> We will do that.
> 
> >>>>Are those locks held around I/O, or just CPU operations, or a mix?
> >>>I believe it's a mix of modifications and I/O, though it looks like some
> >>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL
> >>>pushing case will trylock and defer to the next list iteration if the
> >>>buffer is busy.
> >>>
> >>Ok.  For us sleeping in io_submit() is death because we have no other thread
> >>on that core to take its place.
> >>
> >The above is with regard to metadata I/O, whereas io_submit() is
> >obviously for user I/O.
> 
> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
> async tasks?  I don't mind them blocking each other as long as they let my
> io_submit alone.
> 

Yeah, it can trigger metadata reads, force the log (the stale buffer
example) or push the AIL (wait on log space). Metadata changes made
directly via your I/O request are logged/committed via transactions,
which are generally processed asynchronously from that point on.

> >  io_submit() can probably block in a variety of
> >places afaict... it might have to read in the inode extent map, allocate
> >blocks, take inode/ag locks, reserve log space for transactions, etc.
> 
> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> if somebody else has to do it.
> 

I'm not following... if the fs needs to read in the inode extent map to
prepare for an allocation, what else can the thread do but wait? Are you
suggesting the request kick off whatever the blocking action happens to
be asynchronously and return with an error such that the request can be
retried later?

> >
> >It sounds to me that first and foremost you want to make sure you don't
> >have however many parallel operations you typically have running
> >contending on the same inodes or AGs. Hint: creating files under
> >separate subdirectories is a quick and easy way to allocate inodes under
> >separate AGs (the agno is encoded into the upper bits of the inode
> >number).
> 
> Unfortunately our directory layout cannot be changed.  And doesn't this
> require having agcount == O(number of active files)?  That is easily in the
> thousands.
> 

I think Glauber's O(nr_cpus) comment is probably the more likely
ballpark, but really it's something you'll probably just need to test to
see how far you need to go to avoid AG contention.

I'm primarily throwing the subdir thing out there for testing purposes.
It's just an easy way to create inodes in a bunch of separate AGs so you
can determine whether/how much it really helps with modified AG counts.
I don't know enough about your application design to really comment on
that...

> >  Reducing the frequency of block allocation/frees might also be
> >another help (e.g., preallocate and reuse files,
> 
> Isn't that discouraged for SSDs?
> 

Perhaps, if you're referring to the fact that the blocks are never freed
and thus never discarded..? Are you running fstrim?

If so, it would certainly impact that by holding blocks as allocated to
inodes as opposed to putting them in free space trees where they can be
discarded. If not, I don't see how it would make a difference, but
perhaps I misunderstand the point. That said, there's probably others on
the list who can more definitively discuss SSD characteristics than I...

> We can do that for a subset of our files.
> 
> We do use XFS_IOC_FSSETXATTR though.
> 
> >'mount -o ikeep,'
> 
> Interesting.  Our files are large so we could try this.
> 

Just to be clear... this behavior change is more directly associated
with file count than file size (though indirectly larger files might
mean you have less of them, if that's your point).

To generalize a bit, I'd be more wary of using this option if your
filesystem can be used in an unstructured manner in any way. For
example, if the file count can balloon up and back down temporarily,
that's going to allocate a bunch of metadata space for inodes that won't
ever be reclaimed or reused for anything other than inodes.

> >etc.). Beyond that, you probably want to make sure the log is large
> >enough to support all concurrent operations. See the xfs_log_grant_*
> >tracepoints for a window into if/how long transaction reservations might
> >be waiting on the log.
> 
> I see that on an 400G fs, the log is 180MB.  Seems plenty large for write
> operations that are mostly large sequential, though I've no real feel for
> the numbers.  Will keep an eye on this.
> 

FWIW, XFS on recent kernels has grown some sysfs entries that might help
give an idea of log reservation state at runtime. See the entries under
/sys/fs/xfs/<dev>/log for details.

Brian

> Thanks for all the info.
> 
> >Brian
> >
> >>_______________________________________________
> >>xfs mailing list
> >>xfs@oss.sgi.com
> >>http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 14:56             ` Brian Foster
@ 2015-12-01 15:22               ` Avi Kivity
  2015-12-01 16:01                 ` Brian Foster
  2015-12-01 21:04                 ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 15:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs



On 12/01/2015 04:56 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>
>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>> ...
>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>>> adjusted depending on the size of the overall volume (see
>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>>>> We'll experiment with this.  Surely it depends on more than the amount of
>>>> storage?  If you have a high op rate you'll be more likely to excite
>>>> contention, no?
>>>>
>>> Sure. The absolute optimal configuration for your workload probably
>>> depends on more than storage size, but mkfs doesn't have that
>>> information. In general, it tries to use the most reasonable
>>> configuration based on the storage and expected workload. If you want to
>>> tweak it beyond that, indeed, the best bet is to experiment with what
>>> works.
>> We will do that.
>>
>>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>>> pushing case will trylock and defer to the next list iteration if the
>>>>> buffer is busy.
>>>>>
>>>> Ok.  For us sleeping in io_submit() is death because we have no other thread
>>>> on that core to take its place.
>>>>
>>> The above is with regard to metadata I/O, whereas io_submit() is
>>> obviously for user I/O.
>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>> async tasks?  I don't mind them blocking each other as long as they let my
>> io_submit alone.
>>
> Yeah, it can trigger metadata reads, force the log (the stale buffer
> example) or push the AIL (wait on log space). Metadata changes made
> directly via your I/O request are logged/committed via transactions,
> which are generally processed asynchronously from that point on.
>
>>>   io_submit() can probably block in a variety of
>>> places afaict... it might have to read in the inode extent map, allocate
>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>> if somebody else has to do it.
>>
> I'm not following... if the fs needs to read in the inode extent map to
> prepare for an allocation, what else can the thread do but wait? Are you
> suggesting the request kick off whatever the blocking action happens to
> be asynchronously and return with an error such that the request can be
> retried later?

Not quite, it should be invisible to the caller.

That is, the code called by io_submit() (file_operations::write_iter, as 
it seems to be called today) can kick off this operation and have it 
continue from where it left off.

Seastar (the async user framework which we use to drive xfs) makes 
writing code like this easy, using continuations; but of course from 
ordinary threaded code it can be quite hard.

btw, there was an attempt to make ext[34] async using this method, but I 
think it was ripped out.  Yes, the mortal remains can still be seen with 
'git grep EIOCBQUEUED'.

>
>>> It sounds to me that first and foremost you want to make sure you don't
>>> have however many parallel operations you typically have running
>>> contending on the same inodes or AGs. Hint: creating files under
>>> separate subdirectories is a quick and easy way to allocate inodes under
>>> separate AGs (the agno is encoded into the upper bits of the inode
>>> number).
>> Unfortunately our directory layout cannot be changed.  And doesn't this
>> require having agcount == O(number of active files)?  That is easily in the
>> thousands.
>>
> I think Glauber's O(nr_cpus) comment is probably the more likely
> ballpark, but really it's something you'll probably just need to test to
> see how far you need to go to avoid AG contention.
>
> I'm primarily throwing the subdir thing out there for testing purposes.
> It's just an easy way to create inodes in a bunch of separate AGs so you
> can determine whether/how much it really helps with modified AG counts.
> I don't know enough about your application design to really comment on
> that...

We have O(cpus) shards that operate independently.  Each shard writes 
32MB commitlog files (that are pre-truncated to 32MB to allow concurrent 
writes without blocking); the files are then flushed and closed, and 
later removed.  In parallel there are sequential writes and reads of 
large files (using 128kB buffers), as well as random reads.  Files are 
immutable (append-only), and if a file is being written, it is not 
concurrently read.  In general files are not shared across shards.  All 
I/O is async and O_DIRECT.  open(), truncate(), fdatasync(), and friends 
are called from a helper thread.

As far as I can tell it should be a very friendly load for XFS and SSDs.
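
Roughly, the per-segment write path looks like the sketch below
(simplified; error handling and completion reaping omitted, and the
libaio calls are what our io_submit() usage boils down to):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <unistd.h>

    enum { SEGMENT_SIZE = 32 << 20, BUF_SIZE = 128 << 10 };

    /* ctx comes from io_setup(); writes one chunk of a commitlog segment. */
    void write_chunk(io_context_t ctx, const char *path, long long offset)
    {
        struct iocb cb, *cbs[1] = { &cb };
        void *buf;
        int fd;

        fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
        ftruncate(fd, SEGMENT_SIZE);          /* pre-truncate to 32MB */

        posix_memalign(&buf, 4096, BUF_SIZE); /* O_DIRECT needs alignment */

        io_prep_pwrite(&cb, fd, buf, BUF_SIZE, offset);
        io_submit(ctx, 1, cbs);               /* this must not sleep */
        /* completion is reaped later via io_getevents(); fdatasync() and
         * close() happen on the helper thread, as described above */
    }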

>
>>>   Reducing the frequency of block allocation/frees might also be
>>> another help (e.g., preallocate and reuse files,
>> Isn't that discouraged for SSDs?
>>
> Perhaps, if you're referring to the fact that the blocks are never freed
> and thus never discarded..? Are you running fstrim?

mount -o discard.  And yes, overwrites are supposedly more expensive 
than trim old data + allocate new data, but if you compare that with 
the work XFS has to do, perhaps the tradeoff is bad.


>
> If so, it would certainly impact that by holding blocks as allocated to
> inodes as opposed to putting them in free space trees where they can be
> discarded. If not, I don't see how it would make a difference, but
> perhaps I misunderstand the point. That said, there's probably others on
> the list who can more definitively discuss SSD characteristics than I...



>
>> We can do that for a subset of our files.
>>
>> We do use XFS_IOC_FSSETXATTR though.
>>
>>> 'mount -o ikeep,'
>> Interesting.  Our files are large so we could try this.
>>
> Just to be clear... this behavior change is more directly associated
> with file count than file size (though indirectly larger files might
> mean you have less of them, if that's your point).

Yes, that's what I meant, and especially that if a lot of files are 
removed we'd be losing the inode space allocated to them.

>
> To generalize a bit, I'd be more weary of using this option if your
> filesystem can be used in an unstructured manner in any way. For
> example, if the file count can balloon up and back down temporarily,
> that's going to allocate a bunch of metadata space for inodes that won't
> ever be reclaimed or reused for anything other than inodes.

Exactly.  File count can balloon, but files will be large, so even the 
worst case waste is very limited.

>
>>> etc.). Beyond that, you probably want to make sure the log is large
>>> enough to support all concurrent operations. See the xfs_log_grant_*
>>> tracepoints for a window into if/how long transaction reservations might
>>> be waiting on the log.
>> I see that on an 400G fs, the log is 180MB.  Seems plenty large for write
>> operations that are mostly large sequential, though I've no real feel for
>> the numbers.  Will keep an eye on this.
>>
> FWIW, XFS on recent kernels has grown some sysfs entries that might help
> give an idea of log reservation state at runtime. See the entries under
> /sys/fs/xfs/<dev>/log for details.

Great.  We will study those with great interest.

>
> Brian
>
>> Thanks for all the info.
>>
>>> Brian
>>>
>>>> _______________________________________________
>>>> xfs mailing list
>>>> xfs@oss.sgi.com
>>>> http://oss.sgi.com/mailman/listinfo/xfs


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 15:22               ` Avi Kivity
@ 2015-12-01 16:01                 ` Brian Foster
  2015-12-01 16:08                   ` Avi Kivity
  2015-12-01 21:04                 ` Dave Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-12-01 16:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> 
> 
> On 12/01/2015 04:56 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
> >>>...
> >>>>>The agsize/agcount mkfs-time heuristics change depending on the type of
> >>>>>storage. A single AG can be up to 1TB and if the fs is not considered
> >>>>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
> >>>>>default up to 4TB. If a stripe unit is set, the agsize/agcount is
> >>>>>adjusted depending on the size of the overall volume (see
> >>>>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
> >>>>We'll experiment with this.  Surely it depends on more than the amount of
> >>>>storage?  If you have a high op rate you'll be more likely to excite
> >>>>contention, no?
> >>>>
> >>>Sure. The absolute optimal configuration for your workload probably
> >>>depends on more than storage size, but mkfs doesn't have that
> >>>information. In general, it tries to use the most reasonable
> >>>configuration based on the storage and expected workload. If you want to
> >>>tweak it beyond that, indeed, the best bet is to experiment with what
> >>>works.
> >>We will do that.
> >>
> >>>>>>Are those locks held around I/O, or just CPU operations, or a mix?
> >>>>>I believe it's a mix of modifications and I/O, though it looks like some
> >>>>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL
> >>>>>pushing case will trylock and defer to the next list iteration if the
> >>>>>buffer is busy.
> >>>>>
> >>>>Ok.  For us sleeping in io_submit() is death because we have no other thread
> >>>>on that core to take its place.
> >>>>
> >>>The above is with regard to metadata I/O, whereas io_submit() is
> >>>obviously for user I/O.
> >>Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
> >>async tasks?  I don't mind them blocking each other as long as they let my
> >>io_submit alone.
> >>
> >Yeah, it can trigger metadata reads, force the log (the stale buffer
> >example) or push the AIL (wait on log space). Metadata changes made
> >directly via your I/O request are logged/committed via transactions,
> >which are generally processed asynchronously from that point on.
> >
> >>>  io_submit() can probably block in a variety of
> >>>places afaict... it might have to read in the inode extent map, allocate
> >>>blocks, take inode/ag locks, reserve log space for transactions, etc.
> >>Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> >>if somebody else has to do it.
> >>
> >I'm not following... if the fs needs to read in the inode extent map to
> >prepare for an allocation, what else can the thread do but wait? Are you
> >suggesting the request kick off whatever the blocking action happens to
> >be asynchronously and return with an error such that the request can be
> >retried later?
> 
> Not quite, it should be invisible to the caller.
> 
> That is, the code called by io_submit() (file_operations::write_iter, it
> seems to be called today) can kick off this operation and have it continue
> from where it left off.
> 

Isn't that generally what happens today? We submit an I/O which is
asynchronous in nature and wait on a completion, which causes the cpu to
schedule and execute another task until the completion is set by I/O
completion (via an async callback). At that point, the issuing thread
continues where it left off. I suspect I'm missing something... can you
elaborate on what you'd do differently here (and how it helps)?

> Seastar (the async user framework which we use to drive xfs) makes writing
> code like this easy, using continuations; but of course from ordinary
> threaded code it can be quite hard.
> 
> btw, there was an attempt to make ext[34] async using this method, but I
> think it was ripped out.  Yes, the mortal remains can still be seen with
> 'git grep EIOCBQUEUED'.
> 
> >
> >>>It sounds to me that first and foremost you want to make sure you don't
> >>>have however many parallel operations you typically have running
> >>>contending on the same inodes or AGs. Hint: creating files under
> >>>separate subdirectories is a quick and easy way to allocate inodes under
> >>>separate AGs (the agno is encoded into the upper bits of the inode
> >>>number).
> >>Unfortunately our directory layout cannot be changed.  And doesn't this
> >>require having agcount == O(number of active files)?  That is easily in the
> >>thousands.
> >>
> >I think Glauber's O(nr_cpus) comment is probably the more likely
> >ballpark, but really it's something you'll probably just need to test to
> >see how far you need to go to avoid AG contention.
> >
> >I'm primarily throwing the subdir thing out there for testing purposes.
> >It's just an easy way to create inodes in a bunch of separate AGs so you
> >can determine whether/how much it really helps with modified AG counts.
> >I don't know enough about your application design to really comment on
> >that...
> 
> We have O(cpus) shards that operate independently.  Each shard writes 32MB
> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> without blocking); the files are then flushed and closed, and later removed.
> In parallel there are sequential writes and reads of large files using 128kB
> buffers), as well as random reads.  Files are immutable (append-only), and
> if a file is being written, it is not concurrently read.  In general files
> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
> truncate(), fdatasync(), and friends are called from a helper thread.
> 
> As far as I can tell it should a very friendly load for XFS and SSDs.
> 
> >
> >>>  Reducing the frequency of block allocation/frees might also be
> >>>another help (e.g., preallocate and reuse files,
> >>Isn't that discouraged for SSDs?
> >>
> >Perhaps, if you're referring to the fact that the blocks are never freed
> >and thus never discarded..? Are you running fstrim?
> 
> mount -o discard.  And yes, overwrites are supposedly more expensive than
> trim old data + allocate new data, but maybe if you compare it with the work
> XFS has to do, perhaps the tradeoff is bad.
> 

Ok, my understanding is that '-o discard' is not recommended in favor of
periodic fstrim for performance reasons, but that may or may not still
be the case.

Brian

> 
> >
> >If so, it would certainly impact that by holding blocks as allocated to
> >inodes as opposed to putting them in free space trees where they can be
> >discarded. If not, I don't see how it would make a difference, but
> >perhaps I misunderstand the point. That said, there's probably others on
> >the list who can more definitively discuss SSD characteristics than I...
> 
> 
> 
> >
> >>We can do that for a subset of our files.
> >>
> >>We do use XFS_IOC_FSSETXATTR though.
> >>
> >>>'mount -o ikeep,'
> >>Interesting.  Our files are large so we could try this.
> >>
> >Just to be clear... this behavior change is more directly associated
> >with file count than file size (though indirectly larger files might
> >mean you have less of them, if that's your point).
> 
> Yes, that's what I meant, and especially that if a lot of files are removed
> we'd be losing the inode space allocated to them.
> 
> >
> >To generalize a bit, I'd be more weary of using this option if your
> >filesystem can be used in an unstructured manner in any way. For
> >example, if the file count can balloon up and back down temporarily,
> >that's going to allocate a bunch of metadata space for inodes that won't
> >ever be reclaimed or reused for anything other than inodes.
> 
> Exactly.  File count can balloon, but files will be large, so even the worst
> case waste is very limited.
> 
> >
> >>>etc.). Beyond that, you probably want to make sure the log is large
> >>>enough to support all concurrent operations. See the xfs_log_grant_*
> >>>tracepoints for a window into if/how long transaction reservations might
> >>>be waiting on the log.
> >>I see that on an 400G fs, the log is 180MB.  Seems plenty large for write
> >>operations that are mostly large sequential, though I've no real feel for
> >>the numbers.  Will keep an eye on this.
> >>
> >FWIW, XFS on recent kernels has grown some sysfs entries that might help
> >give an idea of log reservation state at runtime. See the entries under
> >/sys/fs/xfs/<dev>/log for details.
> 
> Great.  We will study those with great interest.
> 
> >
> >Brian
> >
> >>Thanks for all the info.
> >>
> >>>Brian
> >>>
> >>>>_______________________________________________
> >>>>xfs mailing list
> >>>>xfs@oss.sgi.com
> >>>>http://oss.sgi.com/mailman/listinfo/xfs
> 
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 16:01                 ` Brian Foster
@ 2015-12-01 16:08                   ` Avi Kivity
  2015-12-01 16:29                     ` Brian Foster
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 16:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs



On 12/01/2015 06:01 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>
>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>>>> ...
>>>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>>>>> adjusted depending on the size of the overall volume (see
>>>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>>>>>> We'll experiment with this.  Surely it depends on more than the amount of
>>>>>> storage?  If you have a high op rate you'll be more likely to excite
>>>>>> contention, no?
>>>>>>
>>>>> Sure. The absolute optimal configuration for your workload probably
>>>>> depends on more than storage size, but mkfs doesn't have that
>>>>> information. In general, it tries to use the most reasonable
>>>>> configuration based on the storage and expected workload. If you want to
>>>>> tweak it beyond that, indeed, the best bet is to experiment with what
>>>>> works.
>>>> We will do that.
>>>>
>>>>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>>>>> pushing case will trylock and defer to the next list iteration if the
>>>>>>> buffer is busy.
>>>>>>>
>>>>>> Ok.  For us sleeping in io_submit() is death because we have no other thread
>>>>>> on that core to take its place.
>>>>>>
>>>>> The above is with regard to metadata I/O, whereas io_submit() is
>>>>> obviously for user I/O.
>>>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>>>> async tasks?  I don't mind them blocking each other as long as they let my
>>>> io_submit alone.
>>>>
>>> Yeah, it can trigger metadata reads, force the log (the stale buffer
>>> example) or push the AIL (wait on log space). Metadata changes made
>>> directly via your I/O request are logged/committed via transactions,
>>> which are generally processed asynchronously from that point on.
>>>
>>>>>   io_submit() can probably block in a variety of
>>>>> places afaict... it might have to read in the inode extent map, allocate
>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>>>> if somebody else has to do it.
>>>>
>>> I'm not following... if the fs needs to read in the inode extent map to
>>> prepare for an allocation, what else can the thread do but wait? Are you
>>> suggesting the request kick off whatever the blocking action happens to
>>> be asynchronously and return with an error such that the request can be
>>> retried later?
>> Not quite, it should be invisible to the caller.
>>
>> That is, the code called by io_submit() (file_operations::write_iter, it
>> seems to be called today) can kick off this operation and have it continue
>> from where it left off.
>>
> Isn't that generally what happens today?

You tell me.  According to $subject, apparently not enough.  Maybe we're 
triggering it more often, or we suffer more when it does trigger (the 
latter probably more likely).

>   We submit an I/O which is
> asynchronous in nature and wait on a completion, which causes the cpu to
> schedule and execute another task until the completion is set by I/O
> completion (via an async callback). At that point, the issuing thread
> continues where it left off. I suspect I'm missing something... can you
> elaborate on what you'd do differently here (and how it helps)?

Just apply the same technique everywhere: convert locks to trylock + 
schedule a continuation on failure.
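
In userspace terms the pattern is roughly the toy sketch below (an
illustration of the idea only, not of the XFS code paths involved):

    #include <pthread.h>
    #include <stdbool.h>

    typedef void (*continuation_fn)(void *arg);

    /* Try to run fn() under the lock without ever sleeping. Returns true
     * if it ran; returns false if the lock was busy, in which case the
     * caller queues (fn, arg) as a continuation and retries later. */
    bool try_run_locked(pthread_mutex_t *lock, continuation_fn fn, void *arg)
    {
        if (pthread_mutex_trylock(lock) != 0)
            return false;              /* contended: defer, do not block */
        fn(arg);
        pthread_mutex_unlock(lock);
        return true;
    }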

>
>> Seastar (the async user framework which we use to drive xfs) makes writing
>> code like this easy, using continuations; but of course from ordinary
>> threaded code it can be quite hard.
>>
>> btw, there was an attempt to make ext[34] async using this method, but I
>> think it was ripped out.  Yes, the mortal remains can still be seen with
>> 'git grep EIOCBQUEUED'.
>>
>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>> have however many parallel operations you typically have running
>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>> number).
>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>> require having agcount == O(number of active files)?  That is easily in the
>>>> thousands.
>>>>
>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>> ballpark, but really it's something you'll probably just need to test to
>>> see how far you need to go to avoid AG contention.
>>>
>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>> can determine whether/how much it really helps with modified AG counts.
>>> I don't know enough about your application design to really comment on
>>> that...
>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>> without blocking); the files are then flushed and closed, and later removed.
>> In parallel there are sequential writes and reads of large files using 128kB
>> buffers), as well as random reads.  Files are immutable (append-only), and
>> if a file is being written, it is not concurrently read.  In general files
>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>> truncate(), fdatasync(), and friends are called from a helper thread.
>>
>> As far as I can tell it should a very friendly load for XFS and SSDs.
>>
>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>> another help (e.g., preallocate and reuse files,
>>>> Isn't that discouraged for SSDs?
>>>>
>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>> and thus never discarded..? Are you running fstrim?
>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>> trim old data + allocate new data, but maybe if you compare it with the work
>> XFS has to do, perhaps the tradeoff is bad.
>>
> Ok, my understanding is that '-o discard' is not recommended in favor of
> periodic fstrim for performance reasons, but that may or may not still
> be the case.

I understand that most SSDs have queued trim these days, but maybe I'm 
optimistic.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 16:08                   ` Avi Kivity
@ 2015-12-01 16:29                     ` Brian Foster
  2015-12-01 17:09                       ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-12-01 16:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
> 
> 
> On 12/01/2015 06:01 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 04:56 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
> >>>>>...
> >>>>>>>The agsize/agcount mkfs-time heuristics change depending on the type of
> >>>>>>>storage. A single AG can be up to 1TB and if the fs is not considered
> >>>>>>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
> >>>>>>>default up to 4TB. If a stripe unit is set, the agsize/agcount is
> >>>>>>>adjusted depending on the size of the overall volume (see
> >>>>>>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
> >>>>>>We'll experiment with this.  Surely it depends on more than the amount of
> >>>>>>storage?  If you have a high op rate you'll be more likely to excite
> >>>>>>contention, no?
> >>>>>>
> >>>>>Sure. The absolute optimal configuration for your workload probably
> >>>>>depends on more than storage size, but mkfs doesn't have that
> >>>>>information. In general, it tries to use the most reasonable
> >>>>>configuration based on the storage and expected workload. If you want to
> >>>>>tweak it beyond that, indeed, the best bet is to experiment with what
> >>>>>works.
> >>>>We will do that.
> >>>>
> >>>>>>>>Are those locks held around I/O, or just CPU operations, or a mix?
> >>>>>>>I believe it's a mix of modifications and I/O, though it looks like some
> >>>>>>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL
> >>>>>>>pushing case will trylock and defer to the next list iteration if the
> >>>>>>>buffer is busy.
> >>>>>>>
> >>>>>>Ok.  For us sleeping in io_submit() is death because we have no other thread
> >>>>>>on that core to take its place.
> >>>>>>
> >>>>>The above is with regard to metadata I/O, whereas io_submit() is
> >>>>>obviously for user I/O.
> >>>>Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
> >>>>async tasks?  I don't mind them blocking each other as long as they let my
> >>>>io_submit alone.
> >>>>
> >>>Yeah, it can trigger metadata reads, force the log (the stale buffer
> >>>example) or push the AIL (wait on log space). Metadata changes made
> >>>directly via your I/O request are logged/committed via transactions,
> >>>which are generally processed asynchronously from that point on.
> >>>
> >>>>>  io_submit() can probably block in a variety of
> >>>>>places afaict... it might have to read in the inode extent map, allocate
> >>>>>blocks, take inode/ag locks, reserve log space for transactions, etc.
> >>>>Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> >>>>if somebody else has to do it.
> >>>>
> >>>I'm not following... if the fs needs to read in the inode extent map to
> >>>prepare for an allocation, what else can the thread do but wait? Are you
> >>>suggesting the request kick off whatever the blocking action happens to
> >>>be asynchronously and return with an error such that the request can be
> >>>retried later?
> >>Not quite, it should be invisible to the caller.
> >>
> >>That is, the code called by io_submit() (file_operations::write_iter, it
> >>seems to be called today) can kick off this operation and have it continue
> >>from where it left off.
> >>
> >Isn't that generally what happens today?
> 
> You tell me.  According to $subject, apparently not enough.  Maybe we're
> triggering it more often, or we suffer more when it does trigger (the latter
> probably more likely).
> 

The original mail describes looking at the sched:sched_switch tracepoint
which, on a quick look, appears to fire whenever a cpu context switch
occurs. This likely triggers any time we wait on an I/O or a contended
lock (among other situations I'm sure), and it signifies that something
else is going to execute in our place until this thread can make
progress.

> >  We submit an I/O which is
> >asynchronous in nature and wait on a completion, which causes the cpu to
> >schedule and execute another task until the completion is set by I/O
> >completion (via an async callback). At that point, the issuing thread
> >continues where it left off. I suspect I'm missing something... can you
> >elaborate on what you'd do differently here (and how it helps)?
> 
> Just apply the same technique everywhere: convert locks to trylock +
> schedule a continuation on failure.
> 

I'm certainly not an expert on the kernel scheduling, locking and
serialization mechanisms, but my understanding is that most things
outside of spin locks are reschedule points. For example, the
wait_for_completion() calls XFS uses to wait on I/O boil down to
schedule_timeout() calls. Buffer locks are implemented as semaphores and
down() can end up in the same place.
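
To sketch the shape of that last point (not the actual fs/xfs/xfs_buf.c
code, just the structure): the buffer lock is a semaphore, so the blocking
and non-blocking variants differ only in down() vs. down_trylock(), and
down() is a reschedule point just like waiting on an I/O completion.

#include <linux/semaphore.h>

struct buf { struct semaphore sema; };

static void buf_lock(struct buf *b)     /* may sleep, like xfs_buf_lock() */
{
        down(&b->sema);
}

static int buf_trylock(struct buf *b)   /* never sleeps, like the AIL push path */
{
        return down_trylock(&b->sema) == 0;     /* 0 from down_trylock() == acquired */
}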

Brian

> >
> >>Seastar (the async user framework which we use to drive xfs) makes writing
> >>code like this easy, using continuations; but of course from ordinary
> >>threaded code it can be quite hard.
> >>
> >>btw, there was an attempt to make ext[34] async using this method, but I
> >>think it was ripped out.  Yes, the mortal remains can still be seen with
> >>'git grep EIOCBQUEUED'.
> >>
> >>>>>It sounds to me that first and foremost you want to make sure you don't
> >>>>>have however many parallel operations you typically have running
> >>>>>contending on the same inodes or AGs. Hint: creating files under
> >>>>>separate subdirectories is a quick and easy way to allocate inodes under
> >>>>>separate AGs (the agno is encoded into the upper bits of the inode
> >>>>>number).
> >>>>Unfortunately our directory layout cannot be changed.  And doesn't this
> >>>>require having agcount == O(number of active files)?  That is easily in the
> >>>>thousands.
> >>>>
> >>>I think Glauber's O(nr_cpus) comment is probably the more likely
> >>>ballpark, but really it's something you'll probably just need to test to
> >>>see how far you need to go to avoid AG contention.
> >>>
> >>>I'm primarily throwing the subdir thing out there for testing purposes.
> >>>It's just an easy way to create inodes in a bunch of separate AGs so you
> >>>can determine whether/how much it really helps with modified AG counts.
> >>>I don't know enough about your application design to really comment on
> >>>that...
> >>We have O(cpus) shards that operate independently.  Each shard writes 32MB
> >>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> >>without blocking); the files are then flushed and closed, and later removed.
> >>In parallel there are sequential writes and reads of large files using 128kB
> >>buffers), as well as random reads.  Files are immutable (append-only), and
> >>if a file is being written, it is not concurrently read.  In general files
> >>are not shared across shards.  All I/O is async and O_DIRECT.  open(),
> >>truncate(), fdatasync(), and friends are called from a helper thread.
> >>
> >>As far as I can tell it should a very friendly load for XFS and SSDs.
> >>
> >>>>>  Reducing the frequency of block allocation/frees might also be
> >>>>>another help (e.g., preallocate and reuse files,
> >>>>Isn't that discouraged for SSDs?
> >>>>
> >>>Perhaps, if you're referring to the fact that the blocks are never freed
> >>>and thus never discarded..? Are you running fstrim?
> >>mount -o discard.  And yes, overwrites are supposedly more expensive than
> >>trim old data + allocate new data, but maybe if you compare it with the work
> >>XFS has to do, perhaps the tradeoff is bad.
> >>
> >Ok, my understanding is that '-o discard' is not recommended in favor of
> >periodic fstrim for performance reasons, but that may or may not still
> >be the case.
> 
> I understand that most SSDs have queued trim these days, but maybe I'm
> optimistic.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 16:29                     ` Brian Foster
@ 2015-12-01 17:09                       ` Avi Kivity
  2015-12-01 18:03                         ` Carlos Maiolino
  2015-12-01 18:51                         ` Brian Foster
  0 siblings, 2 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 17:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs



On 12/01/2015 06:29 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
>>
>> On 12/01/2015 06:01 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>>>>>> ...
>>>>>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>>>>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>>>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>>>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>>>>>>> adjusted depending on the size of the overall volume (see
>>>>>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
>>>>>>>> We'll experiment with this.  Surely it depends on more than the amount of
>>>>>>>> storage?  If you have a high op rate you'll be more likely to excite
>>>>>>>> contention, no?
>>>>>>>>
>>>>>>> Sure. The absolute optimal configuration for your workload probably
>>>>>>> depends on more than storage size, but mkfs doesn't have that
>>>>>>> information. In general, it tries to use the most reasonable
>>>>>>> configuration based on the storage and expected workload. If you want to
>>>>>>> tweak it beyond that, indeed, the best bet is to experiment with what
>>>>>>> works.
>>>>>> We will do that.
>>>>>>
>>>>>>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>>>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>>>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>>>>>>> pushing case will trylock and defer to the next list iteration if the
>>>>>>>>> buffer is busy.
>>>>>>>>>
>>>>>>>> Ok.  For us sleeping in io_submit() is death because we have no other thread
>>>>>>>> on that core to take its place.
>>>>>>>>
>>>>>>> The above is with regard to metadata I/O, whereas io_submit() is
>>>>>>> obviously for user I/O.
>>>>>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>>>>>> async tasks?  I don't mind them blocking each other as long as they let my
>>>>>> io_submit alone.
>>>>>>
>>>>> Yeah, it can trigger metadata reads, force the log (the stale buffer
>>>>> example) or push the AIL (wait on log space). Metadata changes made
>>>>> directly via your I/O request are logged/committed via transactions,
>>>>> which are generally processed asynchronously from that point on.
>>>>>
>>>>>>>   io_submit() can probably block in a variety of
>>>>>>> places afaict... it might have to read in the inode extent map, allocate
>>>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>>>>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>>>>>> if somebody else has to do it.
>>>>>>
>>>>> I'm not following... if the fs needs to read in the inode extent map to
>>>>> prepare for an allocation, what else can the thread do but wait? Are you
>>>>> suggesting the request kick off whatever the blocking action happens to
>>>>> be asynchronously and return with an error such that the request can be
>>>>> retried later?
>>>> Not quite, it should be invisible to the caller.
>>>>
>>>> That is, the code called by io_submit() (file_operations::write_iter, it
>>>> seems to be called today) can kick off this operation and have it continue
> >>>> from where it left off.
>>> Isn't that generally what happens today?
>> You tell me.  According to $subject, apparently not enough.  Maybe we're
>> triggering it more often, or we suffer more when it does trigger (the latter
>> probably more likely).
>>
> The original mail describes looking at the sched:sched_switch tracepoint
> which on a quick look, appears to fire whenever a cpu context switch
> occurs. This likely triggers any time we wait on an I/O or a contended
> lock (among other situations I'm sure), and it signifies that something
> else is going to execute in our place until this thread can make
> progress.

For us, nothing else can execute in our place; we usually have exactly 
one thread per logical core.  So we are heavily dependent on io_submit 
not sleeping.

The case of a contended lock is, to me, less worrying.  It can be 
reduced by using more allocation groups, which is apparently the shared 
resource under contention.

The case of waiting for I/O is much more worrying, because I/O latencies 
are much higher.  But it seems like most of the DIO path does not 
trigger locking around I/O (and we are careful to avoid the ones that 
do, like writing beyond eof).

(sorry for repeating myself, I have the feeling we are talking past each 
other and want to be on the same page)
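
For concreteness, this is roughly how we size the commitlog up front so
that the async writes never extend EOF (a minimal sketch, assuming the
pre-truncation described earlier in the thread; the truncate itself is the
kind of call we push to the helper thread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int open_commitlog(const char *path)
{
        int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);

        if (fd < 0)
                return -1;
        if (ftruncate(fd, 32 << 20) < 0) {      /* pre-size to 32MB */
                close(fd);
                return -1;
        }
        return fd;      /* io_submit writes then stay below the 32MB EOF */
}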

>
>>>   We submit an I/O which is
>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>> schedule and execute another task until the completion is set by I/O
>>> completion (via an async callback). At that point, the issuing thread
>>> continues where it left off. I suspect I'm missing something... can you
>>> elaborate on what you'd do differently here (and how it helps)?
>> Just apply the same technique everywhere: convert locks to trylock +
>> schedule a continuation on failure.
>>
> I'm certainly not an expert on the kernel scheduling, locking and
> serialization mechanisms, but my understanding is that most things
> outside of spin locks are reschedule points. For example, the
> wait_for_completion() calls XFS uses to wait on I/O boil down to
> schedule_timeout() calls. Buffer locks are implemented as semaphores and
> down() can end up in the same place.

But, for the most part, XFS seems to be able to avoid sleeping.  The 
call to __blockdev_direct_IO only launches the I/O, so any locking is 
only around cpu operations and, unless there is contention, won't cause 
us to sleep in io_submit().

Trying to follow the code, it looks like xfs_get_blocks_direct (and 
__blockdev_direct_IO's get_block parameter in general) is synchronous, 
so we're just lucky to have everything in cache.  If it isn't, we block 
right there.  I really hope I'm misreading this and some other magic is 
happening elsewhere instead of this.
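
For reference, the interface I'm looking at is (from include/linux/fs.h,
if I'm reading the right place):

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create);

The callback has to come back with bh_result mapped, so if the extent data
isn't already in memory it seemingly has nowhere to go but a synchronous
metadata read before returning.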

> Brian
>
>>>> Seastar (the async user framework which we use to drive xfs) makes writing
>>>> code like this easy, using continuations; but of course from ordinary
>>>> threaded code it can be quite hard.
>>>>
>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>> think it was ripped out.  Yes, the mortal remains can still be seen with
>>>> 'git grep EIOCBQUEUED'.
>>>>
>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>> have however many parallel operations you typically have running
>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>> number).
>>>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>>>> require having agcount == O(number of active files)?  That is easily in the
>>>>>> thousands.
>>>>>>
>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>> see how far you need to go to avoid AG contention.
>>>>>
>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>> I don't know enough about your application design to really comment on
>>>>> that...
>>>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>> without blocking); the files are then flushed and closed, and later removed.
>>>> In parallel there are sequential writes and reads of large files using 128kB
>>>> buffers), as well as random reads.  Files are immutable (append-only), and
>>>> if a file is being written, it is not concurrently read.  In general files
>>>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>
>>>> As far as I can tell it should a very friendly load for XFS and SSDs.
>>>>
>>>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>>>> another help (e.g., preallocate and reuse files,
>>>>>> Isn't that discouraged for SSDs?
>>>>>>
>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>> and thus never discarded..? Are you running fstrim?
>>>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>>>> trim old data + allocate new data, but maybe if you compare it with the work
>>>> XFS has to do, perhaps the tradeoff is bad.
>>>>
>>> Ok, my understanding is that '-o discard' is not recommended in favor of
>>> periodic fstrim for performance reasons, but that may or may not still
>>> be the case.
>> I understand that most SSDs have queued trim these days, but maybe I'm
>> optimistic.
>>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 17:09                       ` Avi Kivity
@ 2015-12-01 18:03                         ` Carlos Maiolino
  2015-12-01 19:07                           ` Avi Kivity
  2015-12-01 18:51                         ` Brian Foster
  1 sibling, 1 reply; 58+ messages in thread
From: Carlos Maiolino @ 2015-12-01 18:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

Hi Avi,

> >else is going to execute in our place until this thread can make
> >progress.
> 
> For us, nothing else can execute in our place, we usually have exactly one
> thread per logical core.  So we are heavily dependent on io_submit not
> sleeping.
> 
> The case of a contended lock is, to me, less worrying.  It can be reduced by
> using more allocation groups, which is apparently the shared resource under
> contention.
> 

I apologize if I misread your previous comments, but, IIRC you said you can't
change the directory structure your application is using, and IIRC your
application does not spread files across several directories.

XFS spreads files across the allocation groups based on the directory in
which the files are created, trying to keep files as close as possible to
their metadata.
Directories are spread across the AGs in a 'round-robin' way: each
new directory will be created in the next allocation group, and xfs will try
to allocate files in the same AG as their parent directory. (Take a look at
the 'rotorstep' sysctl option for xfs).
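
Roughly (illustrative pseudocode only, not the actual xfs_ialloc.c code),
the placement policy looks like this; the rotorstep sysctl mentioned above
tunes how the rotor advances:

static unsigned int rotor;      /* per-filesystem counter */

unsigned int pick_ag_for_new_dir(unsigned int agcount)
{
        unsigned int ag = rotor;

        rotor = (rotor + 1) % agcount;  /* next directory goes to the next AG */
        return ag;
}

unsigned int pick_ag_for_new_file(unsigned int parent_dir_ag)
{
        return parent_dir_ag;           /* files stay close to their directory */
}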

So, unless you have the files distributed across enough directories, increasing
the number of allocation groups may not change the lock contention you're
facing in this case.

I really don't remember if it has been mentioned already, but if not, it might
be worth taking this point into consideration.

anyway, just my 0.02

> The case of waiting for I/O is much more worrying, because I/O latency are
> much higher.  But it seems like most of the DIO path does not trigger
> locking around I/O (and we are careful to avoid the ones that do, like
> writing beyond eof).
> 
> (sorry for repeating myself, I have the feeling we are talking past each
> other and want to be on the same page)
> 
> >
> >>>  We submit an I/O which is
> >>>asynchronous in nature and wait on a completion, which causes the cpu to
> >>>schedule and execute another task until the completion is set by I/O
> >>>completion (via an async callback). At that point, the issuing thread
> >>>continues where it left off. I suspect I'm missing something... can you
> >>>elaborate on what you'd do differently here (and how it helps)?
> >>Just apply the same technique everywhere: convert locks to trylock +
> >>schedule a continuation on failure.
> >>
> >I'm certainly not an expert on the kernel scheduling, locking and
> >serialization mechanisms, but my understanding is that most things
> >outside of spin locks are reschedule points. For example, the
> >wait_for_completion() calls XFS uses to wait on I/O boil down to
> >schedule_timeout() calls. Buffer locks are implemented as semaphores and
> >down() can end up in the same place.
> 
> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
> __blockdev_direct_IO only launches the I/O, so any locking is only around
> cpu operations and, unless there is contention, won't cause us to sleep in
> io_submit().
> 
> Trying to follow the code, it looks like xfs_get_blocks_direct (and
> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
> we're just lucky to have everything in cache.  If it isn't, we block right
> there.  I really hope I'm misreading this and some other magic is happening
> elsewhere instead of this.
> 
> >Brian
> >
> >>>>Seastar (the async user framework which we use to drive xfs) makes writing
> >>>>code like this easy, using continuations; but of course from ordinary
> >>>>threaded code it can be quite hard.
> >>>>
> >>>>btw, there was an attempt to make ext[34] async using this method, but I
> >>>>think it was ripped out.  Yes, the mortal remains can still be seen with
> >>>>'git grep EIOCBQUEUED'.
> >>>>
> >>>>>>>It sounds to me that first and foremost you want to make sure you don't
> >>>>>>>have however many parallel operations you typically have running
> >>>>>>>contending on the same inodes or AGs. Hint: creating files under
> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under
> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode
> >>>>>>>number).
> >>>>>>Unfortunately our directory layout cannot be changed.  And doesn't this
> >>>>>>require having agcount == O(number of active files)?  That is easily in the
> >>>>>>thousands.
> >>>>>>
> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely
> >>>>>ballpark, but really it's something you'll probably just need to test to
> >>>>>see how far you need to go to avoid AG contention.
> >>>>>
> >>>>>I'm primarily throwing the subdir thing out there for testing purposes.
> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you
> >>>>>can determine whether/how much it really helps with modified AG counts.
> >>>>>I don't know enough about your application design to really comment on
> >>>>>that...
> >>>>We have O(cpus) shards that operate independently.  Each shard writes 32MB
> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> >>>>without blocking); the files are then flushed and closed, and later removed.
> >>>>In parallel there are sequential writes and reads of large files using 128kB
> >>>>buffers), as well as random reads.  Files are immutable (append-only), and
> >>>>if a file is being written, it is not concurrently read.  In general files
> >>>>are not shared across shards.  All I/O is async and O_DIRECT.  open(),
> >>>>truncate(), fdatasync(), and friends are called from a helper thread.
> >>>>
> >>>>As far as I can tell it should a very friendly load for XFS and SSDs.
> >>>>
> >>>>>>>  Reducing the frequency of block allocation/frees might also be
> >>>>>>>another help (e.g., preallocate and reuse files,
> >>>>>>Isn't that discouraged for SSDs?
> >>>>>>
> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed
> >>>>>and thus never discarded..? Are you running fstrim?
> >>>>mount -o discard.  And yes, overwrites are supposedly more expensive than
> >>>>trim old data + allocate new data, but maybe if you compare it with the work
> >>>>XFS has to do, perhaps the tradeoff is bad.
> >>>>
> >>>Ok, my understanding is that '-o discard' is not recommended in favor of
> >>>periodic fstrim for performance reasons, but that may or may not still
> >>>be the case.
> >>I understand that most SSDs have queued trim these days, but maybe I'm
> >>optimistic.
> >>
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

-- 
Carlos

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 17:09                       ` Avi Kivity
  2015-12-01 18:03                         ` Carlos Maiolino
@ 2015-12-01 18:51                         ` Brian Foster
  2015-12-01 19:07                           ` Glauber Costa
  2015-12-01 19:26                           ` Avi Kivity
  1 sibling, 2 replies; 58+ messages in thread
From: Brian Foster @ 2015-12-01 18:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
> 
> 
> On 12/01/2015 06:29 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 06:01 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 04:56 PM, Brian Foster wrote:
> >>>>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>>>>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
> >>>>>>>...
...
> >>>>>>Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
> >>>>>>async tasks?  I don't mind them blocking each other as long as they let my
> >>>>>>io_submit alone.
> >>>>>>
> >>>>>Yeah, it can trigger metadata reads, force the log (the stale buffer
> >>>>>example) or push the AIL (wait on log space). Metadata changes made
> >>>>>directly via your I/O request are logged/committed via transactions,
> >>>>>which are generally processed asynchronously from that point on.
> >>>>>
> >>>>>>>  io_submit() can probably block in a variety of
> >>>>>>>places afaict... it might have to read in the inode extent map, allocate
> >>>>>>>blocks, take inode/ag locks, reserve log space for transactions, etc.
> >>>>>>Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> >>>>>>if somebody else has to do it.
> >>>>>>
> >>>>>I'm not following... if the fs needs to read in the inode extent map to
> >>>>>prepare for an allocation, what else can the thread do but wait? Are you
> >>>>>suggesting the request kick off whatever the blocking action happens to
> >>>>>be asynchronously and return with an error such that the request can be
> >>>>>retried later?
> >>>>Not quite, it should be invisible to the caller.
> >>>>
> >>>>That is, the code called by io_submit() (file_operations::write_iter, it
> >>>>seems to be called today) can kick off this operation and have it continue
> >>>>from where it left off.
> >>>Isn't that generally what happens today?
> >>You tell me.  According to $subject, apparently not enough.  Maybe we're
> >>triggering it more often, or we suffer more when it does trigger (the latter
> >>probably more likely).
> >>
> >The original mail describes looking at the sched:sched_switch tracepoint
> >which on a quick look, appears to fire whenever a cpu context switch
> >occurs. This likely triggers any time we wait on an I/O or a contended
> >lock (among other situations I'm sure), and it signifies that something
> >else is going to execute in our place until this thread can make
> >progress.
> 
> For us, nothing else can execute in our place, we usually have exactly one
> thread per logical core.  So we are heavily dependent on io_submit not
> sleeping.
> 

Yes, this "coroutine model" makes more sense to me from the application
perspective. I'm just trying to understand what you're after from the
kernel perspective.

> The case of a contended lock is, to me, less worrying.  It can be reduced by
> using more allocation groups, which is apparently the shared resource under
> contention.
> 

Yep.

> The case of waiting for I/O is much more worrying, because I/O latency are
> much higher.  But it seems like most of the DIO path does not trigger
> locking around I/O (and we are careful to avoid the ones that do, like
> writing beyond eof).
> 
> (sorry for repeating myself, I have the feeling we are talking past each
> other and want to be on the same page)
> 

Yeah, my point is just that the thread blocking on I/O doesn't mean the
cpu can't carry on with some useful work for another task.

> >
> >>>  We submit an I/O which is
> >>>asynchronous in nature and wait on a completion, which causes the cpu to
> >>>schedule and execute another task until the completion is set by I/O
> >>>completion (via an async callback). At that point, the issuing thread
> >>>continues where it left off. I suspect I'm missing something... can you
> >>>elaborate on what you'd do differently here (and how it helps)?
> >>Just apply the same technique everywhere: convert locks to trylock +
> >>schedule a continuation on failure.
> >>
> >I'm certainly not an expert on the kernel scheduling, locking and
> >serialization mechanisms, but my understanding is that most things
> >outside of spin locks are reschedule points. For example, the
> >wait_for_completion() calls XFS uses to wait on I/O boil down to
> >schedule_timeout() calls. Buffer locks are implemented as semaphores and
> >down() can end up in the same place.
> 
> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
> __blockdev_direct_IO only launches the I/O, so any locking is only around
> cpu operations and, unless there is contention, won't cause us to sleep in
> io_submit().
> 
> Trying to follow the code, it looks like xfs_get_blocks_direct (and
> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
> we're just lucky to have everything in cache.  If it isn't, we block right
> there.  I really hope I'm misreading this and some other magic is happening
> elsewhere instead of this.
> 

Nope, it's synchronous from a code perspective. The
xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
inode bmap metadata if it hasn't been done already. Note that this
should only happen once as everything is stored in-core, so in most
cases this is skipped. It's also possible extents are read in via some
other path/operation on the inode before an async I/O happens to be
submitted (e.g., see some of the other xfs_bmapi_read() callers).

Either way, the extents have to be read in at some point and I'd expect
that cpu to schedule onto some other task while that thread waits on I/O
to complete (read-ahead could also be a factor here, but I haven't
really dug into how that is triggered for buffers).

Brian

> >Brian
> >
> >>>>Seastar (the async user framework which we use to drive xfs) makes writing
> >>>>code like this easy, using continuations; but of course from ordinary
> >>>>threaded code it can be quite hard.
> >>>>
> >>>>btw, there was an attempt to make ext[34] async using this method, but I
> >>>>think it was ripped out.  Yes, the mortal remains can still be seen with
> >>>>'git grep EIOCBQUEUED'.
> >>>>
> >>>>>>>It sounds to me that first and foremost you want to make sure you don't
> >>>>>>>have however many parallel operations you typically have running
> >>>>>>>contending on the same inodes or AGs. Hint: creating files under
> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under
> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode
> >>>>>>>number).
> >>>>>>Unfortunately our directory layout cannot be changed.  And doesn't this
> >>>>>>require having agcount == O(number of active files)?  That is easily in the
> >>>>>>thousands.
> >>>>>>
> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely
> >>>>>ballpark, but really it's something you'll probably just need to test to
> >>>>>see how far you need to go to avoid AG contention.
> >>>>>
> >>>>>I'm primarily throwing the subdir thing out there for testing purposes.
> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you
> >>>>>can determine whether/how much it really helps with modified AG counts.
> >>>>>I don't know enough about your application design to really comment on
> >>>>>that...
> >>>>We have O(cpus) shards that operate independently.  Each shard writes 32MB
> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> >>>>without blocking); the files are then flushed and closed, and later removed.
> >>>>In parallel there are sequential writes and reads of large files using 128kB
> >>>>buffers), as well as random reads.  Files are immutable (append-only), and
> >>>>if a file is being written, it is not concurrently read.  In general files
> >>>>are not shared across shards.  All I/O is async and O_DIRECT.  open(),
> >>>>truncate(), fdatasync(), and friends are called from a helper thread.
> >>>>
> >>>>As far as I can tell it should a very friendly load for XFS and SSDs.
> >>>>
> >>>>>>>  Reducing the frequency of block allocation/frees might also be
> >>>>>>>another help (e.g., preallocate and reuse files,
> >>>>>>Isn't that discouraged for SSDs?
> >>>>>>
> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed
> >>>>>and thus never discarded..? Are you running fstrim?
> >>>>mount -o discard.  And yes, overwrites are supposedly more expensive than
> >>>>trim old data + allocate new data, but maybe if you compare it with the work
> >>>>XFS has to do, perhaps the tradeoff is bad.
> >>>>
> >>>Ok, my understanding is that '-o discard' is not recommended in favor of
> >>>periodic fstrim for performance reasons, but that may or may not still
> >>>be the case.
> >>I understand that most SSDs have queued trim these days, but maybe I'm
> >>optimistic.
> >>
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 18:03                         ` Carlos Maiolino
@ 2015-12-01 19:07                           ` Avi Kivity
  2015-12-01 21:19                             ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 19:07 UTC (permalink / raw)
  To: Glauber Costa, xfs

On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
> Hi Avi,
>
>>> else is going to execute in our place until this thread can make
>>> progress.
>> For us, nothing else can execute in our place, we usually have exactly one
>> thread per logical core.  So we are heavily dependent on io_submit not
>> sleeping.
>>
>> The case of a contended lock is, to me, less worrying.  It can be reduced by
>> using more allocation groups, which is apparently the shared resource under
>> contention.
>>
> I apologize if I misread your previous comments, but, IIRC you said you can't
> change the directory structure your application is using, and IIRC your
> application does not spread files across several directories.

I miswrote somewhat: the application writes data files and commitlog 
files.  The data file directory structure is fixed due to compatibility 
concerns (it is not a single directory, but some workloads will see most 
access on files in a single directory).  The commitlog directory 
structure is more relaxed, and we can split it into a directory per shard 
(=cpu) or something else.

If worst comes to worst, we'll hack around this and distribute the data 
files into more directories, and provide some hack for compatibility.

> XFS spread files across the allocation groups, based on the directory these
> files are created,

Idea: create the files in some subdirectory, and immediately move them 
to their required location.
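
Something like this (an illustrative sketch; it assumes, per your
description, that the AG is picked from the parent directory at create time
and is not affected by the later rename):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int create_spread_out(const char *tmpdir, int shard, const char *final_path)
{
        char tmp[4096];
        int fd;

        /* one subdirectory per shard under tmpdir, so inodes land in
         * different AGs even though final_path is in a single directory */
        snprintf(tmp, sizeof(tmp), "%s/%d/newfile", tmpdir, shard);
        fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
                return -1;
        if (rename(tmp, final_path) < 0) {      /* the extra system call */
                unlink(tmp);
                close(fd);
                return -1;
        }
        return fd;
}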

>   trying to keep files as close as possible from their
> metadata.

This is pointless for an SSD. Perhaps XFS should randomize the ag on 
nonrotational media instead.


> Directories are spreaded across the AGs in a 'round-robin' way, each
> new directory, will be created in the next allocation group, and, xfs will try
> to allocate the files in the same AG as its parent directory. (Take a look at
> the 'rotorstep' sysctl option for xfs).
>
> So, unless you have the files distributed across enough directories, increasing
> the number of allocation groups may not change the lock contention you're
> facing in this case.
>
> I really don't remember if it has been mentioned already, but if not, it might
> be worth to take this point in consideration.

Thanks.  I think you should really consider randomizing the ag for SSDs, 
and meanwhile, we can just use the creation-directory hack to get the 
same effect, at the cost of an extra system call.  So at least for this 
problem, there is a solution.

> anyway, just my 0.02
>
>> The case of waiting for I/O is much more worrying, because I/O latency are
>> much higher.  But it seems like most of the DIO path does not trigger
>> locking around I/O (and we are careful to avoid the ones that do, like
>> writing beyond eof).
>>
>> (sorry for repeating myself, I have the feeling we are talking past each
>> other and want to be on the same page)
>>
>>>>>   We submit an I/O which is
>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>> schedule and execute another task until the completion is set by I/O
>>>>> completion (via an async callback). At that point, the issuing thread
>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>> schedule a continuation on failure.
>>>>
>>> I'm certainly not an expert on the kernel scheduling, locking and
>>> serialization mechanisms, but my understanding is that most things
>>> outside of spin locks are reschedule points. For example, the
>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>> down() can end up in the same place.
>> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>> cpu operations and, unless there is contention, won't cause us to sleep in
>> io_submit().
>>
>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>> we're just lucky to have everything in cache.  If it isn't, we block right
>> there.  I really hope I'm misreading this and some other magic is happening
>> elsewhere instead of this.
>>
>>> Brian
>>>
>>>>>> Seastar (the async user framework which we use to drive xfs) makes writing
>>>>>> code like this easy, using continuations; but of course from ordinary
>>>>>> threaded code it can be quite hard.
>>>>>>
>>>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>>>> think it was ripped out.  Yes, the mortal remains can still be seen with
>>>>>> 'git grep EIOCBQUEUED'.
>>>>>>
>>>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>>>> have however many parallel operations you typically have running
>>>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>>>> number).
>>>>>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>>>>>> require having agcount == O(number of active files)?  That is easily in the
>>>>>>>> thousands.
>>>>>>>>
>>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>>>> see how far you need to go to avoid AG contention.
>>>>>>>
>>>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>>>> I don't know enough about your application design to really comment on
>>>>>>> that...
>>>>>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>>>> without blocking); the files are then flushed and closed, and later removed.
>>>>>> In parallel there are sequential writes and reads of large files using 128kB
>>>>>> buffers), as well as random reads.  Files are immutable (append-only), and
>>>>>> if a file is being written, it is not concurrently read.  In general files
>>>>>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>>>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>>>
>>>>>> As far as I can tell it should a very friendly load for XFS and SSDs.
>>>>>>
>>>>>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>>>>>> another help (e.g., preallocate and reuse files,
>>>>>>>> Isn't that discouraged for SSDs?
>>>>>>>>
>>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>>>> and thus never discarded..? Are you running fstrim?
>>>>>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>>>>>> trim old data + allocate new data, but maybe if you compare it with the work
>>>>>> XFS has to do, perhaps the tradeoff is bad.
>>>>>>
>>>>> Ok, my understanding is that '-o discard' is not recommended in favor of
>>>>> periodic fstrim for performance reasons, but that may or may not still
>>>>> be the case.
>>>> I understand that most SSDs have queued trim these days, but maybe I'm
>>>> optimistic.
>>>>
>> _______________________________________________
>> xfs mailing list
>> xfs@oss.sgi.com
>> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 18:51                         ` Brian Foster
@ 2015-12-01 19:07                           ` Glauber Costa
  2015-12-01 19:35                             ` Brian Foster
  2015-12-01 19:26                           ` Avi Kivity
  1 sibling, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2015-12-01 19:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: Avi Kivity, xfs

Hi Brian,


>
> Either way, the extents have to be read in at some point and I'd expect
> that cpu to schedule onto some other task while that thread waits on I/O
> to complete (read-ahead could also be a factor here, but I haven't
> really dug into how that is triggered for buffers).
>


Being a datastore, we expect to run practically alone in any box we're
at. That means that there is no other task to run. If io_submit
blocks, the system blocks. The assumption that blocking will just
yield the processor for another thread makes sense in the general case
where you assume more than one application running and/or more than
one thread within the same application.

From our user's perspective, however, every time that happens we can't
make progress. It doesn't really matter where it blocks.

If io_submit returns without blocking, we can still push more work,
even though the kernel is still not ready to proceed. If it blocks,
we're dead.

> Brian
>
>> >Brian
>> >
>> >>>>Seastar (the async user framework which we use to drive xfs) makes writing
>> >>>>code like this easy, using continuations; but of course from ordinary
>> >>>>threaded code it can be quite hard.
>> >>>>
>> >>>>btw, there was an attempt to make ext[34] async using this method, but I
>> >>>>think it was ripped out.  Yes, the mortal remains can still be seen with
>> >>>>'git grep EIOCBQUEUED'.
>> >>>>
>> >>>>>>>It sounds to me that first and foremost you want to make sure you don't
>> >>>>>>>have however many parallel operations you typically have running
>> >>>>>>>contending on the same inodes or AGs. Hint: creating files under
>> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under
>> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode
>> >>>>>>>number).
>> >>>>>>Unfortunately our directory layout cannot be changed.  And doesn't this
>> >>>>>>require having agcount == O(number of active files)?  That is easily in the
>> >>>>>>thousands.
>> >>>>>>
>> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely
>> >>>>>ballpark, but really it's something you'll probably just need to test to
>> >>>>>see how far you need to go to avoid AG contention.
>> >>>>>
>> >>>>>I'm primarily throwing the subdir thing out there for testing purposes.
>> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you
>> >>>>>can determine whether/how much it really helps with modified AG counts.
>> >>>>>I don't know enough about your application design to really comment on
>> >>>>>that...
>> >>>>We have O(cpus) shards that operate independently.  Each shard writes 32MB
>> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>> >>>>without blocking); the files are then flushed and closed, and later removed.
>> >>>>In parallel there are sequential writes and reads of large files using 128kB
>> >>>>buffers), as well as random reads.  Files are immutable (append-only), and
>> >>>>if a file is being written, it is not concurrently read.  In general files
>> >>>>are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>> >>>>truncate(), fdatasync(), and friends are called from a helper thread.
>> >>>>
>> >>>>As far as I can tell it should a very friendly load for XFS and SSDs.
>> >>>>
>> >>>>>>>  Reducing the frequency of block allocation/frees might also be
>> >>>>>>>another help (e.g., preallocate and reuse files,
>> >>>>>>Isn't that discouraged for SSDs?
>> >>>>>>
>> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed
>> >>>>>and thus never discarded..? Are you running fstrim?
>> >>>>mount -o discard.  And yes, overwrites are supposedly more expensive than
>> >>>>trim old data + allocate new data, but maybe if you compare it with the work
>> >>>>XFS has to do, perhaps the tradeoff is bad.
>> >>>>
>> >>>Ok, my understanding is that '-o discard' is not recommended in favor of
>> >>>periodic fstrim for performance reasons, but that may or may not still
>> >>>be the case.
>> >>I understand that most SSDs have queued trim these days, but maybe I'm
>> >>optimistic.
>> >>
>>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: sleeps and waits during io_submit
  2015-12-01 18:51                         ` Brian Foster
  2015-12-01 19:07                           ` Glauber Costa
@ 2015-12-01 19:26                           ` Avi Kivity
  2015-12-01 19:41                             ` Christoph Hellwig
  2015-12-02  0:13                             ` Brian Foster
  1 sibling, 2 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 19:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs

On 12/01/2015 08:51 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
>>
>> On 12/01/2015 06:29 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 06:01 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>>>>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>>>>>>>> ...
> ...
>>>>>>>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>>>>>>>> async tasks?  I don't mind them blocking each other as long as they let my
>>>>>>>> io_submit alone.
>>>>>>>>
>>>>>>> Yeah, it can trigger metadata reads, force the log (the stale buffer
>>>>>>> example) or push the AIL (wait on log space). Metadata changes made
>>>>>>> directly via your I/O request are logged/committed via transactions,
>>>>>>> which are generally processed asynchronously from that point on.
>>>>>>>
>>>>>>>>>   io_submit() can probably block in a variety of
>>>>>>>>> places afaict... it might have to read in the inode extent map, allocate
>>>>>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>>>>>>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>>>>>>>> if somebody else has to do it.
>>>>>>>>
>>>>>>> I'm not following... if the fs needs to read in the inode extent map to
>>>>>>> prepare for an allocation, what else can the thread do but wait? Are you
>>>>>>> suggesting the request kick off whatever the blocking action happens to
>>>>>>> be asynchronously and return with an error such that the request can be
>>>>>>> retried later?
>>>>>> Not quite, it should be invisible to the caller.
>>>>>>
>>>>>> That is, the code called by io_submit() (file_operations::write_iter, it
>>>>>> seems to be called today) can kick off this operation and have it continue
> >>>>>> from where it left off.
>>>>> Isn't that generally what happens today?
>>>> You tell me.  According to $subject, apparently not enough.  Maybe we're
>>>> triggering it more often, or we suffer more when it does trigger (the latter
>>>> probably more likely).
>>>>
>>> The original mail describes looking at the sched:sched_switch tracepoint
>>> which on a quick look, appears to fire whenever a cpu context switch
>>> occurs. This likely triggers any time we wait on an I/O or a contended
>>> lock (among other situations I'm sure), and it signifies that something
>>> else is going to execute in our place until this thread can make
>>> progress.
>> For us, nothing else can execute in our place, we usually have exactly one
>> thread per logical core.  So we are heavily dependent on io_submit not
>> sleeping.
>>
> Yes, this "coroutine model" makes more sense to me from the application
> perspective. I'm just trying to understand what you're after from the
> kernel perspective.

It's basically the same thing.  To do this, we'd have get_block either 
return the block's address (if it was in some metadata cache), or, if it 
was not, issue an I/O that fills (part of) that cache, and as its 
completion function, a continuation that reruns __blockdev_direct_IO 
from the point it was stopped so it can submit the data I/O (if the 
metadata cache was completely updated) or issue the next I/O aiming to 
fill that metadata cache, if it was not.

Without that (and the more complicated code for the write path) 
io_submit is basically unusable. Yes, parts of it are asynchronous, but 
if other parts of it are still synchronous, we end up requiring 
thread_count > cpu_count and now we have to context switch constantly.
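
In pseudocode, the behaviour I'm after would be something like the
following (every identifier here is hypothetical; this is not an existing
kernel interface):

static int get_block_async(struct dio_state *dio, sector_t iblock)
{
        struct block_mapping m;

        if (lookup_cached_mapping(dio->inode, iblock, &m)) {
                dio_continue_with_mapping(dio, &m);     /* submit the data I/O */
                return 0;
        }

        /* cache miss: read the extent metadata asynchronously and rerun
         * the direct-I/O state machine from where it stopped */
        submit_metadata_read(dio->inode, iblock, dio_resume /* completion */, dio);
        return -EIOCBQUEUED;    /* io_submit() returns to userspace immediately */
}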

>
>> The case of a contended lock is, to me, less worrying.  It can be reduced by
>> using more allocation groups, which is apparently the shared resource under
>> contention.
>>
> Yep.
>
>> The case of waiting for I/O is much more worrying, because I/O latency are
>> much higher.  But it seems like most of the DIO path does not trigger
>> locking around I/O (and we are careful to avoid the ones that do, like
>> writing beyond eof).
>>
>> (sorry for repeating myself, I have the feeling we are talking past each
>> other and want to be on the same page)
>>
> Yeah, my point is just that just because the thread blocked on I/O,
> doesn't mean the cpu can't carry on with some useful work for another
> task.

In our case, there is no other task.  We run one thread per logical 
core, so if that thread gets blocked, the cpu idles.

The whole point of io_submit() is to issue an I/O and let the caller 
continue processing immediately.  It is the equivalent of O_NONBLOCK for 
networking code.  If O_NONBLOCK did block from time to time, practically 
all modern network applications would see a huge performance drop.
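
That model, in its minimal form (a libaio sketch, error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>

int main(void)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd = open("datafile", O_RDONLY | O_DIRECT);

        posix_memalign(&buf, 4096, 4096);       /* O_DIRECT alignment */
        io_setup(128, &ctx);
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);                 /* must not sleep for this to work */

        /* ... the reactor runs other continuations here ... */

        io_getevents(ctx, 1, 1, &ev, NULL);     /* reap whenever convenient */
        io_destroy(ctx);
        return 0;
}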

>
>>>>>   We submit an I/O which is
>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>> schedule and execute another task until the completion is set by I/O
>>>>> completion (via an async callback). At that point, the issuing thread
>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>> schedule a continuation on failure.
>>>>
>>> I'm certainly not an expert on the kernel scheduling, locking and
>>> serialization mechanisms, but my understanding is that most things
>>> outside of spin locks are reschedule points. For example, the
>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>> down() can end up in the same place.
>> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>> cpu operations and, unless there is contention, won't cause us to sleep in
>> io_submit().
>>
>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>> we're just lucky to have everything in cache.  If it isn't, we block right
>> there.  I really hope I'm misreading this and some other magic is happening
>> elsewhere instead of this.
>>
> Nope, it's synchronous from a code perspective. The
> xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
> inode bmap metadata if it hasn't been done already. Note that this
> should only happen once as everything is stored in-core, so in most
> cases this is skipped. It's also possible extents are read in via some
> other path/operation on the inode before an async I/O happens to be
> submitted (e.g., see some of the other xfs_bmapi_read() callers).

Is there (could we add) some ioctl to prime this cache?  We could call 
it from a worker thread where we don't mind blocking during open.

What is the eviction policy for this cache?   Is it simply the block 
device's page cache?
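
For what it's worth, one thing we could try from userspace today is 
walking the file's mapping with FIEMAP from a helper thread at open 
time. Whether FS_IOC_FIEMAP actually forces the bmap btree into the 
in-core cache is an assumption on my part, not something XFS promises, 
but the call itself would look like:

/* Sketch: prime the extent map from a thread where blocking is fine. */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int prime_extent_map(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm.fm_extent_count = 0;			/* just count extents, copy none */

	/* Walking the mapping should pull the extent list into memory. */
	return ioctl(fd, FS_IOC_FIEMAP, &fm);
}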

What about the write path, will we see the same problems there?  I would 
guess the problem is less severe there if the metadata is written with 
writeback policy.

>
> Either way, the extents have to be read in at some point and I'd expect
> that cpu to schedule onto some other task while that thread waits on I/O
> to complete (read-ahead could also be a factor here, but I haven't
> really dug into how that is triggered for buffers).

To provide an example, our application, which is a database, faces this 
problem exact at a higher level.  Data is stored in data files, and data 
items' locations are stored in index files. When we read a bit of data, 
we issue an index read, and pass it a continuation to be executed when 
the read completes.  This latter continuation parses the data and passes 
it to the code that prepares it for merging with data from other data 
files, and an eventual return to the user.

Having written code for over a year in this style, I've come to expect 
it to be used everywhere asynchronous I/O is used, but I realize it is 
fairly hard without good support from a framework that allows 
continuations to be composed in a natural way.
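
In plain C with libaio the shape of it is roughly the following (our 
real code uses Seastar futures; parse_index() and issue_data_read() are 
made-up stand-ins for the real continuations):

/* Continuation-passing read: the completion callback rides along in
 * iocb->data and is invoked from the event loop when the read finishes. */
#include <libaio.h>

struct read_op {
	struct iocb iocb;
	void (*cont)(struct read_op *op, long res);	/* the continuation */
	char *buf;
};

extern void parse_index(char *buf, long res);		/* made up */
extern void issue_data_read(struct read_op *op);	/* made up */

static void on_index_read(struct read_op *op, long res)
{
	parse_index(op->buf, res);	/* parse the entry we just read... */
	issue_data_read(op);		/* ...then chain the data read */
}

static int read_index(io_context_t ctx, int fd, struct read_op *op,
		      long long offset, size_t len)
{
	struct iocb *list[1] = { &op->iocb };

	op->cont = on_index_read;
	io_prep_pread(&op->iocb, fd, op->buf, len, offset);
	op->iocb.data = op;		/* event loop finds op, calls op->cont */
	return io_submit(ctx, 1, list);	/* must not block for this to work */
}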



* Re: sleeps and waits during io_submit
  2015-12-01 19:07                           ` Glauber Costa
@ 2015-12-01 19:35                             ` Brian Foster
  2015-12-01 19:45                               ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Brian Foster @ 2015-12-01 19:35 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, xfs

On Tue, Dec 01, 2015 at 02:07:41PM -0500, Glauber Costa wrote:
> Hi Brian,
> 
> 
> >
> > Either way, the extents have to be read in at some point and I'd expect
> > that cpu to schedule onto some other task while that thread waits on I/O
> > to complete (read-ahead could also be a factor here, but I haven't
> > really dug into how that is triggered for buffers).
> >
> 
> 
> Being a datastore, we expect to run practically alone in any box we're
> at. That means that there is no other task to run. If io_submit
> blocks, the system blocks. The assumption that blocking will just
> yield the processor for another thread makes sense in the general case
> where you assume more than one application running and/or more than
> one thread within the same application.
> 

Hmm, well that helps me understand the concern a bit more. That said, I
still question how likely this condition is. Even if this is a
completely stripped down userspace with no other applications running,
the kernel (or even XFS) alone might have plenty of threads/work items
to execute to take care of "background" tasks for various subsystems.

Of course, we don't have all of the details of your environment so
perhaps this is not the case. Perhaps a more productive approach here
might be to find a way to detect this particular case (once you've
worked out the other AG count tunings and whatnot that you want to use)
where a thread into the fs is blocked and actually has nothing else to
do and work from there. I _think_ there is such a thing as an idle task
somewhere that might be useful to help quantify this, but I'd have to
dig around to understand it better.

That actually gives us a concrete scenario to work with, try to
reproduce and improve on. It also facilitates improvements that might be
beneficial to the general use case as opposed to tailored for this
particular use case and highly specific environment. For example, if we
find a particular sustained workload that repetitively blocks with
nothing else to do, document and characterize it for the list and I'm
sure people will come up with a variety of ideas to try and address it.
Otherwise, we're kind of just looking around for context switch points
and assuming that they will all just block with nothing else to do. For
one, I don't think that's really accurate. It's also not a very productive
approach, and it doesn't have any measurable benefit if it doesn't come
along with a test case or reproducible condition.

Brian

> From our user's perspective, however, every time that happens we can't
> make progress. It doesn't really matter where it blocks.
> 
> If io_submit returns without blocking, we can still push more work,
> even though the kernel is still not ready to proceed. If it blocks,
> we're dead.
> 
> > Brian
> >
> >> >Brian
> >> >
> >> >>>>Seastar (the async user framework which we use to drive xfs) makes writing
> >> >>>>code like this easy, using continuations; but of course from ordinary
> >> >>>>threaded code it can be quite hard.
> >> >>>>
> >> >>>>btw, there was an attempt to make ext[34] async using this method, but I
> >> >>>>think it was ripped out.  Yes, the mortal remains can still be seen with
> >> >>>>'git grep EIOCBQUEUED'.
> >> >>>>
> >> >>>>>>>It sounds to me that first and foremost you want to make sure you don't
> >> >>>>>>>have however many parallel operations you typically have running
> >> >>>>>>>contending on the same inodes or AGs. Hint: creating files under
> >> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under
> >> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode
> >> >>>>>>>number).
> >> >>>>>>Unfortunately our directory layout cannot be changed.  And doesn't this
> >> >>>>>>require having agcount == O(number of active files)?  That is easily in the
> >> >>>>>>thousands.
> >> >>>>>>
> >> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely
> >> >>>>>ballpark, but really it's something you'll probably just need to test to
> >> >>>>>see how far you need to go to avoid AG contention.
> >> >>>>>
> >> >>>>>I'm primarily throwing the subdir thing out there for testing purposes.
> >> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you
> >> >>>>>can determine whether/how much it really helps with modified AG counts.
> >> >>>>>I don't know enough about your application design to really comment on
> >> >>>>>that...
> >> >>>>We have O(cpus) shards that operate independently.  Each shard writes 32MB
> >> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes
> >> >>>>without blocking); the files are then flushed and closed, and later removed.
> >> >>>>In parallel there are sequential writes and reads of large files (using 128kB
> >> >>>>buffers), as well as random reads.  Files are immutable (append-only), and
> >> >>>>if a file is being written, it is not concurrently read.  In general files
> >> >>>>are not shared across shards.  All I/O is async and O_DIRECT.  open(),
> >> >>>>truncate(), fdatasync(), and friends are called from a helper thread.
> >> >>>>
> >> >>>>As far as I can tell it should be a very friendly load for XFS and SSDs.
> >> >>>>
> >> >>>>>>>  Reducing the frequency of block allocation/frees might also be
> >> >>>>>>>another help (e.g., preallocate and reuse files,
> >> >>>>>>Isn't that discouraged for SSDs?
> >> >>>>>>
> >> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed
> >> >>>>>and thus never discarded..? Are you running fstrim?
> >> >>>>mount -o discard.  And yes, overwrites are supposedly more expensive than
> >> >>>>trim old data + allocate new data, but maybe if you compare it with the work
> >> >>>>XFS has to do, perhaps the tradeoff is bad.
> >> >>>>
> >> >>>Ok, my understanding is that '-o discard' is not recommended in favor of
> >> >>>periodic fstrim for performance reasons, but that may or may not still
> >> >>>be the case.
> >> >>I understand that most SSDs have queued trim these days, but maybe I'm
> >> >>optimistic.
> >> >>
> >>


* Re: sleeps and waits during io_submit
  2015-12-01 19:26                           ` Avi Kivity
@ 2015-12-01 19:41                             ` Christoph Hellwig
  2015-12-01 19:50                               ` Avi Kivity
  2015-12-02  0:13                             ` Brian Foster
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2015-12-01 19:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs

On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
> It's basically the same thing.  To do this, we'd have get_block either
> return the block's address (if it was in some metadata cache), or, if it was
> not, issue an I/O that fills (part of) that cache, and as its completion
> function, a continuation that reruns __blockdev_direct_IO from the point it
> was stopped so it can submit the data I/O (if the metadata cache was
> completely updated) or issue the next I/O aiming to fill that metadata
> cache, if it was not.

We did something like this for blocking reads with great results, and it could be
done similarly for direct I/O, I think:

	https://lwn.net/Articles/612483/

Unfortunately Andrew shut it down for odd reasons so it didn't get in.


* Re: sleeps and waits during io_submit
  2015-12-01 19:35                             ` Brian Foster
@ 2015-12-01 19:45                               ` Avi Kivity
  0 siblings, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 19:45 UTC (permalink / raw)
  To: Brian Foster, Glauber Costa; +Cc: xfs

On 12/01/2015 09:35 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 02:07:41PM -0500, Glauber Costa wrote:
>> Hi Brian,
>>
>>
>>> Either way, the extents have to be read in at some point and I'd expect
>>> that cpu to schedule onto some other task while that thread waits on I/O
>>> to complete (read-ahead could also be a factor here, but I haven't
>>> really dug into how that is triggered for buffers).
>>>
>>
>> Being a datastore, we expect to run practically alone in any box we're
>> at. That means that there is no other task to run. If io_submit
>> blocks, the system blocks. The assumption that blocking will just
>> yield the processor for another thread makes sense in the general case
>> where you assume more than one application running and/or more than
>> one thread within the same application.
>>
> Hmm, well that helps me understand the concern a bit more. That said, I
> still question how likely this condition is. Even if this is a
> completely stripped down userspace with no other applications running,
> the kernel (or even XFS) alone might have plenty of threads/work items
> to execute to take care of "background" tasks for various subsystems.

There are not.  We grab almost all of memory.  All our I/O is O_DIRECT 
so there is no page cache to write back.  There may be softirq work from 
networking, but in one mode (not yet in production) we use a 
userspace networking stack, so there is no softirq at all.

That said, I doubt this is a problem now.  Because the files are large 
and well laid out, the amount of metadata is small and can easily be cached.

We might prime the metadata cache before launching the application, or 
just ignore the whole problem.  It would be much worse with small files, 
but that isn't the case for us.

>
> Of course, we don't have all of the details of your environment so
> perhaps this is not the case. Perhaps a more productive approach here
> might be to find a way to detect this particular case (once you've
> worked out the other AG count tunings and whatnot that you want to use)
> where a thread in the fs is blocked and actually has nothing else to
> do, and work from there. I _think_ there is such a thing as an idle task
> somewhere that might be useful to help quantify this, but I'd have to
> dig around to understand it better.

We simply observe the idle cpu counter going above zero.

Once we resolve the other issues, we'll instrument the kernel with 
systemtap and see where the other blockages come from.

> That actually gives us a concrete scenario to work with, try to
> reproduce and improve on. It also facilitates improvements that might be
> beneficial to the general use case as opposed to tailored for this
> particular use case and highly specific environment. For example, if we
> find a particular sustained workload that repetitively blocks with
> nothing else to do, document and characterize it for the list and I'm
> sure people will come up with a variety of ideas to try and address it.
> Otherwise, we're kind of just looking around for context switch points
> and assuming that they will all just block with nothing else to do. For
> one, I don't think that's really accurate. It's also not a very productive
> approach, and it doesn't have any measurable benefit if it doesn't come
> along with a test case or reproducible condition.

I agree completely.  We'll try to find better probe points than 
schedule().  We'll also be able to come up with reproducers; this should 
not be too hard once we have good instrumentation.


> Brian
>
>>  From our user's perspective, however, every time that happens we can't
>> make progress. It doesn't really matter where it blocks.
>>
>> If io_submit returns without blocking, we can still push more work,
>> even though the kernel is still not ready to proceed. If it blocks,
>> we're dead.
>>
>>> Brian
>>>
>>>>> Brian
>>>>>
>>>>>>>> Seastar (the async user framework which we use to drive xfs) makes writing
>>>>>>>> code like this easy, using continuations; but of course from ordinary
>>>>>>>> threaded code it can be quite hard.
>>>>>>>>
>>>>>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>>>>>> think it was ripped out.  Yes, the mortal remains can still be seen with
>>>>>>>> 'git grep EIOCBQUEUED'.
>>>>>>>>
>>>>>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>>>>>> have however many parallel operations you typically have running
>>>>>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>>>>>> number).
>>>>>>>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>>>>>>>> require having agcount == O(number of active files)?  That is easily in the
>>>>>>>>>> thousands.
>>>>>>>>>>
>>>>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>>>>>> see how far you need to go to avoid AG contention.
>>>>>>>>>
>>>>>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>>>>>> I don't know enough about your application design to really comment on
>>>>>>>>> that...
>>>>>>>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>>>>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>>>>>> without blocking); the files are then flushed and closed, and later removed.
>>>>>>>> In parallel there are sequential writes and reads of large files (using 128kB
>>>>>>>> buffers), as well as random reads.  Files are immutable (append-only), and
>>>>>>>> if a file is being written, it is not concurrently read.  In general files
>>>>>>>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>>>>>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>>>>>
>>>>>>>> As far as I can tell it should be a very friendly load for XFS and SSDs.
>>>>>>>>
>>>>>>>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>>>>>>>> another help (e.g., preallocate and reuse files,
>>>>>>>>>> Isn't that discouraged for SSDs?
>>>>>>>>>>
>>>>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>>>>>> and thus never discarded..? Are you running fstrim?
>>>>>>>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>>>>>>>> trim old data + allocate new data, but maybe if you compare it with the work
>>>>>>>> XFS has to do, perhaps the tradeoff is bad.
>>>>>>>>
>>>>>>> Ok, my understanding is that '-o discard' is not recommended in favor of
>>>>>>> periodic fstrim for performance reasons, but that may or may not still
>>>>>>> be the case.
>>>>>> I understand that most SSDs have queued trim these days, but maybe I'm
>>>>>> optimistic.
>>>>>>


* Re: sleeps and waits during io_submit
  2015-12-01 19:41                             ` Christoph Hellwig
@ 2015-12-01 19:50                               ` Avi Kivity
  0 siblings, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 19:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, Glauber Costa, xfs

On 12/01/2015 09:41 PM, Christoph Hellwig wrote:
> On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
>> It's basically the same thing.  To do this, we'd have get_block either
>> return the block's address (if it was in some metadata cache), or, if it was
>> not, issue an I/O that fills (part of) that cache, and as its completion
>> function, a continuation that reruns __blockdev_direct_IO from the point it
>> was stopped so it can submit the data I/O (if the metadata cache was
>> completely updated) or issue the next I/O aiming to fill that metadata
>> cache, if it was not.
> We did something like this for blocking reads with great results, and it could be
> done similarly for direct I/O I think:
>
> 	https://lwn.net/Articles/612483/
>
> Unfortunately Andrew shut it down for odd reasons so it didn't get in.

How would this work?  io_submit() returns -ENOTALLMETADATAISINCACHE, 
user calls io_submit() again from a worker thread, where he doesn't mind 
blocking?

In fact sys_io_submit() could catch this error and resubmit the I/O on 
its own using a work item, and io_submit() would become non-blocking, at 
least on I/O (lock contention may still be a problem, but a smaller one).
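
On the application side the first variant would look something like this 
(the error value is made up, as is punt_to_worker()):

/* Sketch: submit from the reactor thread; if the kernel says it would
 * have had to block on metadata, hand the iocb to a worker thread that
 * is allowed to call io_submit() synchronously. */
#include <libaio.h>

extern void punt_to_worker(io_context_t ctx, struct iocb *iocb);	/* hypothetical */

static int submit_nonblocking(io_context_t ctx, struct iocb *iocb)
{
	struct iocb *list[1] = { iocb };
	int ret = io_submit(ctx, 1, list);

	if (ret == -ENOTALLMETADATAISINCACHE) {	/* hypothetical errno */
		punt_to_worker(ctx, iocb);
		return 0;
	}
	return ret;
}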


* Re: sleeps and waits during io_submit
  2015-11-30 23:51   ` Glauber Costa
@ 2015-12-01 20:30     ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 20:30 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, xfs

On Mon, Nov 30, 2015 at 06:51:51PM -0500, Glauber Costa wrote:
> On Mon, Nov 30, 2015 at 6:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> > Let me have a think about how we can implement lazytime in a sane
> > way, such that fsync() works correctly, we don't throw away
> > timstamp changes in memory reclaim and we don't write unlogged
> > changes to the on-disk locations....
> 
> I trust you fully for matters related to speed.
> 
> Keep in mind, though, that at least for us the fact that it blocks is
> a lot worse than the fact that it is slow. We can work around slow,
> but blocking basically means that we won't have any more work to push
> - since we don't do threading. The processor that stales just sits
> idle until the lock is released. So any non-blocking solution to this
> would already be a win for us.

Right, the blocking is on the inode lock needed to do the
transactional update of the timestamp. lazytime would need to avoid
the timestamp update transaction completely, but we still need to
capture the timestamp and run the transaction later or capture it in
a subsequent change before we write back the inode.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 14:01             ` Glauber Costa
  2015-12-01 14:37               ` Avi Kivity
@ 2015-12-01 20:45               ` Dave Chinner
  2015-12-01 20:56                 ` Avi Kivity
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 20:45 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, Brian Foster, xfs

On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
> On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote:
> > On 12/01/2015 03:11 PM, Brian Foster wrote:
> >> It sounds to me that first and foremost you want to make sure you don't
> >> have however many parallel operations you typically have running
> >> contending on the same inodes or AGs. Hint: creating files under
> >> separate subdirectories is a quick and easy way to allocate inodes under
> >> separate AGs (the agno is encoded into the upper bits of the inode
> >> number).
> >
> >
> > Unfortunately our directory layout cannot be changed.  And doesn't this
> > require having agcount == O(number of active files)?  That is easily in the
> > thousands.
> 
> Actually, wouldn't agcount == O(nr_cpus) be good enough?

Not quite. What you need is agcount ~= O(nr_active_allocations).

The difference is an allocation can block waiting on IO, and the
CPU can then go off and run another process, which then tries to do
an allocation. So you might only have 4 CPUs, but a workload
can have a hundred active allocations at once (not uncommon in
file server workloads).

On workloads that are roughly 1 process per CPU, it's typical that
agcount = 2 * N cpus gives pretty good results on large filesystems.
If you've got 400GB filesystems or you are using spinning disks,
then you probably don't want to go above 16 AGs, because then you
have problems with maintaining contiguous free space and you'll
seek the spinning disks to death....

> >> 'mount -o ikeep,'
> >
> >
> > Interesting.  Our files are large so we could try this.

Keep in mind that ikeep means that inode allocation permanently
fragments free space, which can affect how large files are allocated
once you truncate/rm the original files.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 20:45               ` Dave Chinner
@ 2015-12-01 20:56                 ` Avi Kivity
  2015-12-01 23:41                   ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 20:56 UTC (permalink / raw)
  To: Dave Chinner, Glauber Costa; +Cc: Brian Foster, xfs

On 12/01/2015 10:45 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
>> On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote:
>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>> It sounds to me that first and foremost you want to make sure you don't
>>>> have however many parallel operations you typically have running
>>>> contending on the same inodes or AGs. Hint: creating files under
>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>> number).
>>>
>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>> require having agcount == O(number of active files)?  That is easily in the
>>> thousands.
>> Actually, wouldn't agcount == O(nr_cpus) be good enough?
> Not quite. What you need is agcount ~= O(nr_active_allocations).

Yes, this is what I mean by "active files".

>
> The difference is an allocation can block waiting on IO, and the
> CPU can then go off and run another process, which then tries to do
> an allocation. So you might only have 4 CPUs, but a workload
> can have a hundred active allocations at once (not uncommon in
> file server workloads).

But for us, probably not much more.  We try to restrict active I/Os to 
the effective disk queue depth (more than that and they just turn sour 
waiting in the disk queue).


> On workloads that are roughly 1 process per CPU, it's typical that
> agcount = 2 * N cpus gives pretty good results on large filesystems.

This is probably using sync calls.  Using async calls you can have many 
more I/Os in progress (but still limited by effective disk queue depth).

> If you've got 400GB filesystems or you are using spinning disks,
> then you probably don't want to go above 16 AGs, because then you
> have problems with maintaining contiguous free space and you'll
> seek the spinning disks to death....

We're concentrating on SSDs for now.

>
>>>> 'mount -o ikeep,'
>>>
>>> Interesting.  Our files are large so we could try this.
> Keep in mind that ikeep means that inode allocation permanently
> fragments free space, which can affect how large files are allocated
> once you truncate/rm the original files.
>
>

We can try to prime this by allocating a lot of inodes up front, then 
removing them, so that this doesn't happen.

Hurray ext2.


* Re: sleeps and waits during io_submit
  2015-12-01 15:22               ` Avi Kivity
  2015-12-01 16:01                 ` Brian Foster
@ 2015-12-01 21:04                 ` Dave Chinner
  2015-12-01 21:10                   ` Glauber Costa
  2015-12-01 21:24                   ` Avi Kivity
  1 sibling, 2 replies; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 21:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs

On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> On 12/01/2015 04:56 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>  io_submit() can probably block in a variety of
> >>>places afaict... it might have to read in the inode extent map, allocate
> >>>blocks, take inode/ag locks, reserve log space for transactions, etc.
> >>Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> >>if somebody else has to do it.
> >>
> >I'm not following... if the fs needs to read in the inode extent map to
> >prepare for an allocation, what else can the thread do but wait? Are you
> >suggesting the request kick off whatever the blocking action happens to
> >be asynchronously and return with an error such that the request can be
> >retried later?
> 
> Not quite, it should be invisible to the caller.

I have a pony I can sell you.

> That is, the code called by io_submit()
> (file_operations::write_iter, it seems to be called today) can kick
> off this operation and have it continue from where it left off.

This is a problem that people have tried to solve in the past (e.g.
syslets, etc) where the thread executes until it has to block, and
then it's handed off to a worker thread/syslet to block and the
main process returns with EIOCBQUEUED.

Basically, you're asking for a real AIO infrastructure to
be introduced into the kernel, and I think that's beyond what us XFS
guys can do...

> >>>  Reducing the frequency of block allocation/frees might also be
> >>>another help (e.g., preallocate and reuse files,
> >>Isn't that discouraged for SSDs?
> >>
> >Perhaps, if you're referring to the fact that the blocks are never freed
> >and thus never discarded..? Are you running fstrim?
> 
> mount -o discard.  And yes, overwrites are supposedly more expensive
> than trim old data + allocate new data, but maybe if you compare it
> with the work XFS has to do, perhaps the tradeoff is bad.

Oh, you do realise that using "-o discard" causes significant delays
in journal commit processing? i.e. the journal commit completion
blocks until all the discards have been submitted and waited on
*synchronously*. This is a problem with the linux block layer in
that blkdev_issue_discard() is a synchronous operation.....

Hence if you are seeing delays in transactions (e.g. timestamp updates)
it's entirely possible that things will get much better if you
remove the discard mount option. It's much better from a performance
perspective to use the fstrim command every so often - fstrim issues
discard operations in the context of the fstrim process - it does
not interact with the transaction subsystem at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 21:04                 ` Dave Chinner
@ 2015-12-01 21:10                   ` Glauber Costa
  2015-12-01 21:39                     ` Dave Chinner
  2015-12-01 21:24                   ` Avi Kivity
  1 sibling, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2015-12-01 21:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Avi Kivity, Brian Foster, xfs

On Tue, Dec 1, 2015 at 4:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>> >On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>> >>>  io_submit() can probably block in a variety of
>> >>>places afaict... it might have to read in the inode extent map, allocate
>> >>>blocks, take inode/ag locks, reserve log space for transactions, etc.
>> >>Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>> >>if somebody else has to do it.
>> >>
>> >I'm not following... if the fs needs to read in the inode extent map to
>> >prepare for an allocation, what else can the thread do but wait? Are you
>> >suggesting the request kick off whatever the blocking action happens to
>> >be asynchronously and return with an error such that the request can be
>> >retried later?
>>
>> Not quite, it should be invisible to the caller.
>
> I have a pony I can sell you.
>
>> That is, the code called by io_submit()
>> (file_operations::write_iter, it seems to be called today) can kick
>> off this operation and have it continue from where it left off.
>
> This is a problem that people have tried to solve in the past (e.g.
> syslets, etc) where the thread executes until it has to block, and
> then it's handed off to a worker thread/syslet to block and the
> main process returns with EIOCBQUEUED.
>
> Basically, you're asking for a real AIO infrastructure to
> be introduced into the kernel, and I think that's beyond what us XFS
> guys can do...
>
>> >>>  Reducing the frequency of block allocation/frees might also be
>> >>>another help (e.g., preallocate and reuse files,
>> >>Isn't that discouraged for SSDs?
>> >>
>> >Perhaps, if you're referring to the fact that the blocks are never freed
>> >and thus never discarded..? Are you running fstrim?
>>
>> mount -o discard.  And yes, overwrites are supposedly more expensive
>> than trim old data + allocate new data, but maybe if you compare it
>> with the work XFS has to do, perhaps the tradeoff is bad.
>
> Oh, you do realise that using "-o discard" causes significant delays
> in journal commit processing? i.e. the journal commit completion
> blocks until all the discards have been submitted and waited on
> *synchronously*. This is a problem with the linux block layer in
> that blkdev_issue_discard() is a synchronous operation.....
>
> Hence if you are seeing delays in transactions (e.g. timestamp updates)
> it's entirely possible that things will get much better if you
> remove the discard mount option. It's much better from a performance
> perspective to use the fstrim command every so often - fstrim issues
> discard operations in the context of the fstrim process - it does
> not interact with the transaction subsystem at all.

Hi Dave,

This is news to me.

However, on the disk we used while acquiring this
trace, discard doesn't seem to be supported:
$ sudo fstrim /data/
fstrim: /data/: the discard operation is not supported

In that case, if I understand correctly the discard mount option
should be a noop, no?

That recommendation is great for our general case, though.


>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 19:07                           ` Avi Kivity
@ 2015-12-01 21:19                             ` Dave Chinner
  2015-12-01 21:38                               ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 21:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
> On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
> >Hi Avi,
> >
> >>>else is going to execute in our place until this thread can make
> >>>progress.
> >>For us, nothing else can execute in our place, we usually have exactly one
> >>thread per logical core.  So we are heavily dependent on io_submit not
> >>sleeping.
> >>
> >>The case of a contended lock is, to me, less worrying.  It can be reduced by
> >>using more allocation groups, which is apparently the shared resource under
> >>contention.
> >>
> >I apologize if I misread your previous comments, but, IIRC you said you can't
> >change the directory structure your application is using, and IIRC your
> >application does not spread files across several directories.
> 
> I miswrote somewhat: the application writes data files and commitlog
> files.  The data file directory structure is fixed due to
> compatibility concerns (it is not a single directory, but some
workloads will see most access on files in a single directory).  The
> commitlog directory structure is more relaxed, and we can split it
> to a directory per shard (=cpu) or something else.
> 
> If worst comes to worst, we'll hack around this and distribute the
> data files into more directories, and provide some hack for
> compatibility.
> 
> >XFS spread files across the allocation groups, based on the directory these
> >files are created,
> 
> Idea: create the files in some subdirectory, and immediately move
> them to their required location.

See xfs_fsr.

> 
> >  trying to keep files as close as possible from their
> >metadata.
> 
> This is pointless for an SSD. Perhaps XFS should randomize the ag on
> nonrotational media instead.

Actually, no, it is not pointless. SSDs do not require optimisation
for minimal seek time, but data locality is still just as important
as spinning disks, if not moreso. Why? Because the garbage
collection routines in the SSDs are all about locality and we can't
drive garbage collection effectively via discard operations if the
filesystem is not keeping temporally related files close together in
its block address space.

e.g. If the files in a directory are all close together, and the
directory is removed, we then leave a big empty contiguous region in
the filesystem free space map, and when we send discards over that
we end up with a single big trim and the drive handles that far more
effectively than lots of little trims (i.e. one per file) that the
drive cannot do anything useful with because they are all smaller
than the internal SSD page/block sizes and so get ignored.  This is
one of the reasons fstrim is so much more efficient and effective
than using the discard mount option.

And, well, XFS is designed to operate on storage devices made up of
more than one drive, so the way AGs are selected is designed to
give long term load balancing (both for space usage and
instantaneous performance). With the existing algorithms we've not
had any issues with SSD lifetimes, long term performance
degradation, etc, so there's no evidence that we actually need to
change the fundamental allocation algorithms specially for SSDs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 21:04                 ` Dave Chinner
  2015-12-01 21:10                   ` Glauber Costa
@ 2015-12-01 21:24                   ` Avi Kivity
  2015-12-01 21:31                     ` Glauber Costa
  1 sibling, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 21:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs

On 12/01/2015 11:04 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>>>   io_submit() can probably block in a variety of
>>>>> places afaict... it might have to read in the inode extent map, allocate
>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>>>> if somebody else has to do it.
>>>>
>>> I'm not following... if the fs needs to read in the inode extent map to
>>> prepare for an allocation, what else can the thread do but wait? Are you
>>> suggesting the request kick off whatever the blocking action happens to
>>> be asynchronously and return with an error such that the request can be
>>> retried later?
>> Not quite, it should be invisible to the caller.
> I have a pony I can sell you.

You already sold me a pony.

>> That is, the code called by io_submit()
>> (file_operations::write_iter, it seems to be called today) can kick
>> off this operation and have it continue from where it left off.
> This is a problem that people have tried to solve in the past (e.g.
> syslets, etc) where the thread executes until it has to block, and
> then it's handed off to a worker thread/syslet to block and the
> main process returns with EIOCBQUEUED.

Yes, I remember that.

> Basically, you're asking for a real AIO infrastructure to
> be introduced into the kernel, and I think that's beyond what us XFS
> guys can do...

Sure you can, Dave.  In fact you feel an irresistible urge to do it.

But I don't think the EIOCBQUEUED thing need be repeated.  We can have a 
simpler implementation:

  - Add a task flag TIF_AIO, which causes any new I/O to fail with 
EAIOWOULDBLOCK.

  - have __blockdev_direct_IO() do its block-mapping operations with 
TIF_AIO set (but remove it just before issuing the bio).

  - sys_aio_submit() catches EAIOWOULDBLOCK and resubmits the aio in a 
work item, this time without TIF_AIO games.

The effect would be similar to EIOCBQUEUED, but simpler, as instead of 
issuing any metadata I/O you abort the operation and restart it from 
scratch.
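
In pseudo-kernel-code (to be clear: TIF_AIO, EAIOWOULDBLOCK and the retry 
work item are all invented, and the real submit path is hand-waved into 
aio_run_iocb()):

/* Hypothetical: fail the block-mapping stage instead of sleeping, and
 * let io_submit() retry the whole iocb from a work item. */
static long aio_try_nonblocking(struct kiocb *req)
{
	long ret;

	set_thread_flag(TIF_AIO);	/* metadata I/O now fails fast */
	ret = aio_run_iocb(req);	/* the normal submission path */
	clear_thread_flag(TIF_AIO);

	if (ret == -EAIOWOULDBLOCK)
		queue_work(aio_retry_wq, &req->retry_work); /* redo, blocking ok */

	return ret;
}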

>
>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>> another help (e.g., preallocate and reuse files,
>>>> Isn't that discouraged for SSDs?
>>>>
>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>> and thus never discarded..? Are you running fstrim?
>> mount -o discard.  And yes, overwrites are supposedly more expensive
>> than trim old data + allocate new data, but maybe if you compare it
>> with the work XFS has to do, perhaps the tradeoff is bad.
> Oh, you do realise that using "-o discard" causes significant delays
> in journal commit processing? i.e. the journal commit completion
> blocks until all the discards have been submitted and waited on
> *synchronously*. This is a problem with the linux block layer in
> that blkdev_issue_discard() is a synchronous operation.....

I do now. What's the unicode for a crying face?

> Hence if you are seeing delays in transactions (e.g. timestamp updates)
> it's entirely possible that things will get much better if you
> remove the discard mount option. It's much better from a performance
> perspective to use the fstrim command every so often - fstrim issues
> discard operations in the context of the fstrim process - it does
> not interact with the transaction subsystem at all.
>
>

All right.  On the other hand we have to know when to issue it. That 
would be when nn% of the disk area have been rewritten.  Is there some 
counter I can poll every minute or so for this?  Not doing the fstrim in 
time would cause the disk performance to tank.
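
If we end up driving it ourselves, the trim call itself is simple; it is 
only the when, i.e. the counter, that is the open question (the trigger 
policy in this sketch is left to the application):

/* Issue the equivalent of fstrim from inside the application, against a
 * descriptor opened on the filesystem (e.g. the mount point directory). */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

static int trim_whole_fs(int mount_fd)
{
	struct fstrim_range range;

	memset(&range, 0, sizeof(range));
	range.start = 0;
	range.len = UINT64_MAX;		/* the entire filesystem */
	range.minlen = 0;		/* let the fs pick a minimum extent */

	return ioctl(mount_fd, FITRIM, &range);
}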


* Re: sleeps and waits during io_submit
  2015-12-01 21:24                   ` Avi Kivity
@ 2015-12-01 21:31                     ` Glauber Costa
  0 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2015-12-01 21:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, xfs

>
>>> That is, the code called by io_submit()
>>> (file_operations::write_iter, it seems to be called today) can kick
>>> off this operation and have it continue from where it left off.
>>
>> This is a problem that people have tried to solve in the past (e.g.
>> syslets, etc) where the thread executes until it has to block, and
>> then it's handed off to a worker thread/syslet to block and the
>> main process returns with EIOCBQUEUED.
>
>
> Yes, I remember that.
>
>> Basically, you're asking for a real AIO infrastructure to
>> be introduced into the kernel, and I think that's beyond what us XFS
>> guys can do...
>
>
> Sure you can, Dave.  In fact you feel an irresistible urge to do it.

What is that? Are you that anxious for the Star Wars premiere that you
are trying your very own jedi mind tricks??


> I do now. What's the unicode for a crying face?
>
>> Hence if you are seeing delays in transactions (e.g. timestamp updates)
>> it's entirely possible that things will get much better if you
>> remove the discard mount option. It's much better from a performance
>> perspective to use the fstrim command every so often - fstrim issues
>> discard operations in the context of the fstrim process - it does
>> not interact with the transaction subsystem at all.
>>
>>
>
> All right.  On the other hand we have to know when to issue it. That would
> be when nn% of the disk area have been rewritten.  Is there some counter I
> can poll every minute or so for this?  Not doing the fstrim in time would
> cause the disk performance to tank.

Note, as I said, that while this is a really good general
recommendation from down under, that was not likely to have had any
effect in the current trace - that disk does not support discard, and
I am assuming the mount option becomes a noop in this case.

>


* Re: sleeps and waits during io_submit
  2015-12-01 21:19                             ` Dave Chinner
@ 2015-12-01 21:38                               ` Avi Kivity
  2015-12-01 23:06                                 ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-01 21:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Glauber Costa, xfs

On 12/01/2015 11:19 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
>> On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
>>> Hi Avi,
>>>
>>>>> else is going to execute in our place until this thread can make
>>>>> progress.
>>>> For us, nothing else can execute in our place, we usually have exactly one
>>>> thread per logical core.  So we are heavily dependent on io_submit not
>>>> sleeping.
>>>>
>>>> The case of a contended lock is, to me, less worrying.  It can be reduced by
>>>> using more allocation groups, which is apparently the shared resource under
>>>> contention.
>>>>
>>> I apologize if I misread your previous comments, but, IIRC you said you can't
>>> change the directory structure your application is using, and IIRC your
>>> application does not spread files across several directories.
>> I miswrote somewhat: the application writes data files and commitlog
>> files.  The data file directory structure is fixed due to
>> compatibility concerns (it is not a single directory, but some
>> workloads will see most access on files in a single directory).  The
>> commitlog directory structure is more relaxed, and we can split it
>> to a directory per shard (=cpu) or something else.
>>
>> If worst comes to worst, we'll hack around this and distribute the
>> data files into more directories, and provide some hack for
>> compatibility.
>>
>>> XFS spread files across the allocation groups, based on the directory these
>>> files are created,
>> Idea: create the files in some subdirectory, and immediately move
>> them to their required location.
> See xfs_fsr.

Can you elaborate?  I don't see how it is applicable.

My hack involves creating the file in a random directory, and while it 
is still zero sized, move it to its final directory.  This is simply to 
defeat the ag selection heuristic.  No data is copied.
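
Concretely, something like the following (a sketch only; the scratch-%d 
directories are hypothetical and would be pre-created, one per AG we want 
to reach):

/* Create under a randomly chosen scratch directory so XFS allocates the
 * inode in that directory's AG, then rename the still-empty file into
 * its real location.  Only the directory entry moves; no data is copied. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int create_spread(const char *final_path, int nr_scratch_dirs)
{
	char tmp[PATH_MAX];
	int fd;

	snprintf(tmp, sizeof(tmp), "scratch-%d/tmp.%ld",
		 rand() % nr_scratch_dirs, (long)getpid());

	fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (fd < 0)
		return -1;

	if (rename(tmp, final_path) < 0) {	/* file is still zero length */
		close(fd);
		unlink(tmp);
		return -1;
	}
	return fd;
}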

>>>   trying to keep files as close as possible from their
>>> metadata.
>> This is pointless for an SSD. Perhaps XFS should randomize the ag on
>> nonrotational media instead.
> Actually, no, it is not pointless. SSDs do not require optimisation
> for minimal seek time, but data locality is still just as important
> as spinning disks, if not moreso. Why? Because the garbage
> collection routines in the SSDs are all about locality and we can't
> drive garbage collection effectively via discard operations if the
> filesystem is not keeping temporally related files close together in
> >its block address space.

In my case, files in the same directory are not temporally related. But 
I understand where the heuristic comes from.

Maybe an ioctl to set a directory attribute "the files in this directory 
are not temporally related"?

I imagine this will be useful for many server applications.

> e.g. If the files in a directory are all close together, and the
> directory is removed, we then leave a big empty contiguous region in
> the filesystem free space map, and when we send discards over that
> we end up with a single big trim and the drive handles that far more

Would this not be defeated if a directory that happens to share the 
allocation group gets populated simultaneously?

> effectively than lots of little trims (i.e. one per file) that the
> drive cannot do anything useful with because they are all smaller
> than the internal SSD page/block sizes and so get ignored.  This is
> one of the reasons fstrim is so much more efficient and effective
> than using the discard mount option.

In my use case, the files are fairly large, and there is constant 
rewriting (not in-place: files are read, merged, and written back). So 
I'm worried an fstrim can happen too late.

>
> And, well, XFS is designed to operate on storage devices made up of
> more than one drive, so the way AGs are selected is designed to
> give long term load balancing (both for space usage and
> instantaneous performance). With the existing algorithms we've not
> had any issues with SSD lifetimes, long term performance
> degradation, etc, so there's no evidence that we actually need to
> change the fundamental allocation algorithms specially for SSDs.
>

Ok.  Maybe the SSDs can deal with untrimmed overwrites efficiently, 
provided the io sizes are large enough.


* Re: sleeps and waits during io_submit
  2015-12-01 21:10                   ` Glauber Costa
@ 2015-12-01 21:39                     ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 21:39 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Avi Kivity, Brian Foster, xfs

On Tue, Dec 01, 2015 at 04:10:45PM -0500, Glauber Costa wrote:
> On Tue, Dec 1, 2015 at 4:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >> On 12/01/2015 04:56 PM, Brian Foster wrote:
> >> mount -o discard.  And yes, overwrites are supposedly more expensive
> >> than trim old data + allocate new data, but maybe if you compare it
> >> with the work XFS has to do, perhaps the tradeoff is bad.
> >
> > Oh, you do realise that using "-o discard" causes significant delays
> > in journal commit processing? i.e. the journal commit completion
> > blocks until all the discards have been submitted and waited on
> > *synchronously*. This is a problem with the linux block layer in
> > that blkdev_issue_discard() is a synchronous operation.....
> >
> > Hence if you are seeing delays in transactions (e.g. timestamp updates)
> > it's entirely possible that things will get much better if you
> > remove the discard mount option. It's much better from a performance
> > perspective to use the fstrim command every so often - fstrim issues
> > discard operations in the context of the fstrim process - it does
> > not interact with the transaction subsystem at all.
> 
> Hi Dave,
> 
> This is news to me.
> 
> However, in the disk that we have used during the acquisition of this
> trace, discard doesn't seem to be supported:
> $ sudo fstrim /data/
> fstrim: /data/: the discard operation is not supported
> 
> In that case, if I understand correctly the discard mount option
> should be a noop, no?

XFS still makes the blkdev_issue_discard() calls, though, because
the block device can turn discard support on and off dynamically.
e.g. raid devices where a faulty drive is replaced temporarily with
a drive that doesn't have discard support. The block device suddenly
starts returning -EOPNOTSUPP to the filesystem from
blkdev_issue_discard() calls. However, the admin then replaces that
drive with a new one that does have discard support, and now
blkdev_issue_discard() works as expected.

IOWs, if you set the mount option, XFS will always attempt to issue
discards...

> That recommendation is great for our general case, though.

For the moment. Given lots of time, reworking this code could
greatly reduce the impact/overhead of it and so make it practical to
enable. There's a lot of work to get to that point, though...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 21:38                               ` Avi Kivity
@ 2015-12-01 23:06                                 ` Dave Chinner
  2015-12-02  9:02                                   ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 23:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
> On 12/01/2015 11:19 PM, Dave Chinner wrote:
> >On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
> >>On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
> >>>Hi Avi,
> >>>
> >>>>>else is going to execute in our place until this thread can make
> >>>>>progress.
> >>>>For us, nothing else can execute in our place, we usually have exactly one
> >>>>thread per logical core.  So we are heavily dependent on io_submit not
> >>>>sleeping.
> >>>>
> >>>>The case of a contended lock is, to me, less worrying.  It can be reduced by
> >>>>using more allocation groups, which is apparently the shared resource under
> >>>>contention.
> >>>>
> >>>I apologize if I misread your previous comments, but, IIRC you said you can't
> >>>change the directory structure your application is using, and IIRC your
> >>>application does not spread files across several directories.
> >>I miswrote somewhat: the application writes data files and commitlog
> >>files.  The data file directory structure is fixed due to
> >>compatibility concerns (it is not a single directory, but some
> >>workloads will see most access on files in a single directory).  The
> >>commitlog directory structure is more relaxed, and we can split it
> >>to a directory per shard (=cpu) or something else.
> >>
> >>If worst comes to worst, we'll hack around this and distribute the
> >>data files into more directories, and provide some hack for
> >>compatibility.
> >>
> >>>XFS spread files across the allocation groups, based on the directory these
> >>>files are created,
> >>Idea: create the files in some subdirectory, and immediately move
> >>them to their required location.
> >See xfs_fsr.
> 
> Can you elaborate?  I don't see how it is applicable.

Just pointing out that this is what xfs_fsr does to control locality
of allocation for files it is defragmenting. Except that rather than
moving files, it uses XFS_IOC_SWAPEXT to switch the data between two
inodes atomically...

> My hack involves creating the file in a random directory, and while
> it is still zero sized, move it to its final directory.  This is
> simply to defeat the ag selection heuristic. 

Which you really don't want to do.

> >>>  trying to keep files as close as possible from their
> >>>metadata.
> >>This is pointless for an SSD. Perhaps XFS should randomize the ag on
> >>nonrotational media instead.
> >Actually, no, it is not pointless. SSDs do not require optimisation
> >for minimal seek time, but data locality is still just as important
> >as spinning disks, if not moreso. Why? Because the garbage
> >collection routines in the SSDs are all about locality and we can't
> >drive garbage collection effectively via discard operations if the
> >filesystem is not keeping temporally related files close together in
> >its block address space.
> 
> In my case, files in the same directory are not temporally related.
> But I understand where the heuristic comes from.
> 
> Maybe an ioctl to set a directory attribute "the files in this
> directory are not temporally related"?

And exactly what does that gain us? Exactly what problem are you
trying to solve by manipulating file locality that can't be solved
by existing knobs and config options?

Perhaps you'd like to read up on how the inode32 allocator behaves?

> >e.g. If the files in a directory are all close together, and the
> >directory is removed, we then leave a big empty contiguous region in
> >the filesystem free space map, and when we send discards over that
> >we end up with a single big trim and the drive handles that far more
> 
> Would this not be defeated if a directory that happens to share the
> allocation group gets populated simultaneously?

Sure. But this sort of thing is rare in the real world, and when
it does occur, it generally only takes small tweaks to algorithms
and layouts to make it go away.  I don't care to bikeshed about
theoretical problems - I'm in the business of finding the root cause
of the problems users are having and solving those problems. So far
what you've given us is a ball of "there's blocking in AIO
submission", and the only one that is clear cut is the timestamp
update.

Go back and categorise the types of blocking that you are seeing -
whether it be on the AGIs during inode manipulation, on the AGFs
because of concurrent extent allocation, on log forces because of
slow discards in transaction completion, on the transaction
subsystem because of a lack of log space for concurrent
reservations, etc. And then determine if changing the layout of the
filesystem (e.g. number of AGs, size of log, etc) and different
mount options (e.g. turning off discard, using inode32 allocator,
etc) make any difference to the blocking issues you are seeing.

Once we know which of the different algorithms is causing the
blocking issues, we'll know a lot more about why we're having
problems and a better idea of what problems we actually need to
solve. 

> >effectively than lots of little trims (i.e. one per file) that the
> >drive cannot do anything useful with because they are all smaller
> >than the internal SSD page/block sizes and so get ignored.  This is
> >one of the reasons fstrim is so much more efficient and effective
> >than using the discard mount option.
> 
> In my use case, the files are fairly large, and there is constant
> rewriting (not in-place: files are read, merged, and written back).
> So I'm worried an fstrim can happen too late.

Have you measured the SSD performance degradation over time due to
large overwrites? If not, then again it is a good chance you are
trying to solve a theoretical problem rather than a real problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 20:56                 ` Avi Kivity
@ 2015-12-01 23:41                   ` Dave Chinner
  2015-12-02  8:23                     ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-01 23:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs

On Tue, Dec 01, 2015 at 10:56:01PM +0200, Avi Kivity wrote:
> On 12/01/2015 10:45 PM, Dave Chinner wrote:
> >On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
> >The difference is an allocation can block waiting on IO, and the
> >CPU can then go off and run another process, which then tries to do
> >an allocation. So you might only have 4 CPUs, but a workload that
> >can have a hundred active allocations at once (not uncommon in
> >file server workloads).
> 
> But for us, probably not much more.  We try to restrict active I/Os
> to the effective disk queue depth (more than that and they just turn
> sour waiting in the disk queue).
> 
> 
> >On workloads that are roughly 1 process per CPU, it's typical that
> >agcount = 2 * N cpus gives pretty good results on large filesystems.
> 
> This is probably using sync calls.  Using async calls you can have
> many more I/Os in progress (but still limited by effective disk
> queue depth).

Ah, no. Even with async IO you don't want unbound allocation
concurrency. The allocation algorithms rely on having contiguous
free space extents that are much larger than the allocations being
> >done to work effectively and minimise file fragmentation. If you
chop the filesystem up into lots of small AGs, then it accelerates
the rate at which the free space gets chopped up into smaller
extents and performance then suffers. It's the same problem as
running a large filesystem near ENOSPC for an extended period of
time, which again is something we most definitely don't recommend
you do in production systems.

> >If you've got 400GB filesystems or you are using spinning disks,
> >then you probably don't want to go above 16 AGs, because then you
> >have problems with maintaining contiguous free space and you'll
> >seek the spinning disks to death....
> 
> We're concentrating on SSDs for now.

Sure, so "problems with maintaining contiguous free space" is what
you need to be concerned about.

> >>>>'mount -o ikeep,'
> >>>
> >>>Interesting.  Our files are large so we could try this.
> >Keep in mind that ikeep means that inode allocation permanently
> >fragments free space, which can affect how large files are allocated
> >once you truncate/rm the original files.
> 
> We can try to prime this by allocating a lot of inodes up front,
> then removing them, so that this doesn't happen.

Again - what problem have you measured that inode preallocation will
solves in your application? Don't make changes just because you
*think* it will fix what you *think* is a problem. Measure, analyse,
solve, in that order.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 19:26                           ` Avi Kivity
  2015-12-01 19:41                             ` Christoph Hellwig
@ 2015-12-02  0:13                             ` Brian Foster
  2015-12-02  0:57                               ` Dave Chinner
  2015-12-02  8:34                               ` Avi Kivity
  1 sibling, 2 replies; 58+ messages in thread
From: Brian Foster @ 2015-12-02  0:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
> On 12/01/2015 08:51 PM, Brian Foster wrote:
> >On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
> >>
> >>On 12/01/2015 06:29 PM, Brian Foster wrote:
> >>>On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 06:01 PM, Brian Foster wrote:
> >>>>>On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
> >>>>>>On 12/01/2015 04:56 PM, Brian Foster wrote:
> >>>>>>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
> >>>>>>>>On 12/01/2015 03:11 PM, Brian Foster wrote:
> >>>>>>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
> >>>>>>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote:
> >>>>>>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
> >>>>>>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote:
...
> >>The case of waiting for I/O is much more worrying, because I/O latencies are
> >>much higher.  But it seems like most of the DIO path does not trigger
> >>locking around I/O (and we are careful to avoid the ones that do, like
> >>writing beyond eof).
> >>
> >>(sorry for repeating myself, I have the feeling we are talking past each
> >>other and want to be on the same page)
> >>
> >Yeah, my point is just that just because the thread blocked on I/O,
> >doesn't mean the cpu can't carry on with some useful work for another
> >task.
> 
> In our case, there is no other task.  We run one thread per logical core, so
> if that thread gets blocked, the cpu idles.
> 
> The whole point of io_submit() is to issue an I/O and let the caller
> continue processing immediately.  It is the equivalent of O_NONBLOCK for
> networking code.  If O_NONBLOCK did block from time to time, practically all
> modern network applications would see a huge performance drop.
> 

Ok, but my understanding is that O_NONBLOCK would return an error code
in the blocking case such that userspace can do something else or retry
from a blockable context. I think this is similar to what hch posted wrt
to the pwrite2() bits for nonblocking buffered I/O or what I was asking
about earlier on with regard to returning an error if some blocking
would otherwise occur.

> >
> >>>>>  We submit an I/O which is
> >>>>>asynchronous in nature and wait on a completion, which causes the cpu to
> >>>>>schedule and execute another task until the completion is set by I/O
> >>>>>completion (via an async callback). At that point, the issuing thread
> >>>>>continues where it left off. I suspect I'm missing something... can you
> >>>>>elaborate on what you'd do differently here (and how it helps)?
> >>>>Just apply the same technique everywhere: convert locks to trylock +
> >>>>schedule a continuation on failure.
> >>>>
> >>>I'm certainly not an expert on the kernel scheduling, locking and
> >>>serialization mechanisms, but my understanding is that most things
> >>>outside of spin locks are reschedule points. For example, the
> >>>wait_for_completion() calls XFS uses to wait on I/O boil down to
> >>>schedule_timeout() calls. Buffer locks are implemented as semaphores and
> >>>down() can end up in the same place.
> >>But, for the most part, XFS seems to be able to avoid sleeping.  The call to
> >>__blockdev_direct_IO only launches the I/O, so any locking is only around
> >>cpu operations and, unless there is contention, won't cause us to sleep in
> >>io_submit().
> >>
> >>Trying to follow the code, it looks like xfs_get_blocks_direct (and
> >>__blockdev_direct_IO's get_block parameter in general) is synchronous, so
> >>we're just lucky to have everything in cache.  If it isn't, we block right
> >>there.  I really hope I'm misreading this and some other magic is happening
> >>elsewhere instead of this.
> >>
> >Nope, it's synchronous from a code perspective. The
> >xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
> >inode bmap metadata if it hasn't been done already. Note that this
> >should only happen once as everything is stored in-core, so in most
> >cases this is skipped. It's also possible extents are read in via some
> >other path/operation on the inode before an async I/O happens to be
> >submitted (e.g., see some of the other xfs_bmapi_read() callers).
> 
> Is there (could we add) some ioctl to prime this cache?  We could call it
> from a worker thread where we don't mind blocking during open.
> 

I suppose that's possible, or the worker thread could perform some
existing operation known to prime the cache. I don't think it's worth
getting into without a concrete example, however. The extent read
example we're batting around might not ever be a problem (as you've
noted due to file size), if files are truncated and recycled, for
example.

> What is the eviction policy for this cache?   Is it simply the block
> device's page cache?
> 

IIUC the extent list stays around until the inode is reclaimed. There's
a separate buffer cache for metadata buffers. Both types of objects
would be reclaimed based on memory pressure.

> What about the write path, will we see the same problems there?  I would
> guess the problem is less severe there if the metadata is written with
> writeback policy.
> 

Metadata is modified in-core and handed off to the logging
infrastructure via a transaction. The log is flushed to disk some time
later and metadata writeback occurs asynchronously via the xfsaild
thread.

Brian

> >
> >Either way, the extents have to be read in at some point and I'd expect
> >that cpu to schedule onto some other task while that thread waits on I/O
> >to complete (read-ahead could also be a factor here, but I haven't
> >really dug into how that is triggered for buffers).
> 
> To provide an example, our application, which is a database, faces this
> problem exact at a higher level.  Data is stored in data files, and data
> items' locations are stored in index files. When we read a bit of data, we
> issue an index read, and pass it a continuation to be executed when the read
> completes.  This latter continuation parses the data and passes it to the
> code that prepares it for merging with data from other data files, and an
> eventual return to the user.
> 
> Having written code for over a year in this style, I've come to expect it to
> be used everywhere asynchronous I/O is used, but I realize it is fairly hard
> without good support from a framework that allows continuations to be
> composed in a natural way.
> 
> 


* Re: sleeps and waits during io_submit
  2015-12-02  0:13                             ` Brian Foster
@ 2015-12-02  0:57                               ` Dave Chinner
  2015-12-02  8:38                                 ` Avi Kivity
  2015-12-02  8:34                               ` Avi Kivity
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-02  0:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: Avi Kivity, Glauber Costa, xfs

On Tue, Dec 01, 2015 at 07:13:29PM -0500, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
> > On 12/01/2015 08:51 PM, Brian Foster wrote:
> > >On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
> > >Nope, it's synchronous from a code perspective. The
> > >xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
> > >inode bmap metadata if it hasn't been done already. Note that this
> > >should only happen once as everything is stored in-core, so in most
> > >cases this is skipped. It's also possible extents are read in via some
> > >other path/operation on the inode before an async I/O happens to be
> > >submitted (e.g., see some of the other xfs_bmapi_read() callers).
> > 
> > Is there (could we add) some ioctl to prime this cache?  We could call it
> > from a worker thread where we don't mind blocking during open.
> > 
> 
> I suppose that's possible, or the worker thread could perform some
> existing operation known to prime the cache. I don't think it's worth
> getting into without a concrete example, however.

You mean like EXT4_IOC_PRECACHE_EXTENTS?

You know, that ioctl that the ext4 googlers needed to add because
they already had AIO applications that depend on it and they hadn't
realised that they could do exactly the same thing with a FIEMAP
call? i.e. this call to count the number of extents in the file:

	struct fiemap fm = {
		.fm_start = 0,
		.fm_length = FIEMAP_MAX_OFFSET,
	};

	res = ioctl(fd, FS_IOC_FIEMAP, &fm);

will cause XFS to read in the extent map and cache it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-01 23:41                   ` Dave Chinner
@ 2015-12-02  8:23                     ` Avi Kivity
  0 siblings, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-02  8:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs

On 12/02/2015 01:41 AM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 10:56:01PM +0200, Avi Kivity wrote:
>> On 12/01/2015 10:45 PM, Dave Chinner wrote:
>>> On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
>>> The difference is an allocation can block waiting on IO, and the
>>> CPU can then go off and run another process, which then tries to do
>>> an allocation. So you might only have 4 CPUs, but a workload that
>>> can have a hundred active allocations at once (not uncommon in
>>> file server workloads).
>> But for us, probably not much more.  We try to restrict active I/Os
>> to the effective disk queue depth (more than that and they just turn
>> sour waiting in the disk queue).
>>
>>
>>> On workloads that are roughly 1 process per CPU, it's typical that
>>> agcount = 2 * N cpus gives pretty good results on large filesystems.
>> This is probably using sync calls.  Using async calls you can have
>> many more I/Os in progress (but still limited by effective disk
>> queue depth).
> Ah, no. Even with async IO you don't want unbound allocation
> concurrency.

Unbound, certainly not.

But if my disk wants 100 concurrent operations to deliver maximum 
bandwidth, and XFS wants fewer concurrent allocations to satisfy some 
internal constraint, then I can't satisfy both.

To be fair, the number 100 was measured for 4k reads.  It's sure to be 
much lower for 128k writes, and since we set an extent size hint of 1MB, 
only 1/8th of those will be allocating.  So I expect things to work in 
practice, at least with the current generation of disks. Unfortunately 
disk bandwidth is growing faster than latency is improving, which means 
that the effective concurrency is increasing.

>   The allocation algorithms rely on having contiguous
> free space extents that are much larger than the allocations being
> done to work effeectively and minimise file fragmentation. If you
> chop the filesystem up into lots of small AGs, then it accelerates
> the rate at which the free space gets chopped up into smaller
> extents and performance then suffers. It's the same problem as
> running a large filesystem near ENOSPC for an extended period of
> time, which again is something we most definitely don't recommend
> you do in production systems.

I understand.  I guess it makes ag randomization even more important, 
for our use case.

What happens when an ag fills up?  Can a file overflow to another ag?

>
>>> If you've got 400GB filesystems or you are using spinning disks,
>>> then you probably don't want to go above 16 AGs, because then you
>>> have problems with maintaining contiguous free space and you'll
>>> seek the spinning disks to death....
>> We're concentrating on SSDs for now.
> Sure, so "problems with maintaining contiguous free space" is what
> you need to be concerned about.

Right.  Luckily our allocation patterns are very friendly towards that.  
We have append-only files that grow rapidly, then are immutable for a 
time, then are deleted. (It is a log-structured database so a natural fit 
for SSDs).

We can increase our extent size hint if it will help the SSD any.
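For reference, something like the following sketch is what I have in mind
for setting the hint per file; it assumes the XFS_IOC_FSGETXATTR /
XFS_IOC_FSSETXATTR ioctls from <xfs/xfs_fs.h> and is untested:

	#include <sys/ioctl.h>
	#include <xfs/xfs_fs.h>

	/* Sketch: set a per-file extent size hint (in bytes, a multiple of
	 * the fs block size) on a file that ideally has no extents yet. */
	static int set_extsize_hint(int fd, unsigned int bytes)
	{
		struct fsxattr fsx;

		if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
			return -1;
		fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;	/* honour fsx_extsize */
		fsx.fsx_extsize = bytes;
		return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
	}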

>
>>>>>> 'mount -o ikeep,'
>>>>> Interesting.  Our files are large so we could try this.
>>> Keep in mind that ikeep means that inode allocation permanently
>>> fragments free space, which can affect how large files are allocated
>>> once you truncate/rm the original files.
>> We can try to prime this by allocating a lot of inodes up front,
>> then removing them, so that this doesn't happen.
> Again - what problem have you measured that inode preallocation will
> solves in your application? Don't make changes just because you
> *think* it will fix what you *think* is a problem. Measure, analyse,
> solve, in that order.

We are now investigating what we can do to fix the problem, we aren't 
committing to any solution yet.  Certainly we plan to be certain of what 
the problem is before we fix it.

Up until a few days ago we never saw any blocks with XFS, and were very 
happy -- but that was with 90us, 450k IOPS disks.  With the slower 
disks, accessed through a certain hypervisor, we do see XFS block, and 
it is very worrying.


* Re: sleeps and waits during io_submit
  2015-12-02  0:13                             ` Brian Foster
  2015-12-02  0:57                               ` Dave Chinner
@ 2015-12-02  8:34                               ` Avi Kivity
  2015-12-08  6:03                                 ` Dave Chinner
  1 sibling, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-02  8:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: Glauber Costa, xfs



On 12/02/2015 02:13 AM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
>> On 12/01/2015 08:51 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 06:29 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
>>>>>> On 12/01/2015 06:01 PM, Brian Foster wrote:
>>>>>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>>>>>>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>>>>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>>>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
> ...
>>>> The case of waiting for I/O is much more worrying, because I/O latencies are
>>>> much higher.  But it seems like most of the DIO path does not trigger
>>>> locking around I/O (and we are careful to avoid the ones that do, like
>>>> writing beyond eof).
>>>>
>>>> (sorry for repeating myself, I have the feeling we are talking past each
>>>> other and want to be on the same page)
>>>>
>>> Yeah, my point is just that just because the thread blocked on I/O,
>>> doesn't mean the cpu can't carry on with some useful work for another
>>> task.
>> In our case, there is no other task.  We run one thread per logical core, so
>> if that thread gets blocked, the cpu idles.
>>
>> The whole point of io_submit() is to issue an I/O and let the caller
>> continue processing immediately.  It is the equivalent of O_NONBLOCK for
>> networking code.  If O_NONBLOCK did block from time to time, practically all
>> modern network applications would see a huge performance drop.
>>
> Ok, but my understanding is that O_NONBLOCK would return an error code
> in the blocking case such that userspace can do something else or retry
> from a blockable context.

I did not mean the exact equivalent, but in the spirit of allowing a 
thread to perform an I/O task (networking or file I/O) in parallel with 
computation.

For networking, returning an error is fine because there exists a 
notification (epoll) to tell userspace when a retry would succeed. For 
file I/O, there isn't one.  Still, returning an error is better than 
nothing because then, as you say, you can retry in a blockable context.
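To make that concrete, the pattern we would want looks roughly like the
sketch below.  It is written against a pwritev2()-style call with an
RWF_NOWAIT flag purely for illustration; the actual interface hch is
proposing may well look different:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sys/uio.h>

	/* Sketch: try the write without blocking; if the kernel says it
	 * would block, tell the caller to resubmit from a thread that is
	 * allowed to sleep. */
	static int try_nonblocking_write(int fd, const struct iovec *iov,
					 int iovcnt, off_t off)
	{
		ssize_t n = pwritev2(fd, iov, iovcnt, off, RWF_NOWAIT);

		if (n >= 0)
			return 0;	/* completed without blocking */
		if (errno == EAGAIN)
			return 1;	/* would block: punt to a worker thread */
		return -1;		/* real error */
	}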

>   I think this is similar to what hch posted wrt
> to the pwrite2() bits for nonblocking buffered I/O or what I was asking
> about earlier on with regard to returning an error if some blocking
> would otherwise occur.

Yes.  Anything except silently blocking!

>
>>>>>>>   We submit an I/O which is
>>>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>>>> schedule and execute another task until the completion is set by I/O
>>>>>>> completion (via an async callback). At that point, the issuing thread
>>>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>>>> schedule a continuation on failure.
>>>>>>
>>>>> I'm certainly not an expert on the kernel scheduling, locking and
>>>>> serialization mechanisms, but my understanding is that most things
>>>>> outside of spin locks are reschedule points. For example, the
>>>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>>>> down() can end up in the same place.
>>>> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
>>>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>>>> cpu operations and, unless there is contention, won't cause us to sleep in
>>>> io_submit().
>>>>
>>>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>>>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>>>> we're just lucky to have everything in cache.  If it isn't, we block right
>>>> there.  I really hope I'm misreading this and some other magic is happening
>>>> elsewhere instead of this.
>>>>
>>> Nope, it's synchronous from a code perspective. The
>>> xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
>>> inode bmap metadata if it hasn't been done already. Note that this
>>> should only happen once as everything is stored in-core, so in most
>>> cases this is skipped. It's also possible extents are read in via some
>>> other path/operation on the inode before an async I/O happens to be
>>> submitted (e.g., see some of the other xfs_bmapi_read() callers).
>> Is there (could we add) some ioctl to prime this cache?  We could call it
>> from a worker thread where we don't mind blocking during open.
>>
> I suppose that's possible, or the worker thread could perform some
> existing operation known to prime the cache. I don't think it's worth
> getting into without a concrete example, however. The extent read
> example we're batting around might not ever be a problem (as you've
> noted due to file size), if files are truncated and recycled, for
> example.
>
>> What is the eviction policy for this cache?   Is it simply the block
>> device's page cache?
>>
> IIUC the extent list stays around until the inode is reclaimed. There's
> a separate buffer cache for metadata buffers. Both types of objects
> would be reclaimed based on memory pressure.

It comes down to size of disk, size of memory, and average file size.  I 
expect that with current disk and memory sizes the metadata is quite 
small, so this might not be a problem, and even a cold start would 
self-prime in a reasonably short time.

>
>> What about the write path, will we see the same problems there?  I would
>> guess the problem is less severe there if the metadata is written with
>> writeback policy.
>>
> Metadata is modified in-core and handed off to the logging
> infrastructure via a transaction. The log is flushed to disk some time
> later and metadata writeback occurs asynchronously via the xfsaild
> thread.

Unless, I expect, the log is full.  Since we're hammering on the disk 
quite heavily, the log would be fighting with user I/O and possibly losing.

Does XFS throttle user I/O in order to get the log buffers recycled faster?

Is there any way for us to keep track of it, and reduce disk pressure 
when it gets full?

Oh you answered that already, /sys/fs/xfs/device/log/*.
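If it helps, a rough sketch of polling whatever counters the kernel exposes
there; the exact file names vary by kernel version and the directory is
named after the backing device, so treat this as illustrative only:

	#include <dirent.h>
	#include <stdio.h>

	/* Sketch: print every counter file under /sys/fs/xfs/<dev>/log/,
	 * e.g. dev = "sda1". */
	static void dump_xfs_log_state(const char *dev)
	{
		char path[256], file[512], line[128];
		struct dirent *de;
		DIR *d;

		snprintf(path, sizeof(path), "/sys/fs/xfs/%s/log", dev);
		d = opendir(path);
		if (!d)
			return;
		while ((de = readdir(d)) != NULL) {
			if (de->d_name[0] == '.')
				continue;
			snprintf(file, sizeof(file), "%s/%s", path, de->d_name);
			FILE *f = fopen(file, "r");
			if (f && fgets(line, sizeof(line), f))
				printf("%s: %s", de->d_name, line);
			if (f)
				fclose(f);
		}
		closedir(d);
	}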

>
> Brian
>
>>> Either way, the extents have to be read in at some point and I'd expect
>>> that cpu to schedule onto some other task while that thread waits on I/O
>>> to complete (read-ahead could also be a factor here, but I haven't
>>> really dug into how that is triggered for buffers).
>> To provide an example, our application, which is a database, faces this
>> exact problem at a higher level.  Data is stored in data files, and data
>> items' locations are stored in index files. When we read a bit of data, we
>> issue an index read, and pass it a continuation to be executed when the read
>> completes.  This latter continuation parses the data and passes it to the
>> code that prepares it for merging with data from other data files, and an
>> eventual return to the user.
>>
>> Having written code for over a year in this style, I've come to expect it to
>> be used everywhere asynchronous I/O is used, but I realize it is fairly hard
>> without good support from a framework that allows continuations to be
>> composed in a natural way.
>>
>>


* Re: sleeps and waits during io_submit
  2015-12-02  0:57                               ` Dave Chinner
@ 2015-12-02  8:38                                 ` Avi Kivity
  0 siblings, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-02  8:38 UTC (permalink / raw)
  To: Dave Chinner, Brian Foster; +Cc: Glauber Costa, xfs



On 12/02/2015 02:57 AM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 07:13:29PM -0500, Brian Foster wrote:
>> On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
>>> On 12/01/2015 08:51 PM, Brian Foster wrote:
>>>> On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
>>>> Nope, it's synchronous from a code perspective. The
>>>> xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
>>>> inode bmap metadata if it hasn't been done already. Note that this
>>>> should only happen once as everything is stored in-core, so in most
>>>> cases this is skipped. It's also possible extents are read in via some
>>>> other path/operation on the inode before an async I/O happens to be
>>>> submitted (e.g., see some of the other xfs_bmapi_read() callers).
>>> Is there (could we add) some ioctl to prime this cache?  We could call it
>>> from a worker thread where we don't mind blocking during open.
>>>
>> I suppose that's possible, or the worker thread could perform some
>> existing operation known to prime the cache. I don't think it's worth
>> getting into without a concrete example, however.
> You mean like EXT4_IOC_PRECACHE_EXTENTS?
>
> You know, that ioctl that the ext4 googlers needed to add because
> they already had AIO applications that depend on it and they hadn't
> realised that the could do exactly the same thing with a FIEMAP
> call? i.e. this call to count the number of extents in the file:
>
> 	struct fiemap fm = {
> 		.fm_start = 0,
> 		.fm_length = FIEMAP_MAX_OFFSET,
> 	};
>
> 	res = ioctl(fd, FS_IOC_FIEMAP, &fm);
>
> will cause XFS to read in the extent map and cache it.
>

Cool, it even appears to be callable with CAP_WHATEVER.  So we would use 
this to prime the metadata caches before startup, if they turn out to be 
a problem in practice.
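For the archive, a self-contained version of that priming call might look
like the sketch below (field names as in <linux/fiemap.h>; untested):

	#include <sys/ioctl.h>
	#include <linux/fs.h>
	#include <linux/fiemap.h>

	/* Sketch: walk the whole extent map without copying any extents out;
	 * as a side effect XFS pulls the inode's bmap metadata into memory. */
	static int prime_extent_cache(int fd)
	{
		struct fiemap fm = {
			.fm_start = 0,
			.fm_length = FIEMAP_MAX_OFFSET,
			.fm_extent_count = 0,	/* count only, no fm_extents[] copy */
		};

		return ioctl(fd, FS_IOC_FIEMAP, &fm);
	}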


* Re: sleeps and waits during io_submit
  2015-12-01 23:06                                 ` Dave Chinner
@ 2015-12-02  9:02                                   ` Avi Kivity
  2015-12-02 12:57                                     ` Carlos Maiolino
  2015-12-02 23:19                                     ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-02  9:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Glauber Costa, xfs



On 12/02/2015 01:06 AM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
>> On 12/01/2015 11:19 PM, Dave Chinner wrote:
>>> On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
>>>>> Hi Avi,
>>>>>
>>>>>>> else is going to execute in our place until this thread can make
>>>>>>> progress.
>>>>>> For us, nothing else can execute in our place, we usually have exactly one
>>>>>> thread per logical core.  So we are heavily dependent on io_submit not
>>>>>> sleeping.
>>>>>>
>>>>>> The case of a contended lock is, to me, less worrying.  It can be reduced by
>>>>>> using more allocation groups, which is apparently the shared resource under
>>>>>> contention.
>>>>>>
>>>>> I apologize if I misread your previous comments, but, IIRC you said you can't
>>>>> change the directory structure your application is using, and IIRC your
>>>>> application does not spread files across several directories.
>>>> I miswrote somewhat: the application writes data files and commitlog
>>>> files.  The data file directory structure is fixed due to
>>>> compatibility concerns (it is not a single directory, but some
> >>>> workloads will see most access on files in a single directory).  The
>>>> commitlog directory structure is more relaxed, and we can split it
>>>> to a directory per shard (=cpu) or something else.
>>>>
>>>> If worst comes to worst, we'll hack around this and distribute the
>>>> data files into more directories, and provide some hack for
>>>> compatibility.
>>>>
>>>>> XFS spread files across the allocation groups, based on the directory these
>>>>> files are created,
>>>> Idea: create the files in some subdirectory, and immediately move
>>>> them to their required location.
>>> See xfs_fsr.
>> Can you elaborate?  I don't see how it is applicable.
> Just pointing out that this is what xfs_fsr does to control locality
> of allocation for files it is defragmenting. Except that rather than
> moving files, it uses XFS_IOC_SWAPEXT to switch the data between two
> inodes atomically...

Ok, thanks.

>
>> My hack involves creating the file in a random directory, and while
>> it is still zero sized, move it to its final directory.  This is
>> simply to defeat the ag selection heuristic.
> Which you really don't want to do.

Why not?  For my directory structure, files in the same directory do not 
share temporal locality.  What does the ag selection heuristic give me?
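For concreteness, the hack would be something like the sketch below; the
scratch-directory list and naming are made up, and as you point out it
deliberately subverts the locality heuristic, so it is illustration rather
than a recommendation:

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/* Sketch of the hack: create the file under a randomly chosen scratch
	 * directory (so the inode lands in an arbitrary AG), then rename it to
	 * its real path while it is still zero-sized, before any data extents
	 * have been allocated near that inode. */
	static int create_spread(const char **scratch_dirs, int ndirs,
				 const char *final_path)
	{
		char tmp[512];
		int fd;

		snprintf(tmp, sizeof(tmp), "%s/.tmp.%d",
			 scratch_dirs[rand() % ndirs], getpid());
		fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
		if (fd < 0)
			return -1;
		if (rename(tmp, final_path) < 0) {
			close(fd);
			unlink(tmp);
			return -1;
		}
		return fd;
	}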



>
>>>>>   trying to keep files as close as possible from their
>>>>> metadata.
>>>> This is pointless for an SSD. Perhaps XFS should randomize the ag on
>>>> nonrotational media instead.
>>> Actually, no, it is not pointless. SSDs do not require optimisation
>>> for minimal seek time, but data locality is still just as important
>>> as spinning disks, if not moreso. Why? Because the garbage
>>> collection routines in the SSDs are all about locality and we can't
>>> drive garbage collection effectively via discard operations if the
>>> filesystem is not keeping temporally related files close together in
>>>> its block address space.
>> In my case, files in the same directory are not temporally related.
>> But I understand where the heuristic comes from.
>>
>> Maybe an ioctl to set a directory attribute "the files in this
>> directory are not temporally related"?
> And exactly what does that gain us?

I have a directory with commitlog files that are constantly and rapidly 
being created, appended to, and removed, from all logical cores in the 
system.  Does this not put pressure on that allocation group's locks?

> Exactly what problem are you
> trying to solve by manipulating file locality that can't be solved
> by existing knobs and config options?

I admit I don't know much about the existing knobs and config options.  
Pointers are appreciated.


>
> Perhaps you'd like to read up on how the inode32 allocator behaves?

Indeed I would, pointers are appreciated.

>
>>> e.g. If the files in a directory are all close together, and the
>>> directory is removed, we then leave a big empty contiguous region in
>>> the filesystem free space map, and when we send discards over that
>>> we end up with a single big trim and the drive handles that far more
>> Would this not be defeated if a directory that happens to share the
>> allocation group gets populated simultaneously?
> Sure. But this sort of thing is rare in the real world, and when
> they do occur, it generally only takes small tweaks to algorithms
> >and layouts to make them go away.  I don't care to bikeshed about
> theoretical problems - I'm in the business of finding the root cause
> of the problems users are having and solving those problems. So far
> what you've given us is a ball of "there's blocking in AIO
> submission", and the only one that is clear cut is the timestamp
> update.
>
> Go back and categorise the types of blocking that you are seeing -
> whether it be on the AGIs during inode manipulation, on the AGFs
> because of concurrent extent allocation, on log forces because of
> slow discards in transaction completion, on the transaction
> subsystem because of a lack of log space for concurrent
> reservations, etc. And then determine if changing the layout of the
> filesystem (e.g. number of AGs, size of log, etc) and different
> mount options (e.g. turning off discard, using inode32 allocator,
> etc) make any difference to the blocking issues you are seeing.
>
> Once we know which of the different algorithms is causing the
> blocking issues, we'll know a lot more about why we're having
> problems and a better idea of what problems we actually need to
> solve.

I'm happy to hack off the lowest hanging fruit and then go after the 
next one.  I understand you're annoyed at having to defend against what 
may be non-problems; but for me it is an opportunity to learn about the 
file system.  For us it is the weakest spot in our system, because on 
the one hand we heavily depend on async behavior and on the other hand 
Linux is notoriously bad at it.  So we are very nervous when blocking 
happens.

>
>>> effectively than lots of little trims (i.e. one per file) that the
>>> drive cannot do anything useful with because they are all smaller
>>> than the internal SSD page/block sizes and so get ignored.  This is
>>> one of the reasons fstrim is so much more efficient and effective
>>> than using the discard mount option.
>> In my use case, the files are fairly large, and there is constant
>> rewriting (not in-place: files are read, merged, and written back).
>> So I'm worried an fstrim can happen too late.
> Have you measured the SSD performance degradation over time due to
> large overwrites? If not, then again it is a good chance you are
> trying to solve a theoretical problem rather than a real problem....
>

I'm not worried about that (maybe I should be) but about the SSD 
reaching internal ENOSPC due to the fstrim happening too late.

Consider this scenario, which is quite typical for us:

1. Fill 1/3rd of the disk with a few large files.
2. Copy/merge the data into a new file, occupying another 1/3rd of the disk.
3. Repeat 1+2.

If this is repeated few times, the disk can see 100% of its space 
occupied (depending on how free space is allocated), even if from a 
user's perspective it is never more than 2/3rds full.

Maybe a simple countermeasure is to issue an fstrim every time we write 
10%-20% of the disk's capacity.
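Something like the sketch below is what I mean (the FITRIM ioctl that
fstrim(8) uses; note that IIRC FITRIM needs CAP_SYS_ADMIN, so it might have
to live in a privileged helper rather than in the server process itself):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* Sketch: trim all free space in the filesystem that fd lives on,
	 * which is what fstrim(8) does. */
	static int trim_filesystem(int fd)
	{
		struct fstrim_range r = {
			.start = 0,
			.len = UINT64_MAX,	/* whole filesystem */
			.minlen = 0,
		};

		return ioctl(fd, FITRIM, &r);
	}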


* Re: sleeps and waits during io_submit
  2015-12-02  9:02                                   ` Avi Kivity
@ 2015-12-02 12:57                                     ` Carlos Maiolino
  2015-12-02 23:19                                     ` Dave Chinner
  1 sibling, 0 replies; 58+ messages in thread
From: Carlos Maiolino @ 2015-12-02 12:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

> >>>>compatibility.
> >>>>
> >>>>>XFS spread files across the allocation groups, based on the directory these
> >>>>>files are created,
> >>>>Idea: create the files in some subdirectory, and immediately move
> >>>>them to their required location.
> >>>See xfs_fsr.
> >>Can you elaborate?  I don't see how it is applicable.
> >Just pointing out that this is what xfs_fsr does to control locality
> >of allocation for files it is defragmenting. Except that rather than
> >moving files, it uses XFS_IOC_SWAPEXT to switch the data between two
> >inodes atomically...
> 
> Ok, thanks.
> 
> >
> >>My hack involves creating the file in a random directory, and while
> >>it is still zero sized, move it to its final directory.  This is
> >>simply to defeat the ag selection heuristic.
> >Which you really don't want to do.
> 
> Why not?  For my directory structure, files in the same directory do not
> share temporal locality.  What does the ag selection heuristic give me?
> 
> 
> 
> >
> >>>>>  trying to keep files as close as possible from their
> >>>>>metadata.
> >>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on
> >>>>nonrotational media instead.
> >>>Actually, no, it is not pointless. SSDs do not require optimisation
> >>>for minimal seek time, but data locality is still just as important
> >>>as spinning disks, if not moreso. Why? Because the garbage
> >>>collection routines in the SSDs are all about locality and we can't
> >>>drive garbage collection effectively via discard operations if the
> >>>filesystem is not keeping temporally related files close together in
> >>>it's block address space.
> >>In my case, files in the same directory are not temporally related.
> >>But I understand where the heuristic comes from.
> >>
> >>Maybe an ioctl to set a directory attribute "the files in this
> >>directory are not temporally related"?
> >And exactly what does that gain us?
> 
> I have a directory with commitlog files that are constantly and rapidly
> being created, appended to, and removed, from all logical cores in the
> system.  Does this not put pressure on that allocation group's locks?
> 
> >Exactly what problem are you
> >trying to solve by manipulating file locality that can't be solved
> >by existing knobs and config options?
> 
> I admit I don't know much about the existing knobs and config options.
> Pointers are appreciated.
> 
> 
> >
> >Perhaps you'd like to read up on how the inode32 allocator behaves?
> 
> Indeed I would, pointers are appreciated.
> 
The inode32 mount option limits inode allocation to the AGs in the lower
part of the filesystem, instead of allocating inodes across all of the AGs.

This exists basically because inode numbers are based on where the inode is
allocated: inodes allocated beyond the first TB of the filesystem will have
64-bit inode numbers, which might cause compatibility problems with
applications that are not able to handle 64-bit inode numbers.





> >
> >>>e.g. If the files in a directory are all close together, and the
> >>>directory is removed, we then leave a big empty contiguous region in
> >>>the filesystem free space map, and when we send discards over that
> >>>we end up with a single big trim and the drive handles that far more
> >>Would this not be defeated if a directory that happens to share the
> >>allocation group gets populated simultaneously?
> >Sure. But this sort of thing is rare in the real world, and when
> >they do occur, it generally only takes small tweaks to algorithms
> >and layouts make them go away.  I don't care to bikeshed about
> >theoretical problems - I'm in the business of finding the root cause
> >of the problems users are having and solving those problems. So far
> >what you've given us is a ball of "there's blocking in AIO
> >submission", and the only one that is clear cut is the timestamp
> >update.
> >
> >Go back and categorise the types of blocking that you are seeing -
> >whether it be on the AGIs during inode manipulation, one the AGFs
> >becuse of concurrent extent allocation, on log forces because of
> >slow discards in the transcation completion, on the transaction
> >subsystem because of a lack of log space for concurrent
> >reservations, etc. And then determine if changing the layout of the
> >filesystem (e.g. number of AGs, size of log, etc) and different
> >mount options (e.g. turning off discard, using inode32 allocator,
> >etc) make any difference to the blocking issues you are seeing.
> >
> >Once we know which of the different algorithms is causing the
> >blocking issues, we'll know a lot more about why we're having
> >problems and a better idea of what problems we actually need to
> >solve.
> 
> I'm happy to hack off the lowest hanging fruit and then go after the next
> one.  I understand you're annoyed at having to defend against what may be
> non-problems; but for me it is an opportunity to learn about the file
> system.  For us it is the weakest spot in our system, because on the one
> hand we heavily depend on async behavior and on the other hand Linux is
> notoriously bad at it.  So we are very nervous when blocking happens.
> 
> >
> >>>effectively than lots of little trims (i.e. one per file) that the
> >>>drive cannot do anything useful with because they are all smaller
> >>>than the internal SSD page/block sizes and so get ignored.  This is
> >>>one of the reasons fstrim is so much more efficient and effective
> >>>than using the discard mount option.
> >>In my use case, the files are fairly large, and there is constant
> >>rewriting (not in-place: files are read, merged, and written back).
> >>So I'm worried an fstrim can happen too late.
> >Have you measured the SSD performance degradation over time due to
> >large overwrites? If not, then again it is a good chance you are
> >trying to solve a theoretical problem rather than a real problem....
> >
> 
> I'm not worried about that (maybe I should be) but about the SSD reaching
> internal ENOSPC due to the fstrim happening too late.
> 
> Consider this scenario, which is quite typical for us:
> 
> 1. Fill 1/3rd of the disk with a few large files.
> 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk.
> 3. Repeat 1+2.
> 
> If this is repeated few times, the disk can see 100% of its space occupied
> (depending on how free space is allocated), even if from a user's
> perspective it is never more than 2/3rds full.
> 
> Maybe a simple countermeasure is to issue an fstrim every time we write
> 10%-20% of the disk's capacity.
> 

-- 
Carlos


* Re: sleeps and waits during io_submit
  2015-12-02  9:02                                   ` Avi Kivity
  2015-12-02 12:57                                     ` Carlos Maiolino
@ 2015-12-02 23:19                                     ` Dave Chinner
  2015-12-03 12:52                                       ` Avi Kivity
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-02 23:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote:
> On 12/02/2015 01:06 AM, Dave Chinner wrote:
> >On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
> >>On 12/01/2015 11:19 PM, Dave Chinner wrote:
> >>>>>XFS spread files across the allocation groups, based on the directory these
> >>>>>files are created,
> >>>>Idea: create the files in some subdirectory, and immediately move
> >>>>them to their required location.
....
> >>My hack involves creating the file in a random directory, and while
> >>it is still zero sized, move it to its final directory.  This is
> >>simply to defeat the ag selection heuristic.
> >Which you really don't want to do.
> 
> Why not?  For my directory structure, files in the same directory do
> not share temporal locality.  What does the ag selection heuristic
> give me?

Wrong question. The right question is this: what problems does
subverting the AG selection heuristic cause me?

If you can't answer that question, then you can't quantify the risks
involved with making such a behavioural change.

> >>>>>  trying to keep files as close as possible from their
> >>>>>metadata.
> >>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on
> >>>>nonrotational media instead.
> >>>Actually, no, it is not pointless. SSDs do not require optimisation
> >>>for minimal seek time, but data locality is still just as important
> >>>as spinning disks, if not moreso. Why? Because the garbage
> >>>collection routines in the SSDs are all about locality and we can't
> >>>drive garbage collection effectively via discard operations if the
> >>>filesystem is not keeping temporally related files close together in
> >>>it's block address space.
> >>In my case, files in the same directory are not temporally related.
> >>But I understand where the heuristic comes from.
> >>
> >>Maybe an ioctl to set a directory attribute "the files in this
> >>directory are not temporally related"?
> >And exactly what does that gain us?
> 
> I have a directory with commitlog files that are constantly and
> rapidly being created, appended to, and removed, from all logical
> cores in the system.  Does this not put pressure on that allocation
> group's locks?

Not usually, because if an AG is contended, the allocation algorithm
skips the contended AG and selects the next uncontended AG to
allocate in. And given that the append algorithm used by the
allocator attempts to use the last block of the last extent as the
target for the new extent (i.e. contiguous allocation) once a file
has skipped to a different AG all allocations will continue in that
new AG until it is either full or it becomes contended....

IOWs, when AG contention occurs, the filesystem automatically
spreads out the load over multiple AGs. Put simply, we optimise for
locality first, but we're willing to compromise on locality to
minimise contention when it occurs. But, also, keep in mind that
in minimising contention we are still selecting the most local of
possible alternatives, and that's something you can't do in
userspace....

> >Exactly what problem are you
> >trying to solve by manipulating file locality that can't be solved
> >by existing knobs and config options?
> 
> I admit I don't know much about the existing knobs and config
> options.  Pointers are appreciated.

You can find some work in progress here:

https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/

looks like there's some problem with xfs.org wiki, so the links
to the user/training info on this page:

http://xfs.org/index.php/XFS_Papers_and_Documentation

aren't working.

> >Perhaps you'd like to read up on how the inode32 allocator behaves?
> 
> Indeed I would, pointers are appreciated.

Inode allocation section here:

https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc

> >Once we know which of the different algorithms is causing the
> >blocking issues, we'll know a lot more about why we're having
> >problems and a better idea of what problems we actually need to
> >solve.
> 
> I'm happy to hack off the lowest hanging fruit and then go after the
> next one.  I understand you're annoyed at having to defend against
> what may be non-problems; but for me it is an opportunity to learn
> about the file system.

No, I'm not annoyed. I just don't want to be chasing ghosts and so
we need to be on the same page about how to track down these issues.
And, believe me, you'll learn a lot about how the filesystem behaves
just by watching how the different configs react to the same
input...

> For us it is the weakest spot in our system,
> because on the one hand we heavily depend on async behavior and on
> the other hand Linux is notoriously bad at it.  So we are very
> nervous when blocking happens.

I can't disagree with you there - we really need to fix what we can
within the constraints of the OS first, then we once we have it
working as well as we can, then we can look to solving the remaining
"notoriously bad" AIO problems...

> >>>effectively than lots of little trims (i.e. one per file) that the
> >>>drive cannot do anything useful with because they are all smaller
> >>>than the internal SSD page/block sizes and so get ignored.  This is
> >>>one of the reasons fstrim is so much more efficient and effective
> >>>than using the discard mount option.
> >>In my use case, the files are fairly large, and there is constant
> >>rewriting (not in-place: files are read, merged, and written back).
> >>So I'm worried an fstrim can happen too late.
> >Have you measured the SSD performance degradation over time due to
> >large overwrites? If not, then again it is a good chance you are
> >trying to solve a theoretical problem rather than a real problem....
> >
> 
> I'm not worried about that (maybe I should be) but about the SSD
> reaching internal ENOSPC due to the fstrim happening too late.
> 
> Consider this scenario, which is quite typical for us:
> 
> 1. Fill 1/3rd of the disk with a few large files.
> 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk.
> 3. Repeat 1+2.
>
> If this is repeated few times, the disk can see 100% of its space
> occupied (depending on how free space is allocated), even if from a
> user's perspective it is never more than 2/3rds full.

I don't think that's true. SSD behaviour largely depends on how much
of the LBA space has been written to (i.e. marked used) and so that
metric tends to determine how the SSD behaves under such workloads.
This is one of the reasons that overprovisioning SSD space (e.g.
leaving 25% of the LBA space completely unused) results in better
performance under overwrite workloads - there's lots more scratch
space for the garbage collector to work with...

Hence as long as the filesystem is reusing the same LBA regions for
the files, TRIM will probably not make a significant difference to
performance because there's still 1/3rd of the LBA region that is
"unused". Hence the overwrites go into the unused 1/3rd of the SSD,
and the underlying SSD blocks associated with the "overwritten" LBA
region are immediately marked free, just like if you issued a trim
for that region before you start the overwrite.

With the way the XFS allocator works, it fills AGs from lowest to
highest blocks, and if you free lots of space down low in the AG
then that tends to get reused before the higher offset free space.
Hence the way XFS allocates space in the above workload would result in
roughly 1/3rd of the LBA space associated with the filesystem
remaining unused. This is another allocator behaviour designed for
spinning disks (to keep the data on the faster outer edges of
drives) that maps very well to internal SSD allocation/reclaim
algorithms....

FWIW, did you know that TRIM generally doesn't return the disk to
the performance of a pristine, empty disk?  Generally only a secure
erase will guarantee that a SSD returns to "empty disk" performance,
but that also removes all data from the entire SSD.  Hence the
baseline "sustained performance" you should be using is not "empty
disk" performance, but the performance once the disk has been
overwritten completely at least once. Only then will you tend to see
what effect TRIM will actually have.

> Maybe a simple countermeasure is to issue an fstrim every time we
> write 10%-20% of the disk's capacity.

Run the workload to steady state performance and measure the
degradation as it continues to run and overwrite the SSDs
repeatedly. To do this properly you are going to have to sacrifice
some SSDs, because you're going to need to overwrite them quite a
few times to get an idea of the degradation characteristics and
whether a periodic trim makes any difference or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-02 23:19                                     ` Dave Chinner
@ 2015-12-03 12:52                                       ` Avi Kivity
  2015-12-04  3:16                                         ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-03 12:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Glauber Costa, xfs



On 12/03/2015 01:19 AM, Dave Chinner wrote:
> On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote:
>> On 12/02/2015 01:06 AM, Dave Chinner wrote:
>>> On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 11:19 PM, Dave Chinner wrote:
>>>>>>> XFS spread files across the allocation groups, based on the directory these
>>>>>>> files are created,
>>>>>> Idea: create the files in some subdirectory, and immediately move
>>>>>> them to their required location.
> ....
>>>> My hack involves creating the file in a random directory, and while
>>>> it is still zero sized, move it to its final directory.  This is
>>>> simply to defeat the ag selection heuristic.
>>> Which you really don't want to do.
>> Why not?  For my directory structure, files in the same directory do
>> not share temporal locality.  What does the ag selection heuristic
>> give me?
> Wrong question. The right question is this: what problems does
> subverting the AG selection heuristic cause me?
>
> If you can't answer that question, then you can't quantify the risks
> involved with making such a behavioural change.

Okay.  Any hint about the answer to that question?


>
>>>>>>>   trying to keep files as close as possible from their
>>>>>>> metadata.
>>>>>> This is pointless for an SSD. Perhaps XFS should randomize the ag on
>>>>>> nonrotational media instead.
>>>>> Actually, no, it is not pointless. SSDs do not require optimisation
>>>>> for minimal seek time, but data locality is still just as important
>>>>> as spinning disks, if not moreso. Why? Because the garbage
>>>>> collection routines in the SSDs are all about locality and we can't
>>>>> drive garbage collection effectively via discard operations if the
>>>>> filesystem is not keeping temporally related files close together in
>>>>> it's block address space.
>>>> In my case, files in the same directory are not temporally related.
>>>> But I understand where the heuristic comes from.
>>>>
>>>> Maybe an ioctl to set a directory attribute "the files in this
>>>> directory are not temporally related"?
>>> And exactly what does that gain us?
>> I have a directory with commitlog files that are constantly and
>> rapidly being created, appended to, and removed, from all logical
>> cores in the system.  Does this not put pressure on that allocation
>> group's locks?
> Not usually, because if an AG is contended, the allocation algorithm
> skips the contended AG and selects the next uncontended AG to
> allocate in. And given that the append algorithm used by the
> allocator attempts to use the last block of the last extent as the
> target for the new extent (i.e. contiguous allocation) once a file
> has skipped to a different AG all allocations will continue in that
> new AG until it is either full or it becomes contended....
>
> IOWs, when AG contention occurs, the filesystem automatically
> spreads out the load over multiple AGs. Put simply, we optimise for
> locality first, but we're willing to compromise on locality to
> minimise contention when it occurs. But, also, keep in mind that
> in minimising contention we are still selecting the most local of
> possible alternatives, and that's something you can't do in
> userspace....

Cool.  I don't think "nearly-local" matters much for an SSD (it's either 
contiguous or it is not), but it's good to know that it's self-tuning 
wrt. contention.

In some good news, Glauber hacked our I/O engine not to throw so many 
concurrent I/Os at the filesystem, and indeed the contention was 
reduced.  So it's likely we were pushing the fs so hard that all the 
AGs were contended, but this is no longer the case.


>
>>> Exactly what problem are you
>>> trying to solve by manipulating file locality that can't be solved
>>> by existing knobs and config options?
>> I admit I don't know much about the existing knobs and config
>> options.  Pointers are appreciated.
> You can find some work in progress here:
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/
>
> looks like there's some problem with xfs.org wiki, so the links
> to the user/training info on this page:
>
> http://xfs.org/index.php/XFS_Papers_and_Documentation
>
> aren't working.
>
>>> Perhaps you'd like to read up on how the inode32 allocator behaves?
>> Indeed I would, pointers are appreciated.
> Inode allocation section here:
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc

Thanks for all the links, I'll study them and see what we can do to tune 
for our workload.

>>> Once we know which of the different algorithms is causing the
>>> blocking issues, we'll know a lot more about why we're having
>>> problems and a better idea of what problems we actually need to
>>> solve.
>> I'm happy to hack off the lowest hanging fruit and then go after the
>> next one.  I understand you're annoyed at having to defend against
>> what may be non-problems; but for me it is an opportunity to learn
>> about the file system.
> No, I'm not annoyed. I just don't want to be chasing ghosts and so
> we need to be on the same page about how to track down these issues.
> And, beleive me, you'll learn a lot about how the filesystem behaves
> just by watching how the different configs react to the same
> input...

Ok.  Looks like I have a lot of homework.


>
>> For us it is the weakest spot in our system,
>> because on the one hand we heavily depend on async behavior and on
>> the other hand Linux is notoriously bad at it.  So we are very
>> nervous when blocking happens.
> I can't disagree with you there - we really need to fix what we can
> within the constraints of the OS first, then we once we have it
> working as well as we can, then we can look to solving the remaining
> "notoriously bad" AIO problems...

There are lots of users who will be eternally grateful to you if you can 
get this fixed.  Linux has a very bad reputation in this area; the 
accepted wisdom is that you can only use aio reliably against block 
devices.  XFS comes very close, and it will make a huge impact if it can 
be used to do aio reliably, without a lot of constraints on the application.

>
>>>>> effectively than lots of little trims (i.e. one per file) that the
>>>>> drive cannot do anything useful with because they are all smaller
>>>>> than the internal SSD page/block sizes and so get ignored.  This is
>>>>> one of the reasons fstrim is so much more efficient and effective
>>>>> than using the discard mount option.
>>>> In my use case, the files are fairly large, and there is constant
>>>> rewriting (not in-place: files are read, merged, and written back).
>>>> So I'm worried an fstrim can happen too late.
>>> Have you measured the SSD performance degradation over time due to
>>> large overwrites? If not, then again it is a good chance you are
>>> trying to solve a theoretical problem rather than a real problem....
>>>
>> I'm not worried about that (maybe I should be) but about the SSD
>> reaching internal ENOSPC due to the fstrim happening too late.
>>
>> Consider this scenario, which is quite typical for us:
>>
>> 1. Fill 1/3rd of the disk with a few large files.
>> 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk.
>> 3. Repeat 1+2.
>>
>> If this is repeated few times, the disk can see 100% of its space
>> occupied (depending on how free space is allocated), even if from a
>> user's perspective it is never more than 2/3rds full.
> I don't think that's true. SSD behaviour largely depends on how much
> of the LBA space has been written to (i.e. marked used) and so that
> metric tends to determine how the SSD behaves under such workloads.
> This is one of the reasons that overprovisioning SSD space (e.g.
> leaving 25% of the LBA space completely unused) results in better
> performance under overwrite workloads - there's lots more scratch
> space for the garbage collector to work with...
>
> Hence as long as the filesystem is reusing the same LBA regions for
> the files, TRIM will probably not make a significant difference to
> performance because there's still 1/3rd of the LBA region that is
> "unused". Hence the overwrites go into the unused 1/3rd of the SSD,
> and the underlying SSD blocks associated with the "overwritten" LBA
> region are immediately marked free, just like if you issued a trim
> for that region before you start the overwrite.
>
> With the way the XFS allocator works, it fills AGs from lowest to
> highest blocks, and if you free lots of space down low in the AG
> then that tends to get reused before the higher offset free space.
> hence the XFS allocates space in the above workload would result in
> roughly 1/3rd of the LBA space associated with the filesystem
> remaining unused. This is another allocator behaviour designed for
> spinning disks (to keep the data on the faster outer edges of
> drives) that maps very well to internal SSD allocation/reclaim
> algorithms....

Cool.  So we'll keep fstrim usage to daily, or something similarly low.
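
Should we ever want to drive the trims from inside the server rather 
than from cron, fstrim boils down to the FITRIM ioctl on the mountpoint; 
a minimal sketch (names and the path are just illustrative, and the 
ioctl needs CAP_SYS_ADMIN):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Trim all free space on the filesystem mounted at "path";
 * roughly what "fstrim <path>" does. */
static int trim_mountpoint(const char *path)
{
        struct fstrim_range range = {
                .start  = 0,
                .len    = ULLONG_MAX,   /* whole filesystem */
                .minlen = 0,            /* let the fs pick a minimum */
        };
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;
        int ret = ioctl(fd, FITRIM, &range);
        if (ret == 0)
                printf("trimmed %llu bytes on %s\n",
                       (unsigned long long)range.len, path);
        close(fd);
        return ret;
}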

>
> FWIW, did you know that TRIM generally doesn't return the disk to
> the performance of a pristine, empty disk?  Generally only a secure
> erase will guarantee that a SSD returns to "empty disk" performance,
> but that also removes all data from then entire SSD.  Hence the
> baseline "sustained performance" you should be using is not "empty
> disk" performance, but the performance once the disk has been
> overwritten completely at least once. Only them will you tend to see
> what effect TRIM will actually have.

I did not know that.  Maybe that's another factor in why cloud SSDs are 
so slow.

>
>> Maybe a simple countermeasure is to issue an fstrim every time we
>> write 10%-20% of the disk's capacity.
> Run the workload to steady state performance and measure the
> degradation as it continues to run and overwrite the SSDs
> repeatedly. To do this properly you are going to have to sacrifice
> some SSDs, because you're going to need to overwrite them quite a
> few times to get an idea of the degradation characteristics and
> whether a periodic trim makes any difference or not.

Enterprise SSDs are guaranteed for something like N full writes / day 
for several years, are they not?  So such a test can take weeks or 
months, depending on the ratio between disk size and bandwidth.

Still, I guess it has to be done.


* Re: sleeps and waits during io_submit
  2015-12-03 12:52                                       ` Avi Kivity
@ 2015-12-04  3:16                                         ` Dave Chinner
  2015-12-08 13:52                                           ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-04  3:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Thu, Dec 03, 2015 at 02:52:08PM +0200, Avi Kivity wrote:
> 
> 
> On 12/03/2015 01:19 AM, Dave Chinner wrote:
> >On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote:
> >>On 12/02/2015 01:06 AM, Dave Chinner wrote:
> >>>On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
> >>>>On 12/01/2015 11:19 PM, Dave Chinner wrote:
> >>>>>>>XFS spread files across the allocation groups, based on the directory these
> >>>>>>>files are created,
> >>>>>>Idea: create the files in some subdirectory, and immediately move
> >>>>>>them to their required location.
> >....
> >>>>My hack involves creating the file in a random directory, and while
> >>>>it is still zero sized, move it to its final directory.  This is
> >>>>simply to defeat the ag selection heuristic.
> >>>Which you really don't want to do.
> >>Why not?  For my directory structure, files in the same directory do
> >>not share temporal locality.  What does the ag selection heuristic
> >>give me?
> >Wrong question. The right question is this: what problems does
> >subverting the AG selection heuristic cause me?
> >
> >If you can't answer that question, then you can't quantify the risks
> >involved with making such a behavioural change.
> 
> Okay.  Any hint about the answer to that question?

If your file set is randomly distributed across the filesystem, then
it's quite likely that the filesystem will use all of the LBA space
rather than reusing the same AGs and hence LBA regions. That's going
to slowly fragment free space as metadata (which has different
lifetimes to data) and long term data gets more widely distributed.
That, in turn, will slowly result in the working dataset being made
up of more and smaller extents, which will also slowly get more
distributed over time, which then means allocation and freeing of
extents takes longer, trim becomes less effective because it's
working with smaller spaces, the SSD's "LBA in use" mapping becomes
more fragmented so garbage collection becomes harder, etc...

But, really, the only way to tell is to test, measure, observe and
analyse....

> >>>>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on
> >>>>>>nonrotational media instead.
> >>>>>Actually, no, it is not pointless. SSDs do not require optimisation
> >>>>>for minimal seek time, but data locality is still just as important
> >>>>>as spinning disks, if not moreso. Why? Because the garbage
> >>>>>collection routines in the SSDs are all about locality and we can't
> >>>>>drive garbage collection effectively via discard operations if the
> >>>>>filesystem is not keeping temporally related files close together in
> >>>>>it's block address space.
> >>>>In my case, files in the same directory are not temporally related.
> >>>>But I understand where the heuristic comes from.
> >>>>
> >>>>Maybe an ioctl to set a directory attribute "the files in this
> >>>>directory are not temporally related"?
> >>>And exactly what does that gain us?
> >>I have a directory with commitlog files that are constantly and
> >>rapidly being created, appended to, and removed, from all logical
> >>cores in the system.  Does this not put pressure on that allocation
> >>group's locks?
> >Not usually, because if an AG is contended, the allocation algorithm
> >skips the contended AG and selects the next uncontended AG to
> >allocate in. And given that the append algorithm used by the
> >allocator attempts to use the last block of the last extent as the
> >target for the new extent (i.e. contiguous allocation) once a file
> >has skipped to a different AG all allocations will continue in that
> >new AG until it is either full or it becomes contended....
> >
> >IOWs, when AG contention occurs, the filesystem automatically
> >spreads out the load over multiple AGs. Put simply, we optimise for
> >locality first, but we're willing to compromise on locality to
> >minimise contention when it occurs. But, also, keep in mind that
> >in minimising contention we are still selecting the most local of
> >possible alternatives, and that's something you can't do in
> >userspace....
> 
> Cool.  I don't think "nearly-local" matters much for an SSD (it's
> either contiguous or it is not), but it's good to know that it's
> self-tuning wrt. contention.

"Nearly local" matters a lot for filesystem free space management
and hence minimising the amount of LBA space the filesystem actually
uses in the long term given a relatively predictable workload....

> In some good news, Glauber hacked our I/O engine not to throw so
> many concurrent I/Os at the filesystem, and indeed so the contention
> reduced.  So it's likely we were pushing the fs so hard all the ags
> were contended, but this is no longer the case.

What is the xfs_info output of the filesystem you tested on?

> >With the way the XFS allocator works, it fills AGs from lowest to
> >highest blocks, and if you free lots of space down low in the AG
> >then that tends to get reused before the higher offset free space.
> >hence the XFS allocates space in the above workload would result in
> >roughly 1/3rd of the LBA space associated with the filesystem
> >remaining unused. This is another allocator behaviour designed for
> >spinning disks (to keep the data on the faster outer edges of
> >drives) that maps very well to internal SSD allocation/reclaim
> >algorithms....
> 
> Cool.  So we'll keep fstrim usage to daily, or something similarly low.

Well, it's something you'll need to monitor to determine what the
best frequency is, as even fstrim doesn't come for free (esp. if the
storage does not support queued TRIM commands).

> >FWIW, did you know that TRIM generally doesn't return the disk to
> >the performance of a pristine, empty disk?  Generally only a secure
> >erase will guarantee that a SSD returns to "empty disk" performance,
> >but that also removes all data from then entire SSD.  Hence the
> >baseline "sustained performance" you should be using is not "empty
> >disk" performance, but the performance once the disk has been
> >overwritten completely at least once. Only them will you tend to see
> >what effect TRIM will actually have.
> 
> I did not know that.  Maybe that's another factor in why cloud SSDs
> are so slow.

Have a look at the random write performance consistency graphs for
the different enterprise SSDs here:

http://www.anandtech.com/show/9430/micron-m510dc-480gb-enterprise-sata-ssd-review/3

You'll see just how different sustained write load performance is to
the empty drive performance (which is only the first few hundred
seconds of each graph) across the different drives that have been
tested. The next page has similar results for mixed random
read/write workloads....

That will give you a good idea of how the current enterprise SSDs
behave under sustained write load. It's a *lot* better than the way
the 1st and 2nd generation drives performed....

> >>write 10%-20% of the disk's capacity.
> >Run the workload to steady state performance and measure the
> >degradation as it continues to run and overwrite the SSDs
> >repeatedly. To do this properly you are going to have to sacrifice
> >some SSDs, because you're going to need to overwrite them quite a
> >few times to get an idea of the degradation characteristics and
> >whether a periodic trim makes any difference or not.
> 
> Enterprise SSDs are guaranteed for something like N full writes /
> day for several years, are they not?

Yes, usually somewhere between 3-15 DWPD (Drive Writes Per Day) - it
typically works out at around 5000 full drive write cycles for
enterprise drives.  However, at either the low capacity end of the
scale or the high performance end (i.e. pcie cards capable of multiple
GB/s writes), it's not uncommon to be able to burn a DW cycle in
under 10 minutes and so you can easily burn the life out of a drive
in a couple of weeks of intense testing....
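
To put rough numbers on that (illustrative only, not any particular
drive):

    400GB drive / 2GB/s sustained writes  ~ 200s per drive-write cycle
    5000 cycles * 200s                    ~ 10^6 s, i.e. 11-12 days
                                            of continuous writing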

> So such a test can take weeks
> or months, depending on the ratio between disk size and bandwidth.
> Still, I guess it has to be done.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-02  8:34                               ` Avi Kivity
@ 2015-12-08  6:03                                 ` Dave Chinner
  2015-12-08 13:56                                   ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-08  6:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs

On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote:
> On 12/02/2015 02:13 AM, Brian Foster wrote:
> >Metadata is modified in-core and handed off to the logging
> >infrastructure via a transaction. The log is flushed to disk some time
> >later and metadata writeback occurs asynchronously via the xfsaild
> >thread.
> 
> Unless, I expect, if the log is full.  Since we're hammering on the
> disk quite heavily, the log would be fighting with user I/O and
> possibly losing.
> 
> Does XFS throttle user I/O in order to get the log buffers recycled faster?

No. XFS tags the metadata IO with REQ_META so that the IO schedulers
can tell the difference between metadata and data IO, and schedule
them appropriately. Further, log buffers are also tagged with
REQ_SYNC to indicate they are latency sensitive IOs, which the IO
schedulers again treat differently to minimise latency in the face
of bulk async IO which is not latency sensitive.

IOWs, IO prioritisation and dispatch scheduling is the job of the IO
scheduler, not the filesystem. The filesystem just tells the
scheduler how to treat the different types of IO...

> Is there any way for us to keep track of it, and reduce disk
> pressure when it gets full?

Only if you want to make more problems for yourself - second
guessing what the filesystem is going to do will only lead you to
dancing the Charlie Foxtrot on a regular basis. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-04  3:16                                         ` Dave Chinner
@ 2015-12-08 13:52                                           ` Avi Kivity
  2015-12-08 23:13                                             ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-08 13:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Glauber Costa, xfs



On 12/04/2015 05:16 AM, Dave Chinner wrote:
> On Thu, Dec 03, 2015 at 02:52:08PM +0200, Avi Kivity wrote:
>>
>> On 12/03/2015 01:19 AM, Dave Chinner wrote:
>>> On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote:
>>>> On 12/02/2015 01:06 AM, Dave Chinner wrote:
>>>>> On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
>>>>>> On 12/01/2015 11:19 PM, Dave Chinner wrote:
>>>>>>>>> XFS spread files across the allocation groups, based on the directory these
>>>>>>>>> files are created,
>>>>>>>> Idea: create the files in some subdirectory, and immediately move
>>>>>>>> them to their required location.
>>> ....
>>>>>> My hack involves creating the file in a random directory, and while
>>>>>> it is still zero sized, move it to its final directory.  This is
>>>>>> simply to defeat the ag selection heuristic.
>>>>> Which you really don't want to do.
>>>> Why not?  For my directory structure, files in the same directory do
>>>> not share temporal locality.  What does the ag selection heuristic
>>>> give me?
>>> Wrong question. The right question is this: what problems does
>>> subverting the AG selection heuristic cause me?
>>>
>>> If you can't answer that question, then you can't quantify the risks
>>> involved with making such a behavioural change.
>> Okay.  Any hint about the answer to that question?
> If your file set is randomly distributed across the filesystem,

I think that happens whether or not I break the "files in the same 
directory are related" heuristic, because I have many directories. It's 
just that some of them get churned more than others.

>   then
> it's quite likely that the filesystem will use all of the LBA space
> rather than reusing the same AGs and hence LBA regions. That's going
> to slowly fragment free space as metadata (which has different
> lifetimes to data) and long term data gets more widely distributed.
> That, in term will slowly result in the working dataset being made
> up of more and smaller extents, whcih will also slowly get more
> distributed over time, which them means allocation and freeing of
> extents takes longer, trim becomes less effective because it's
> workingwith smaller spaces, the SSD's "LBA in use" mapping becomes
> more fragmented so garbage collection becomes harder, etc...
>
> But, really, the only way to tell is to test, measure, observe and
> analyse....

Sure.

>
>>>>>>>> This is pointless for an SSD. Perhaps XFS should randomize the ag on
>>>>>>>> nonrotational media instead.
>>>>>>> Actually, no, it is not pointless. SSDs do not require optimisation
>>>>>>> for minimal seek time, but data locality is still just as important
>>>>>>> as spinning disks, if not moreso. Why? Because the garbage
>>>>>>> collection routines in the SSDs are all about locality and we can't
>>>>>>> drive garbage collection effectively via discard operations if the
>>>>>>> filesystem is not keeping temporally related files close together in
>>>>>>> it's block address space.
>>>>>> In my case, files in the same directory are not temporally related.
>>>>>> But I understand where the heuristic comes from.
>>>>>>
>>>>>> Maybe an ioctl to set a directory attribute "the files in this
>>>>>> directory are not temporally related"?
>>>>> And exactly what does that gain us?
>>>> I have a directory with commitlog files that are constantly and
>>>> rapidly being created, appended to, and removed, from all logical
>>>> cores in the system.  Does this not put pressure on that allocation
>>>> group's locks?
>>> Not usually, because if an AG is contended, the allocation algorithm
>>> skips the contended AG and selects the next uncontended AG to
>>> allocate in. And given that the append algorithm used by the
>>> allocator attempts to use the last block of the last extent as the
>>> target for the new extent (i.e. contiguous allocation) once a file
>>> has skipped to a different AG all allocations will continue in that
>>> new AG until it is either full or it becomes contended....
>>>
>>> IOWs, when AG contention occurs, the filesystem automatically
>>> spreads out the load over multiple AGs. Put simply, we optimise for
>>> locality first, but we're willing to compromise on locality to
>>> minimise contention when it occurs. But, also, keep in mind that
>>> in minimising contention we are still selecting the most local of
>>> possible alternatives, and that's something you can't do in
>>> userspace....
>> Cool.  I don't think "nearly-local" matters much for an SSD (it's
>> either contiguous or it is not), but it's good to know that it's
>> self-tuning wrt. contention.
> "Nearly local" matters a lot for filesystem free space management
> and hence minimising the amount o LBA space the filesystem actually
> uses in the long term given a relatively predicatable workload....
>
>> In some good news, Glauber hacked our I/O engine not to throw so
>> many concurrent I/Os at the filesystem, and indeed so the contention
>> reduced.  So it's likely we were pushing the fs so hard all the ags
>> were contended, but this is no longer the case.
> What is the xfs_info output of the filesystem you tested on?

It was a cloud disk so someone else now has the pleasure...

>
>>> With the way the XFS allocator works, it fills AGs from lowest to
>>> highest blocks, and if you free lots of space down low in the AG
>>> then that tends to get reused before the higher offset free space.
>>> hence the XFS allocates space in the above workload would result in
>>> roughly 1/3rd of the LBA space associated with the filesystem
>>> remaining unused. This is another allocator behaviour designed for
>>> spinning disks (to keep the data on the faster outer edges of
>>> drives) that maps very well to internal SSD allocation/reclaim
>>> algorithms....
>> Cool.  So we'll keep fstrim usage to daily, or something similarly low.
> Well, it's something you'll need to monitor to determine what the
> best frequency is, as even fstrim doesn't come for free (esp. if the
> storage does not support queued TRIM commands).

I was able to trigger a load where discard caused io_submit to sleep 
even on my super-fast nvme drive.

The bad news is, disabling discard and running fstrim in parallel with 
this load also caused io_submit to sleep.

>
>>> FWIW, did you know that TRIM generally doesn't return the disk to
>>> the performance of a pristine, empty disk?  Generally only a secure
>>> erase will guarantee that a SSD returns to "empty disk" performance,
>>> but that also removes all data from then entire SSD.  Hence the
>>> baseline "sustained performance" you should be using is not "empty
>>> disk" performance, but the performance once the disk has been
>>> overwritten completely at least once. Only them will you tend to see
>>> what effect TRIM will actually have.
>> I did not know that.  Maybe that's another factor in why cloud SSDs
>> are so slow.
> Have a look at the random write performance consistency graphs for
> the different enterprise SSDs here:
>
> http://www.anandtech.com/show/9430/micron-m510dc-480gb-enterprise-sata-ssd-review/3
>
> You'll see just how different sustained write load performance is to
> the empty drive performance (which is only the first few hundred
> seconds of each graph) across the different drives that have been
> tested. The next page has similar results for mixed random
> read/write workloads....
>
> That will give you a good idea of how the current enterprise SSDs
> behave under sustained write load. It's a *lot* better than the way
> the 1st and 2nd generation drives performed....
>
>>>> write 10%-20% of the disk's capacity.
>>> Run the workload to steady state performance and measure the
>>> degradation as it continues to run and overwrite the SSDs
>>> repeatedly. To do this properly you are going to have to sacrifice
>>> some SSDs, because you're going to need to overwrite them quite a
>>> few times to get an idea of the degradation characteristics and
>>> whether a periodic trim makes any difference or not.
>> Enterprise SSDs are guaranteed for something like N full writes /
>> day for several years, are they not?
> Yes, usually somewhere between 3-15 DWPD (Drive Writes Per Day) - it
> typically works out at around 5000 full drive write cycles for
> enterprise drives.  However, at both the low capacity end of the
> scale or the high performance end (i.e. pcie cards capable of multiple
> GB/s writes), it's not uncommon to be able to burn a DW cycle in
> under 10 minutes and so you can easily burn the life out of a drive
> in a couple of weeks of intense testing....
>
>> So such a test can take weeks
>> or months, depending on the ratio between disk size and bandwidth.
>> Still, I guess it has to be done.
> *nod*
>
> Cheers,
>
> Dave.


* Re: sleeps and waits during io_submit
  2015-12-08  6:03                                 ` Dave Chinner
@ 2015-12-08 13:56                                   ` Avi Kivity
  2015-12-08 23:32                                     ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Avi Kivity @ 2015-12-08 13:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs



On 12/08/2015 08:03 AM, Dave Chinner wrote:
> On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote:
>> On 12/02/2015 02:13 AM, Brian Foster wrote:
>>> Metadata is modified in-core and handed off to the logging
>>> infrastructure via a transaction. The log is flushed to disk some time
>>> later and metadata writeback occurs asynchronously via the xfsaild
>>> thread.
>> Unless, I expect, if the log is full.  Since we're hammering on the
>> disk quite heavily, the log would be fighting with user I/O and
>> possibly losing.
>>
>> Does XFS throttle user I/O in order to get the log buffers recycled faster?
> No. XFS tags the metadata IO with REQ_META that the IO schedulers
> can tell the difference between metadata and data IO, and schedule
> them appropriately. Further. log buffers are also tagged with
> REQ_SYNC to indicate they are latency sensitive IOs, whcih the IO
> schedulers again treat differently to minimise latency in the face
> of bulk async IO which is not latency sensitive.
>
> IOWs, IO prioritisation and dispatch scheduling is the job of the IO
> scheduler, not the filesystem. The filesystem just tells the
> scheduler how to treat the different types of IO...
>
>> Is there any way for us to keep track of it, and reduce disk
>> pressure when it gets full?
> Only if you want to make more problems for yourself - second
> guessing what the filesystem is going to do will only lead you to
> dancing the Charlie Foxtrot on a regular basis. :/

So far the best approach I found that doesn't conflict with this is to 
limit io_submit iodepth to the natural disk iodepth (or a small multiple 
thereof).  This seems to keep XFS in its comfort zone, and is good for 
latency anyway.

The only issue is that the only way to obtain this parameter is to 
measure it.

I wrote a small tool to do this [1], but it's a hassle for users.

[1] https://github.com/avikivity/diskplorer
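
For what it's worth, the limiting itself is just a counting gate in
front of io_submit; a rough sketch with libaio (the helper names are
made up, the context is assumed to have been set up with io_setup(),
and a real reactor would reap completions from its event loop with a
zero timeout rather than blocking here):

#include <libaio.h>

/* Never keep more than 'limit' I/Os in flight, where 'limit' is
 * roughly the device's effective queue depth (e.g. as measured with
 * diskplorer). */
struct io_gate {
        io_context_t ctx;       /* from io_setup() */
        unsigned     inflight;
        unsigned     limit;
};

static int gate_submit(struct io_gate *g, struct iocb *iocb)
{
        struct io_event ev;

        /* At the limit: reap at least one completion before submitting
         * more, so io_submit() itself is far less likely to block. */
        while (g->inflight >= g->limit) {
                int n = io_getevents(g->ctx, 1, 1, &ev, NULL);
                if (n < 0)
                        return n;
                g->inflight -= n;
                /* handle the completed event 'ev' here */
        }

        int ret = io_submit(g->ctx, 1, &iocb);
        if (ret == 1)
                g->inflight++;
        return ret;
}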


* Re: sleeps and waits during io_submit
  2015-12-08 13:52                                           ` Avi Kivity
@ 2015-12-08 23:13                                             ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2015-12-08 23:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Glauber Costa, xfs

On Tue, Dec 08, 2015 at 03:52:52PM +0200, Avi Kivity wrote:
> >>>With the way the XFS allocator works, it fills AGs from lowest to
> >>>highest blocks, and if you free lots of space down low in the AG
> >>>then that tends to get reused before the higher offset free space.
> >>>hence the XFS allocates space in the above workload would result in
> >>>roughly 1/3rd of the LBA space associated with the filesystem
> >>>remaining unused. This is another allocator behaviour designed for
> >>>spinning disks (to keep the data on the faster outer edges of
> >>>drives) that maps very well to internal SSD allocation/reclaim
> >>>algorithms....
> >>Cool.  So we'll keep fstrim usage to daily, or something similarly low.
> >Well, it's something you'll need to monitor to determine what the
> >best frequency is, as even fstrim doesn't come for free (esp. if the
> >storage does not support queued TRIM commands).
> 
> I was able to trigger a load where discard caused io_submit to sleep
> even on my super-fast nvme drive.
> 
> The bad news is, disabling discard and running fstrim in parallel
> with this load also caused io_submit to sleep.

Well, yes.  fstrim is not a magic bullet that /prevents/ discard
from interrupting your application's IO - it's just a method under
which the impact can be /somewhat controlled/ as it can be scheduled
for periods where it causes minimal disruption (e.g. when
load is likely to be light, such as at 3am just before nightly
backups are run).

Regardless, it sounds like your steady state load could be described
as "throwing as much IO as we possible can at the device", but you
are then then having "blocking trouble" when maintenance (expensive)
operations like TRIM need to be are run. I'm not sure this
"blocking" can be prevented completely, because it assumes that you
have a device of infinite IO capacity.

That is, if you exceed the device's command queue depth and the IO
scheduler request queue depth, the block layer will block in the IO
scheduler waiting for a request queue slot to come free. Put simply:
if you overload the IO subsystem, it will block.  There's nothing we
can do in the filesystem about this - this is the way the block
layer works, and it's architected this way to provide the necessary
feedback control for buffered write IO throttling and other
congestion control mechanisms in the kernel.

Sure, you can set the IO scheduler request queue depth to be really
deep to avoid blocking, but this then simply increases your average
and worst-case IO latency in overload situations. At some point you
have to consider the IO subsystem is overloaded and the application
driving it needs to back off. Something has to block when this
happens...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-08 13:56                                   ` Avi Kivity
@ 2015-12-08 23:32                                     ` Dave Chinner
  2015-12-09  8:37                                       ` Avi Kivity
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2015-12-08 23:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs

On Tue, Dec 08, 2015 at 03:56:52PM +0200, Avi Kivity wrote:
> On 12/08/2015 08:03 AM, Dave Chinner wrote:
> >On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote:
> >>On 12/02/2015 02:13 AM, Brian Foster wrote:
> >>>Metadata is modified in-core and handed off to the logging
> >>>infrastructure via a transaction. The log is flushed to disk some time
> >>>later and metadata writeback occurs asynchronously via the xfsaild
> >>>thread.
> >>Unless, I expect, if the log is full.  Since we're hammering on the
> >>disk quite heavily, the log would be fighting with user I/O and
> >>possibly losing.
> >>
> >>Does XFS throttle user I/O in order to get the log buffers recycled faster?
> >No. XFS tags the metadata IO with REQ_META that the IO schedulers
> >can tell the difference between metadata and data IO, and schedule
> >them appropriately. Further. log buffers are also tagged with
> >REQ_SYNC to indicate they are latency sensitive IOs, whcih the IO
> >schedulers again treat differently to minimise latency in the face
> >of bulk async IO which is not latency sensitive.
> >
> >IOWs, IO prioritisation and dispatch scheduling is the job of the IO
> >scheduler, not the filesystem. The filesystem just tells the
> >scheduler how to treat the different types of IO...
> >
> >>Is there any way for us to keep track of it, and reduce disk
> >>pressure when it gets full?
> >Only if you want to make more problems for yourself - second
> >guessing what the filesystem is going to do will only lead you to
> >dancing the Charlie Foxtrot on a regular basis. :/
> 
> So far the best approach I found that doesn't conflict with this is
> to limit io_submit iodepth to the natural disk iodepth (or a small
> multiple thereof).  This seems to keep XFS in its comfort zone, and
> is good for latency anyway.

That's pretty much what I just explained in my previous reply.  ;)

> The only issue is that the only way to obtain this parameter is to
> measure it.

Yup, exactly what I've been saying ;)

However, you can get a pretty good guess at max concurrency from the
device characteristics in sysfs:

/sys/block/<dev>/queue/nr_requests

gives you the maximum IO scheduler request queue depth, and

/sys/block/<dev>/device/queue_depth

gives you the hardware command queue depth.

E.g. a random iscsi device I have attached to a test VM:

$ cat /sys/block/sdc/device/queue_depth 
32
$ cat /sys/block/sdc/queue/nr_requests
127

Which means 32 physical IOs can be in flight concurrently, and the
IO scheduler will queue up to roughly another 100 discrete IOs
before it starts blocking incoming IO requests (127 is the typical
io scheduler queue depth default). That means maximum non-blocking
concurrency is going to be around 100-130 IOs in flight at once.
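
If it helps, here's a small sketch of turning those two sysfs values
into a starting-point limit (the device name and fallback values are
just examples; not all devices expose queue_depth):

#include <stdio.h>

/* Read a single integer from a sysfs file; return 'fallback' if the
 * file is missing or unreadable. */
static long read_sysfs_long(const char *path, long fallback)
{
        long val = fallback;
        FILE *f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = fallback;
                fclose(f);
        }
        return val;
}

int main(void)
{
        long nr_requests =
                read_sysfs_long("/sys/block/sdc/queue/nr_requests", 128);
        long queue_depth =
                read_sysfs_long("/sys/block/sdc/device/queue_depth", 32);

        /* Roughly how many IOs can be outstanding before the block
         * layer starts blocking new submissions. */
        printf("estimated non-blocking concurrency: ~%ld\n",
               nr_requests + queue_depth);
        return 0;
}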

> I wrote a small tool to do this [1], but it's a hassle for users.
> 
> [1] https://github.com/avikivity/diskplorer

I note that the NVMe device you tested in the description hits
maximum performance with concurrency at around 110-120 read IOs in
flight. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: sleeps and waits during io_submit
  2015-12-08 23:32                                     ` Dave Chinner
@ 2015-12-09  8:37                                       ` Avi Kivity
  0 siblings, 0 replies; 58+ messages in thread
From: Avi Kivity @ 2015-12-09  8:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs



On 12/09/2015 01:32 AM, Dave Chinner wrote:
> On Tue, Dec 08, 2015 at 03:56:52PM +0200, Avi Kivity wrote:
>> On 12/08/2015 08:03 AM, Dave Chinner wrote:
>>> On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote:
>>>> On 12/02/2015 02:13 AM, Brian Foster wrote:
>>>>> Metadata is modified in-core and handed off to the logging
>>>>> infrastructure via a transaction. The log is flushed to disk some time
>>>>> later and metadata writeback occurs asynchronously via the xfsaild
>>>>> thread.
>>>> Unless, I expect, if the log is full.  Since we're hammering on the
>>>> disk quite heavily, the log would be fighting with user I/O and
>>>> possibly losing.
>>>>
>>>> Does XFS throttle user I/O in order to get the log buffers recycled faster?
>>> No. XFS tags the metadata IO with REQ_META that the IO schedulers
>>> can tell the difference between metadata and data IO, and schedule
>>> them appropriately. Further. log buffers are also tagged with
>>> REQ_SYNC to indicate they are latency sensitive IOs, whcih the IO
>>> schedulers again treat differently to minimise latency in the face
>>> of bulk async IO which is not latency sensitive.
>>>
>>> IOWs, IO prioritisation and dispatch scheduling is the job of the IO
>>> scheduler, not the filesystem. The filesystem just tells the
>>> scheduler how to treat the different types of IO...
>>>
>>>> Is there any way for us to keep track of it, and reduce disk
>>>> pressure when it gets full?
>>> Only if you want to make more problems for yourself - second
>>> guessing what the filesystem is going to do will only lead you to
>>> dancing the Charlie Foxtrot on a regular basis. :/
>> So far the best approach I found that doesn't conflict with this is
>> to limit io_submit iodepth to the natural disk iodepth (or a small
>> multiple thereof).  This seems to keep XFS in its comfort zone, and
>> is good for latency anyway.
> That's pretty much what I just explained in my previous reply.  ;)
>
>> The only issue is that the only way to obtain this parameter is to
>> measure it.
> Yup, exactly what I've been saying ;)
>
> However, You can get a pretty good guess on max concurrency from the
> device characteristics in sysfs:
>
> /sys/block/<dev>/queue/nr_requests

That's just a fixed number. AFAICT, it isn't derived from the actual device.

"measure it" is better than nothing, but when you want to distribute 
software that works out of the box and does not need extensive tuning, 
it leaves something to be desired.

I'm thinking about detecting the limit dynamically (below the limit, 
throughput is roughly proportional to concurrency; above the limit, 
throughput is fixed while latency is proportional to concurrency). The 
problem is that the measurement is very noisy, the more so because we 
are two layers above the hardware, and driving it from cores that try 
very hard not to communicate.

The right place to do this is the block layer.
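
In the meantime, the detection could look something like the sketch
below (purely hypothetical: measure_iops() stands in for a short
benchmark run at a given depth, and in practice each point would need
to be repeated and averaged because of the noise mentioned above):

typedef double (*measure_fn)(unsigned depth);

/* Keep doubling the I/O depth while throughput still scales with it;
 * stop once doubling the depth buys less than ~20% more IOPS, i.e. we
 * are past the device's effective queue depth. */
static unsigned find_concurrency_knee(measure_fn measure_iops)
{
        unsigned depth = 1;
        double prev = measure_iops(depth);

        while (depth < 65536) {         /* sanity bound */
                double cur = measure_iops(depth * 2);
                if (cur < prev * 1.2)
                        break;
                prev = cur;
                depth *= 2;
        }
        return depth;
}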

> gives you the maximum IO scheduler request queue depth, and
>
> /sys/block/<dev>/device/queue_depth
>
> gives you the hardware command queue depth.

That's more useful, but it really describes the bus/link/protocol rather 
than the device itself.

I don't have this queue_depth attribute for my nvme0n1 device (4.1.7).

>
> E.g. a random iscsi device I have attached to a test VM:
>
> $ cat /sys/block/sdc/device/queue_depth
> 32
> $ cat /sys/block/sdc/queue/nr_requests
> 127
>
> Which means 32 physical IOs can be in flight concurrently, and the
> IO scheduler will queue up to roughly another 100 discrete IOs
> before it starts blocking incoming IO requests (127 is the typical
> io scheduler queue depth default). That means maximum non-blocking
> concurrency is going to be around 100-130 IOs in flight at once.
>
>> I wrote a small tool to do this [1], but it's a hassle for users.
>>
>> [1] https://github.com/avikivity/diskplorer
> I note that the NVMe device you tested in the description hits
> maximum performance with concurrency at around 110-120 read IOs in
> flight. :)
>
>

We increased nr_requests for the test so it wouldn't block.  So it's the 
actual device characteristics, not an artifact of the software stack.  
If you consider a RAID of these, you can easily need a few hundred 
concurrent ops.

IIRC nvme's maximum iodepth is 64k.


end of thread, other threads:[~2015-12-09  8:37 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-28  2:43 sleeps and waits during io_submit Glauber Costa
2015-11-30 14:10 ` Brian Foster
2015-11-30 14:29   ` Avi Kivity
2015-11-30 16:14     ` Brian Foster
2015-12-01  9:08       ` Avi Kivity
2015-12-01 13:11         ` Brian Foster
2015-12-01 13:58           ` Avi Kivity
2015-12-01 14:01             ` Glauber Costa
2015-12-01 14:37               ` Avi Kivity
2015-12-01 20:45               ` Dave Chinner
2015-12-01 20:56                 ` Avi Kivity
2015-12-01 23:41                   ` Dave Chinner
2015-12-02  8:23                     ` Avi Kivity
2015-12-01 14:56             ` Brian Foster
2015-12-01 15:22               ` Avi Kivity
2015-12-01 16:01                 ` Brian Foster
2015-12-01 16:08                   ` Avi Kivity
2015-12-01 16:29                     ` Brian Foster
2015-12-01 17:09                       ` Avi Kivity
2015-12-01 18:03                         ` Carlos Maiolino
2015-12-01 19:07                           ` Avi Kivity
2015-12-01 21:19                             ` Dave Chinner
2015-12-01 21:38                               ` Avi Kivity
2015-12-01 23:06                                 ` Dave Chinner
2015-12-02  9:02                                   ` Avi Kivity
2015-12-02 12:57                                     ` Carlos Maiolino
2015-12-02 23:19                                     ` Dave Chinner
2015-12-03 12:52                                       ` Avi Kivity
2015-12-04  3:16                                         ` Dave Chinner
2015-12-08 13:52                                           ` Avi Kivity
2015-12-08 23:13                                             ` Dave Chinner
2015-12-01 18:51                         ` Brian Foster
2015-12-01 19:07                           ` Glauber Costa
2015-12-01 19:35                             ` Brian Foster
2015-12-01 19:45                               ` Avi Kivity
2015-12-01 19:26                           ` Avi Kivity
2015-12-01 19:41                             ` Christoph Hellwig
2015-12-01 19:50                               ` Avi Kivity
2015-12-02  0:13                             ` Brian Foster
2015-12-02  0:57                               ` Dave Chinner
2015-12-02  8:38                                 ` Avi Kivity
2015-12-02  8:34                               ` Avi Kivity
2015-12-08  6:03                                 ` Dave Chinner
2015-12-08 13:56                                   ` Avi Kivity
2015-12-08 23:32                                     ` Dave Chinner
2015-12-09  8:37                                       ` Avi Kivity
2015-12-01 21:04                 ` Dave Chinner
2015-12-01 21:10                   ` Glauber Costa
2015-12-01 21:39                     ` Dave Chinner
2015-12-01 21:24                   ` Avi Kivity
2015-12-01 21:31                     ` Glauber Costa
2015-11-30 15:49   ` Glauber Costa
2015-12-01 13:11     ` Brian Foster
2015-12-01 13:39       ` Glauber Costa
2015-12-01 14:02         ` Brian Foster
2015-11-30 23:10 ` Dave Chinner
2015-11-30 23:51   ` Glauber Costa
2015-12-01 20:30     ` Dave Chinner
